# corpus

package
Version: v0.0.0-...-2db8e3e
Published: Apr 18, 2018 License: MIT

## Documentation ¶

### Constants ¶

This section is empty.

### Variables ¶

This section is empty.

### Functions ¶

#### func CombineInts ¶

`func CombineInts(ints []int) int`

CombineInts takes an int slice and tries to combine it into one integer. It works by taking advantage of a property of English: any number above 1000 is written as a repeated pattern, e.g.

```
one hundred and fifty thousand two hundred and two
```

There are two repeated patterns here: (one hundred and fifty) and (two hundred and two).

This allows us to repeatedly combine by addition or multiplication until only one integer is left.
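The combining idea can be sketched as follows. This is a minimal illustration, not the package's actual implementation: `combineInts` is a hypothetical helper, and it assumes the input uses 100 as the only in-group multiplier and powers of 1000 as group scales.

```go
package main

import "fmt"

// combineInts is a hypothetical stand-in for CombineInts.
// It keeps a running group (values below 1000) and a running total.
// Small values are added to the group, 100 multiplies the group,
// and a scale of 1000 or more multiplies the group and flushes it
// into the total.
func combineInts(ints []int) int {
	total, group := 0, 0
	for _, n := range ints {
		switch {
		case n >= 1000:
			if group == 0 {
				group = 1 // a bare "thousand" means one thousand
			}
			total += group * n
			group = 0
		case n == 100:
			if group == 0 {
				group = 1 // a bare "hundred" means one hundred
			}
			group *= n
		default:
			group += n
		}
	}
	return total + group
}

func main() {
	// one hundred (and) fifty thousand two hundred (and) two
	fmt.Println(combineInts([]int{1, 100, 50, 1000, 2, 100, 2})) // prints 150202
}
```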

#### func CosineSimilarity ¶

`func CosineSimilarity(a, b []string) float64`

CosineSimilarity measures the cosine similarity of two string slices.
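For intuition, cosine similarity over two token slices can be sketched like this. The sketch assumes each slice is treated as a bag of tokens with raw counts as vector components; how the package itself weights the tokens is not stated here, so `cosineSimilarity` is a hypothetical helper rather than the library's code.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity treats a and b as bags of tokens and returns the
// cosine of the angle between their count vectors. It returns 0 when
// either slice is empty.
func cosineSimilarity(a, b []string) float64 {
	countA := make(map[string]float64)
	countB := make(map[string]float64)
	for _, w := range a {
		countA[w]++
	}
	for _, w := range b {
		countB[w]++
	}

	var dot, normA, normB float64
	for w, ca := range countA {
		dot += ca * countB[w] // missing tokens contribute 0
		normA += ca * ca
	}
	for _, cb := range countB {
		normB += cb * cb
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	a := []string{"the", "cat", "sat"}
	b := []string{"the", "cat", "ran"}
	// Two of three tokens are shared, so the score is about 2/3.
	fmt.Printf("%.3f\n", cosineSimilarity(a, b))
}
```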

#### func DamerauLevenshtein ¶

`func DamerauLevenshtein(s1 string, s2 string) (distance int)`

DamerauLevenshtein calculates the Damerau-Levenshtein distance between two strings. See more at https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
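As an illustration of the kind of computation involved, the sketch below implements the restricted variant of the distance, usually called optimal string alignment. It is not taken from the package, which may implement the unrestricted variant; `osaDistance` and `minInt` are hypothetical names.

```go
package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// osaDistance returns the optimal string alignment distance between
// s1 and s2: the number of insertions, deletions, substitutions and
// adjacent transpositions needed, where no substring is edited twice.
func osaDistance(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i // delete i runes
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j // insert j runes
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := minInt(d[i-1][j]+1, d[i][j-1]+1) // deletion, insertion
			best = minInt(best, d[i-1][j-1]+cost)    // substitution or match
			if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				best = minInt(best, d[i-2][j-2]+1) // adjacent transposition
			}
			d[i][j] = best
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(osaDistance("ca", "ac"))          // prints 1
	fmt.Println(osaDistance("kitten", "sitting")) // prints 3
}
```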

#### func LongestCommonPrefix ¶

`func LongestCommonPrefix(strs ...string) string`

LongestCommonPrefix takes any number of strings and returns their longest common prefix.
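One straightforward way to compute such a prefix is to shrink a candidate as each string is compared against it. The sketch below is an assumption about the approach, not the package's code; `longestCommonPrefix` is a hypothetical helper that compares rune by rune so that multi-byte characters are never cut in half.

```go
package main

import "fmt"

// longestCommonPrefix returns the longest prefix shared by all of
// strs. It returns "" when strs is empty.
func longestCommonPrefix(strs ...string) string {
	if len(strs) == 0 {
		return ""
	}
	prefix := []rune(strs[0])
	for _, s := range strs[1:] {
		r := []rune(s)
		n := 0
		for n < len(prefix) && n < len(r) && prefix[n] == r[n] {
			n++
		}
		prefix = prefix[:n] // keep only the part that still matches
	}
	return string(prefix)
}

func main() {
	fmt.Println(longestCommonPrefix("interstellar", "internet", "interval")) // prints inter
}
```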

#### func Pluralize ¶

`func Pluralize(word string) string`

Pluralize pluralizes words based on known rules.

#### func Singularize ¶

`func Singularize(word string) string`

Singularize singularizes words based on known rules.

#### func StrsToInts ¶

`func StrsToInts(strs []string) (retVal []int, err error)`

StrsToInts converts a string slice into an int slice, with the help of NumberWords. The function assumes all helper words like "and" have been stripped.

```
"One hundred and five" -> []string{"one", "hundred", "five"}
```

This is a very primitive method, and it doesn't take into account other phrasings like "a hundred" or "a couple of hundred".
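The word-by-word lookup described above might look roughly like the following. The `numberWords` table here is a small hypothetical stand-in for the package's NumberWords, whose actual type and contents are not shown in this documentation, and `strsToInts` is likewise only a sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// numberWords is a hypothetical, deliberately tiny lookup table.
var numberWords = map[string]int{
	"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
	"ten": 10, "twenty": 20, "fifty": 50,
	"hundred": 100, "thousand": 1000,
}

// strsToInts maps each word to its value and fails on the first word
// it does not know. Helper words such as "and" must already have been
// removed by the caller.
func strsToInts(strs []string) ([]int, error) {
	out := make([]int, 0, len(strs))
	for _, s := range strs {
		n, ok := numberWords[strings.ToLower(s)]
		if !ok {
			return nil, fmt.Errorf("unknown number word %q", s)
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	ints, err := strsToInts([]string{"one", "hundred", "five"})
	fmt.Println(ints, err) // prints [1 100 5] <nil>
}
```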

#### func ViterbiSplit ¶

`func ViterbiSplit(input string, c *Corpus) []string`

ViterbiSplit uses the Viterbi algorithm to split the input into words, given a corpus.
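To make the idea concrete, here is a minimal sketch of Viterbi-style segmentation over unigram frequencies. It is not the package's implementation: `viterbiSplit` is a hypothetical helper that takes a plain frequency map instead of a *Corpus, assumes ASCII input, and gives unknown single characters a small made-up penalty probability so that every input can be segmented.

```go
package main

import (
	"fmt"
	"math"
)

// viterbiSplit finds the segmentation of input that maximises the sum
// of log unigram probabilities. best[i] holds the best score for
// input[:i] and back[i] the start of the last word in that solution.
func viterbiSplit(input string, freq map[string]int) []string {
	total, maxLen := 0, 1
	for w, f := range freq {
		total += f
		if len(w) > maxLen {
			maxLen = len(w)
		}
	}
	if total == 0 {
		total = 1
	}
	// Assumed penalty for a single unknown character.
	unknown := math.Log(1 / (10 * float64(total)))

	n := len(input)
	best := make([]float64, n+1)
	back := make([]int, n+1)
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		for j := i - 1; j >= 0 && i-j <= maxLen; j-- {
			var lp float64
			if f, ok := freq[input[j:i]]; ok {
				lp = math.Log(float64(f) / float64(total))
			} else if i-j == 1 {
				lp = unknown
			} else {
				continue // unknown multi-character chunk
			}
			if score := best[j] + lp; score > best[i] {
				best[i], back[i] = score, j
			}
		}
	}

	// Walk the back pointers, then reverse into reading order.
	var words []string
	for i := n; i > 0; i = back[i] {
		words = append(words, input[back[i]:i])
	}
	for l, r := 0, len(words)-1; l < r; l, r = l+1, r-1 {
		words[l], words[r] = words[r], words[l]
	}
	return words
}

func main() {
	freq := map[string]int{"the": 50, "a": 30, "cat": 10, "sat": 10}
	fmt.Println(viterbiSplit("thecatsat", freq)) // prints [the cat sat]
}
```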

### Types ¶

#### type ConsOpt ¶

`type ConsOpt func(c *Corpus) error`

ConsOpt is a construction option for manual creation of a Corpus

#### func WithSize ¶

`func WithSize(size int) ConsOpt`

WithSize preallocates the internal storage of the Corpus for the given size.

#### func WithWords ¶

`func WithWords(a []string) ConsOpt`

WithWords creates a corpus from the word slice `a`.

#### type Corpus ¶

```go
type Corpus struct {
	// contains filtered or unexported fields
}
```

Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary with IDs for lookup. This is very useful because neural networks rely on the IDs rather than on the text itself.

#### func Construct ¶

`func Construct(opts ...ConsOpt) (*Corpus, error)`

Construct creates a Corpus given the construction options. This allows for more flexibility

#### func GenerateCorpus ¶

`func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus`

GenerateCorpus creates a Corpus given a set of SentenceTag from a training set.

#### func New ¶

`func New() *Corpus`

New creates a new *Corpus

#### func (*Corpus) Add ¶

`func (c *Corpus) Add(word string) int`

Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID

#### func (*Corpus) GobDecode ¶

`func (c *Corpus) GobDecode(buf []byte) error`

GobDecode implements GobDecoder for *Corpus

#### func (*Corpus) GobEncode ¶

`func (c *Corpus) GobEncode() ([]byte, error)`

GobEncode implements GobEncoder for *Corpus

#### func (*Corpus) IDFreq ¶

`func (c *Corpus) IDFreq(id int) int`

IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.

#### func (*Corpus) Id ¶

`func (c *Corpus) Id(word string) (int, bool)`

Id returns the ID of a word and whether or not it was found in the corpus.

#### func (*Corpus) LoadOneGram ¶

`func (c *Corpus) LoadOneGram(r io.Reader) error`

LoadOneGram loads a 1_gram.txt file, a tab-separated file that lists the frequency counts of words. Example:

```
the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
```
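Reading a file in this format is a matter of splitting each line on the tab and parsing the count. The sketch below shows one way to do that with the standard library; `parseOneGram` and `parseOneGramLine` are hypothetical helpers that fill a plain map, not the package's LoadOneGram. Counts are parsed as 64-bit integers because values such as 23135851162 do not fit in 32 bits.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// parseOneGramLine splits "word<TAB>count" into its two parts.
func parseOneGramLine(line string) (string, int64, error) {
	fields := strings.Split(line, "\t")
	if len(fields) != 2 {
		return "", 0, fmt.Errorf("malformed line %q", line)
	}
	n, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad count in line %q: %v", line, err)
	}
	return fields[0], n, nil
}

// parseOneGram reads every non-empty line from r into a frequency map.
func parseOneGram(r io.Reader) (map[string]int64, error) {
	freqs := make(map[string]int64)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		word, n, err := parseOneGramLine(line)
		if err != nil {
			return nil, err
		}
		freqs[word] = n
	}
	return freqs, sc.Err()
}

func main() {
	sample := "the\t23135851162\nof\t13151942776\n"
	freqs, err := parseOneGram(strings.NewReader(sample))
	fmt.Println(freqs["the"], freqs["of"], err)
}
```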

#### func (*Corpus) MaxWordLength ¶

`func (c *Corpus) MaxWordLength() int`

MaxWordLength returns the length of the longest known word in the corpus.

#### func (*Corpus) Merge ¶

`func (c *Corpus) Merge(other *Corpus)`

Merge combines two corpora. The receiver is the one that is mutated.

#### func (*Corpus) Size ¶

`func (c *Corpus) Size() int`

Size returns the size of the corpus.

#### func (*Corpus) TotalFreq ¶

`func (c *Corpus) TotalFreq() int`

TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.

#### func (*Corpus) Word ¶

`func (c *Corpus) Word(id int) (string, bool)`

Word returns the word given the ID, and whether or not it was found in the corpus

#### func (*Corpus) WordFreq ¶

`func (c *Corpus) WordFreq(word string) int`

WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.

#### func (*Corpus) WordProb ¶

`func (c *Corpus) WordProb(word string) (float64, bool)`

WordProb returns the probability of a word appearing in the corpus.
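The relationship between Add, Id, Word, WordFreq, TotalFreq and WordProb can be pictured with a toy vocabulary. The `vocab` type below is a hypothetical sketch, not the package's Corpus: it assumes IDs are handed out sequentially from 0 and that a word's probability is its frequency divided by the total frequency, neither of which is confirmed by this documentation.

```go
package main

import "fmt"

// vocab is a toy word <-> ID table with frequency counts.
type vocab struct {
	words []string       // ID -> word
	ids   map[string]int // word -> ID
	freqs []int          // ID -> frequency
	total int            // total number of words seen, repeats included
}

func newVocab() *vocab { return &vocab{ids: make(map[string]int)} }

// Add records one occurrence of word and returns its ID.
func (v *vocab) Add(word string) int {
	v.total++
	if id, ok := v.ids[word]; ok {
		v.freqs[id]++
		return id
	}
	id := len(v.words)
	v.words = append(v.words, word)
	v.freqs = append(v.freqs, 1)
	v.ids[word] = id
	return id
}

// Id returns the ID of word and whether it is known.
func (v *vocab) Id(word string) (int, bool) {
	id, ok := v.ids[word]
	return id, ok
}

// Word returns the word stored under id and whether id is valid.
func (v *vocab) Word(id int) (string, bool) {
	if id < 0 || id >= len(v.words) {
		return "", false
	}
	return v.words[id], true
}

// WordFreq returns how often word was added, or 0 if unknown.
func (v *vocab) WordFreq(word string) int {
	if id, ok := v.ids[word]; ok {
		return v.freqs[id]
	}
	return 0
}

// WordProb returns frequency / total for known words.
func (v *vocab) WordProb(word string) (float64, bool) {
	id, ok := v.ids[word]
	if !ok {
		return 0, false
	}
	return float64(v.freqs[id]) / float64(v.total), true
}

func main() {
	v := newVocab()
	for _, w := range []string{"the", "cat", "the"} {
		v.Add(w)
	}
	p, _ := v.WordProb("the")
	fmt.Println(v.WordFreq("the"), v.total, p)
}
```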

#### type LDAModel ¶

```go
type LDAModel struct {
	// params
	Alpha tensor.Tensor   // is a Row
	Eta   tensor.Tensor   // is a Col
	Kappa gorgonia.Scalar // Decay
	Tau0  gorgonia.Scalar // offset

	// parameters needed for working
	Topics      int
	ChunkSize   int
	Terms       int
	UpdateEvery int
	EvalEvery   int

	// consts
	Iterations     int
	GammaThreshold float64

	MinimumProb float64

	// track current progress

	// type
	Dtype tensor.Dtype
}
```

LDAModel ... TODO https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation