corpus

package
v0.0.0-...-2db8e3e

This package is not in the latest version of its module.

Published: Apr 18, 2018 License: MIT Imports: 17 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CombineInts

func CombineInts(ints []int) int

CombineInts takes an int slice and tries to combine it into a single integer. It takes advantage of a property of English: any number of 1000 or more follows a repeated pattern, e.g.

one hundred and fifty thousand two hundred and two

Here there are two repeated patterns: (one hundred and fifty) and (two hundred and two).

This allows us to repeatedly combine by addition or multiplication until a single integer is left.
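The combining idea can be illustrated with a minimal sketch. This is not the package's implementation: `combineSketch` is a hypothetical name, and the sketch assumes the input holds plain number-word values (units, tens, 100, and multipliers of 1000 or more) in spoken order.

```go
package main

import "fmt"

// combineSketch is a hypothetical illustration, not the package's CombineInts.
// It folds number-word values into one integer by adding small values into a
// running group, multiplying the group by 100 when it sees "hundred", and
// flushing the group into the total when it sees a multiplier >= 1000.
func combineSketch(ints []int) int {
	total, group := 0, 0
	for _, n := range ints {
		switch {
		case n >= 1000: // thousand, million, ...
			if group == 0 {
				group = 1 // a bare "thousand" means one thousand
			}
			total += group * n
			group = 0
		case n == 100:
			if group == 0 {
				group = 1 // a bare "hundred" means one hundred
			}
			group *= n
		default: // units and tens are added
			group += n
		}
	}
	return total + group
}

func main() {
	// one hundred (and) fifty thousand two hundred (and) two
	fmt.Println(combineSketch([]int{1, 100, 50, 1000, 2, 100, 2})) // 150202
}
```

Each group of at most three digits is built by the same add-and-multiply steps, which is the repeated pattern described above.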

func CosineSimilarity

func CosineSimilarity(a, b []string) float64

CosineSimilarity measures the cosine similarity of two string slices.
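As a rough sketch of what cosine similarity over string slices can look like, the example below treats each slice as a bag of tokens and compares the resulting count vectors. Whether the package counts tokens this way is an assumption; `cosineSketch` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSketch is a hypothetical illustration, not the package's
// CosineSimilarity. It treats a and b as bags of tokens and returns the
// cosine of the angle between their token-count vectors.
func cosineSketch(a, b []string) float64 {
	fa, fb := map[string]float64{}, map[string]float64{}
	for _, t := range a {
		fa[t]++
	}
	for _, t := range b {
		fb[t]++
	}
	var dot, na, nb float64
	for t, x := range fa {
		dot += x * fb[t] // tokens missing from b contribute 0
		na += x * x
	}
	for _, y := range fb {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0 // an empty slice has no direction; report no similarity
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosineSketch([]string{"the", "cat"}, []string{"the", "dog"}))
}
```

The result is 0 for slices with no tokens in common and approaches 1 as the token counts become proportional.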

func DamerauLevenshtein

func DamerauLevenshtein(s1 string, s2 string) (distance int)

DamerauLevenshtein calculates the Damerau-Levenshtein distance between two strings. See more at https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
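For orientation, here is a minimal sketch of the restricted variant of this distance (optimal string alignment), which counts insertions, deletions, substitutions, and transpositions of adjacent characters. Which variant the package implements is not stated here, so treat this as an assumption; `osaDistance` and `min3` are hypothetical names.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osaDistance is a hypothetical illustration of the optimal string alignment
// distance, a restricted form of Damerau-Levenshtein. It compares runes, so
// multi-byte characters count as one edit.
func osaDistance(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i // delete i runes
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j // insert j runes
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			d[i][j] = min3(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			// Transposition of two adjacent runes.
			if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				if t := d[i-2][j-2] + 1; t < d[i][j] {
					d[i][j] = t
				}
			}
		}
	}
	return d[len(a)][len(b)]
}

func main() {
	fmt.Println(osaDistance("ab", "ba"))          // 1
	fmt.Println(osaDistance("kitten", "sitting")) // 3
}
```

The transposition step is what separates this family from plain Levenshtein distance, where swapping "ab" to "ba" would cost 2.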

func LongestCommonPrefix

func LongestCommonPrefix(strs ...string) string

LongestCommonPrefix takes any number of strings and returns their longest common prefix.
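One simple way to compute such a prefix is to start from the first string and shrink it until every other string begins with it. The sketch below shows that approach under the name `lcpSketch` (hypothetical). It works on bytes, so it assumes ASCII input; the package may handle multi-byte characters differently.

```go
package main

import (
	"fmt"
	"strings"
)

// lcpSketch is a hypothetical illustration, not the package's
// LongestCommonPrefix. It shrinks a candidate prefix one byte at a time until
// all strings start with it. Byte-wise trimming assumes ASCII input.
func lcpSketch(strs ...string) string {
	if len(strs) == 0 {
		return ""
	}
	prefix := strs[0]
	for _, s := range strs[1:] {
		// Every string has "" as a prefix, so this loop always terminates.
		for !strings.HasPrefix(s, prefix) {
			prefix = prefix[:len(prefix)-1]
		}
	}
	return prefix
}

func main() {
	fmt.Println(lcpSketch("interstellar", "internet", "interval")) // inter
}
```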

func Pluralize

func Pluralize(word string) string

Pluralize pluralizes words based on known rules.

func Singularize

func Singularize(word string) string

Singularize singularizes words based on known rules.

func StrsToInts

func StrsToInts(strs []string) (retVal []int, err error)

StrsToInts converts a string slice into an int slice, with the help of NumberWords. The function assumes all helper words like "and" have been stripped.

"One hundred and five" -> []string{"one", "hundred", "five"}

This is a very primitive method, and it does not handle other phrasings such as "a hundred" or "a couple of hundred".
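The lookup step can be sketched as a table of number words consulted once per input word. The table below is a tiny hypothetical stand-in, not the package's NumberWords, and `strsToIntsSketch` is likewise a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// numberWords is a small hypothetical table for illustration only; the
// package's own NumberWords is assumed to be far more complete.
var numberWords = map[string]int{
	"one": 1, "two": 2, "five": 5, "fifty": 50,
	"hundred": 100, "thousand": 1000,
}

// strsToIntsSketch is a hypothetical illustration, not the package's
// StrsToInts. It maps each word to its value and fails on the first word
// that is not a number word (helper words such as "and" must be stripped).
func strsToIntsSketch(strs []string) ([]int, error) {
	out := make([]int, 0, len(strs))
	for _, s := range strs {
		n, ok := numberWords[strings.ToLower(s)]
		if !ok {
			return nil, fmt.Errorf("unknown number word %q", s)
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	ints, err := strsToIntsSketch([]string{"one", "hundred", "five"})
	fmt.Println(ints, err) // [1 100 5] <nil>
}
```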

func ViterbiSplit

func ViterbiSplit(input string, c *Corpus) []string

ViterbiSplit uses the Viterbi algorithm to split the input into words, given a corpus.
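A Viterbi-style split can be sketched as dynamic programming over split points: for each position, keep the most probable segmentation of the text up to that position. The sketch below is an assumption about the general technique, not the package's code. It uses a plain frequency map instead of a *Corpus, works on bytes (ASCII input), expects positive counts and maxLen >= 1, and gives unknown chunks a small length-based penalty; `splitSketch` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// splitSketch is a hypothetical illustration of Viterbi word splitting.
// freq holds positive word counts; maxLen bounds the length of a candidate
// word, in the spirit of a longest-known-word limit.
func splitSketch(input string, freq map[string]int, maxLen int) []string {
	total := 0
	for _, f := range freq {
		total += f
	}
	logTotal := math.Log(float64(total))
	n := len(input)
	best := make([]float64, n+1) // best[i]: best log-probability of input[:i]
	back := make([]int, n+1)     // back[i]: start of the last word in that split
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		start := i - maxLen
		if start < 0 {
			start = 0
		}
		for j := start; j < i; j++ {
			var logP float64
			if f, ok := freq[input[j:i]]; ok {
				logP = math.Log(float64(f)) - logTotal
			} else {
				// Assumed penalty: unknown chunks get rarer with length.
				logP = -logTotal - float64(i-j)*math.Log(10)
			}
			if s := best[j] + logP; s > best[i] {
				best[i], back[i] = s, j
			}
		}
	}
	// Walk the back pointers from the end to recover the words.
	var words []string
	for i := n; i > 0; i = back[i] {
		words = append([]string{input[back[i]:i]}, words...)
	}
	return words
}

func main() {
	freq := map[string]int{"the": 50, "cat": 10, "sat": 10, "a": 20}
	fmt.Println(splitSketch("thecatsat", freq, 3)) // [the cat sat]
}
```

Working in log space avoids underflow when many small probabilities are multiplied.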

Types

type ConsOpt

type ConsOpt func(c *Corpus) error

ConsOpt is a construction option for manual creation of a Corpus

func WithSize

func WithSize(size int) ConsOpt

WithSize preallocates the internal data structures of the Corpus to the given size.

func WithWords

func WithWords(a []string) ConsOpt

WithWords creates a corpus from the words in the slice a.

type Corpus

type Corpus struct {
	// contains filtered or unexported fields
}

Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary that maps each word to an ID for lookup. This is very useful because neural networks rely on the IDs rather than on the text itself.

func Construct

func Construct(opts ...ConsOpt) (*Corpus, error)

Construct creates a Corpus from the given construction options. This allows more flexibility in how a Corpus is set up.
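The ConsOpt and Construct signatures follow the functional options pattern. The sketch below shows how that pattern typically works, using a hypothetical `vocab` type and lowercase stand-ins (`option`, `withSize`, `withWords`, `construct`). The bodies of the real options are not shown in this documentation, so everything inside them is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// vocab is a hypothetical stand-in for Corpus.
type vocab struct {
	words []string
	ids   map[string]int
}

// option mirrors the shape of ConsOpt: a function that configures the value
// under construction and may report an error.
type option func(*vocab) error

// withSize preallocates storage for size words.
func withSize(size int) option {
	return func(v *vocab) error {
		if size < 0 {
			return errors.New("size must not be negative")
		}
		v.words = make([]string, 0, size)
		v.ids = make(map[string]int, size)
		return nil
	}
}

// withWords adds each distinct word and assigns IDs in order of appearance.
func withWords(ws []string) option {
	return func(v *vocab) error {
		for _, w := range ws {
			if _, ok := v.ids[w]; !ok {
				v.ids[w] = len(v.words)
				v.words = append(v.words, w)
			}
		}
		return nil
	}
}

// construct applies the options in order and stops at the first error.
func construct(opts ...option) (*vocab, error) {
	v := &vocab{ids: make(map[string]int)}
	for _, opt := range opts {
		if err := opt(v); err != nil {
			return nil, err
		}
	}
	return v, nil
}

func main() {
	v, err := construct(withSize(8), withWords([]string{"a", "b", "a"}))
	fmt.Println(len(v.words), err) // 2 <nil>
}
```

Because options are applied in order, a preallocating option is best passed before options that add words.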

func GenerateCorpus

func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus

GenerateCorpus creates a Corpus given a set of SentenceTag from a training set.

func New

func New() *Corpus

New creates a new *Corpus

func (*Corpus) Add

func (c *Corpus) Add(word string) int

Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID
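The behavior described for Add can be sketched with a word list, an ID map, and parallel frequency counts. The `vocab` type and its fields are hypothetical; the real Corpus fields are unexported and not shown here.

```go
package main

import "fmt"

// vocab is a hypothetical stand-in for Corpus.
type vocab struct {
	words []string       // ID -> word
	ids   map[string]int // word -> ID
	freqs []int          // ID -> number of times the word was added
	total int            // number of words ever added, repeats included
}

func newVocab() *vocab {
	return &vocab{ids: make(map[string]int)}
}

// add returns the ID of word. A new word gets the next free ID; a known word
// only has its frequency count incremented.
func (v *vocab) add(word string) int {
	v.total++
	if id, ok := v.ids[word]; ok {
		v.freqs[id]++
		return id
	}
	id := len(v.words)
	v.ids[word] = id
	v.words = append(v.words, word)
	v.freqs = append(v.freqs, 1)
	return id
}

func main() {
	v := newVocab()
	fmt.Println(v.add("the"), v.add("cat"), v.add("the")) // 0 1 0
	fmt.Println(v.freqs[0], v.total)                      // 2 3
}
```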

func (*Corpus) GobDecode

func (c *Corpus) GobDecode(buf []byte) error

GobDecode implements GobDecoder for *Corpus

func (*Corpus) GobEncode

func (c *Corpus) GobEncode() ([]byte, error)

GobEncode implements GobEncoder for *Corpus

func (*Corpus) IDFreq

func (c *Corpus) IDFreq(id int) int

IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.

func (*Corpus) Id

func (c *Corpus) Id(word string) (int, bool)

Id returns the ID of a word and whether or not it was found in the corpus.

func (*Corpus) LoadOneGram

func (c *Corpus) LoadOneGram(r io.Reader) error

LoadOneGram loads a 1_gram.txt file, a tab-separated file that lists the frequency counts of words. Example:

the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709
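Reading a file in this format can be sketched with a line scanner that splits each line at the tab. The sketch returns a plain map rather than filling a Corpus, and `loadOneGramSketch` and `sample` are hypothetical names. Counts are parsed as int64 because values such as 23135851162 do not fit in 32 bits.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// loadOneGramSketch is a hypothetical illustration, not the package's
// LoadOneGram. It reads "word<TAB>count" lines into a map.
func loadOneGramSketch(r io.Reader) (map[string]int64, error) {
	freqs := make(map[string]int64)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue // skip blank lines
		}
		parts := strings.SplitN(line, "\t", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		n, err := strconv.ParseInt(strings.TrimSpace(parts[1]), 10, 64)
		if err != nil {
			return nil, err
		}
		freqs[parts[0]] = n
	}
	return freqs, sc.Err()
}

// sample returns a reader over the first lines of the example above.
func sample() io.Reader {
	return strings.NewReader("the\t23135851162\nof\t13151942776\nand\t12997637966\n")
}

func main() {
	freqs, err := loadOneGramSketch(sample())
	fmt.Println(freqs["the"], len(freqs), err) // 23135851162 3 <nil>
}
```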

func (*Corpus) MaxWordLength

func (c *Corpus) MaxWordLength() int

MaxWordLength returns the length of the longest known word in the corpus.

func (*Corpus) Merge

func (c *Corpus) Merge(other *Corpus)

Merge combines two corpuses. The receiver is the one that is mutated.

func (*Corpus) Size

func (c *Corpus) Size() int

Size returns the size of the corpus.

func (*Corpus) TotalFreq

func (c *Corpus) TotalFreq() int

TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.

func (*Corpus) Word

func (c *Corpus) Word(id int) (string, bool)

Word returns the word given the ID, and whether or not it was found in the corpus

func (*Corpus) WordFreq

func (c *Corpus) WordFreq(word string) int

WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.

func (*Corpus) WordProb

func (c *Corpus) WordProb(word string) (float64, bool)

WordProb returns the probability of a word appearing in the corpus.
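One natural reading of this probability is the word's frequency divided by the total frequency of all words. That reading is an assumption, and the sketch below uses a plain map and a hypothetical name, `wordProbSketch`, rather than the Corpus type.

```go
package main

import "fmt"

// wordProbSketch is a hypothetical illustration, not the package's WordProb.
// It returns the relative frequency of word and whether the word is known.
func wordProbSketch(freqs map[string]int, word string) (float64, bool) {
	f, ok := freqs[word]
	if !ok {
		return 0, false
	}
	total := 0
	for _, n := range freqs {
		total += n
	}
	if total == 0 {
		return 0, true // known word, but no counts recorded
	}
	return float64(f) / float64(total), true
}

func main() {
	freqs := map[string]int{"the": 3, "cat": 1}
	fmt.Println(wordProbSketch(freqs, "the")) // 0.75 true
	fmt.Println(wordProbSketch(freqs, "dog")) // 0 false
}
```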

type LDAModel

type LDAModel struct {
	// params
	Alpha tensor.Tensor   // is a Row
	Eta   tensor.Tensor   // is a Col
	Kappa gorgonia.Scalar // Decay
	Tau0  gorgonia.Scalar // offset

	// parameters needed for working
	Topics      int
	ChunkSize   int
	Terms       int
	UpdateEvery int
	EvalEvery   int

	// consts
	Iterations     int
	GammaThreshold float64

	MinimumProb float64

	// track current progress
	Updates int

	// type
	Dtype tensor.Dtype
}

LDAModel ... TODO https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
