corpus

package

v0.0.0-...-2db8e3e Published: Apr 18, 2018 License: MIT Imports: 17 Imported by: 0

Repository

github.com/ynqa/lingo

Documentation ¶

Index ¶

func CombineInts(ints []int) int
func CosineSimilarity(a, b []string) float64
func DamerauLevenshtein(s1 string, s2 string) (distance int)
func LongestCommonPrefix(strs ...string) string
func Pluralize(word string) string
func Singularize(word string) string
func StrsToInts(strs []string) (retVal []int, err error)
func ViterbiSplit(input string, c *Corpus) []string
type ConsOpt
- func WithSize(size int) ConsOpt
- func WithWords(a []string) ConsOpt
type Corpus
type LDAModel

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CombineInts ¶

func CombineInts(ints []int) int

CombineInts takes an int slice and tries to combine it into one integer. It works by taking advantage of a regularity in English: any number larger than 1000 is spoken as a repeated pattern, e.g.

one hundred and fifty thousand two hundred and two

Here there are 2 repeated patterns: (one hundred and fifty) and (two hundred and two).

This allows us to repeatedly combine by addition or multiplication until only one integer is left.
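The combining rule described above can be sketched as follows. This is a minimal illustration and not the package's implementation: the helper name combineInts and the exact grouping rule (values below 100 add, 100 multiplies the current group, 1000 or more closes the group) are assumptions.

```go
package main

// combineInts folds the values of number words into one integer.
// Assumed rule (not taken from the package source): values below 100 are
// added to the current group, 100 multiplies the group, and any value of
// 1000 or more scales the group and closes it.
func combineInts(ints []int) int {
	total, group := 0, 0
	for _, v := range ints {
		switch {
		case v >= 1000:
			if group == 0 {
				group = 1 // a bare "thousand" is read as "one thousand"
			}
			total += group * v
			group = 0
		case v == 100:
			if group == 0 {
				group = 1 // a bare "hundred" is read as "one hundred"
			}
			group *= 100
		default:
			group += v
		}
	}
	return total + group
}
```

With this sketch, the values for "one hundred and fifty thousand two hundred and two" (with "and" stripped), that is 1, 100, 50, 1000, 2, 100, 2, combine to 150202.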

func CosineSimilarity ¶

func CosineSimilarity(a, b []string) float64

CosineSimilarity measures the cosine similarity of two string slices, for example two tokenized texts.
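As an illustration of the measure, here is a minimal sketch that treats each slice as a bag of tokens and compares the resulting term-frequency vectors. Whether the package weights tokens this way is an assumption; the helper name cosineSimilarity is hypothetical.

```go
package main

import "math"

// cosineSimilarity compares two token slices as term-frequency vectors.
// Assumption: plain token counts are used as vector components.
func cosineSimilarity(a, b []string) float64 {
	fa, fb := map[string]float64{}, map[string]float64{}
	for _, t := range a {
		fa[t]++
	}
	for _, t := range b {
		fb[t]++
	}
	var dot, na, nb float64
	for t, x := range fa {
		dot += x * fb[t] // a token missing from fb contributes 0
		na += x * x
	}
	for _, y := range fb {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0 // an empty slice has no direction to compare
	}
	return dot / math.Sqrt(na*nb)
}
```

Identical slices score close to 1, and slices with no token in common score 0.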

func DamerauLevenshtein ¶

func DamerauLevenshtein(s1 string, s2 string) (distance int)

DamerauLevenshtein calculates the Damerau-Levenshtein distance between two strings. See more at https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
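For readers unfamiliar with the metric, the sketch below computes the restricted variant of the distance (optimal string alignment), which counts insertions, deletions, substitutions, and transpositions of adjacent characters. It is not necessarily the variant the package implements, and the helper name osaDistance is hypothetical.

```go
package main

// osaDistance computes the optimal string alignment distance, the
// restricted form of Damerau-Levenshtein in which no substring is
// edited more than once.
func osaDistance(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	d := make([][]int, len(a)+1)
	for i := range d {
		d[i] = make([]int, len(b)+1)
		d[i][0] = i // delete all of a[:i]
	}
	for j := 0; j <= len(b); j++ {
		d[0][j] = j // insert all of b[:j]
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if v := d[i][j-1] + 1; v < best {
				best = v // insertion
			}
			if v := d[i-1][j-1] + cost; v < best {
				best = v // match or substitution
			}
			if i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1] {
				if v := d[i-2][j-2] + 1; v < best {
					best = v // transposition of adjacent characters
				}
			}
			d[i][j] = best
		}
	}
	return d[len(a)][len(b)]
}
```

In this sketch "ab" and "ba" are at distance 1 (one transposition), whereas plain Levenshtein distance would give 2.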

func LongestCommonPrefix ¶

func LongestCommonPrefix(strs ...string) string

LongestCommonPrefix takes any number of strings and finds their longest common prefix.
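One straightforward way to compute this is to start from the first string and shorten the candidate until every other string begins with it. The sketch below is an illustration under that approach, with a hypothetical helper name, and it compares bytes rather than runes.

```go
package main

import "strings"

// longestCommonPrefix shrinks a candidate prefix until all inputs share it.
// Assumption: byte-wise comparison, so multi-byte characters may be cut.
func longestCommonPrefix(strs ...string) string {
	if len(strs) == 0 {
		return ""
	}
	prefix := strs[0]
	for _, s := range strs[1:] {
		// The empty string is a prefix of everything, so this terminates.
		for !strings.HasPrefix(s, prefix) {
			prefix = prefix[:len(prefix)-1]
		}
	}
	return prefix
}
```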

func Pluralize ¶

func Pluralize(word string) string

Pluralize pluralizes words based on known rules.

func Singularize ¶

func Singularize(word string) string

Singularize singularizes words based on known rules.

func StrsToInts ¶

func StrsToInts(strs []string) (retVal []int, err error)

StrsToInts converts a string slice into an int slice, with the help of NumberWords. The function assumes all helper words like "and" have been stripped.

"One hundred and five" -> []string{"one", "hundred", "five"}

This is a very primitive method, and it does not take into account other phrasings like "a hundred" or "a couple of hundred".
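The conversion amounts to a table lookup per word. The sketch below illustrates it with a small hypothetical table, sampleNumberWords, standing in for the package's NumberWords, whose actual contents and type are not shown on this page.

```go
package main

import "fmt"

// sampleNumberWords is a hypothetical stand-in for the package's
// NumberWords table; only a handful of entries are included.
var sampleNumberWords = map[string]int{
	"one": 1, "two": 2, "five": 5, "fifty": 50,
	"hundred": 100, "thousand": 1000,
}

// strsToInts maps each word to its value and fails on unknown words.
func strsToInts(strs []string) ([]int, error) {
	out := make([]int, 0, len(strs))
	for _, s := range strs {
		n, ok := sampleNumberWords[s]
		if !ok {
			return nil, fmt.Errorf("unknown number word %q", s)
		}
		out = append(out, n)
	}
	return out, nil
}
```

Under these assumptions, []string{"one", "hundred", "five"} becomes []int{1, 100, 5}, while a word such as "a" produces an error.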

func ViterbiSplit ¶

func ViterbiSplit(input string, c *Corpus) []string

ViterbiSplit is a Viterbi algorithm for splitting words given a corpus
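To show the idea behind this kind of splitting, the sketch below segments a string by dynamic programming over unigram probabilities: for every position it keeps the most probable split of the text up to that position. It uses a plain frequency map instead of a *Corpus, assumes ASCII input, and gives unknown single characters an arbitrary penalty; all of these are assumptions for illustration, not the package's behavior.

```go
package main

import "math"

// viterbiSplit returns the most probable segmentation of input under a
// unigram model built from freq. Unknown single bytes are allowed with an
// assumed penalty so that a segmentation always exists.
func viterbiSplit(input string, freq map[string]int) []string {
	total, maxLen := 0.0, 1
	for w, f := range freq {
		total += float64(f)
		if len(w) > maxLen {
			maxLen = len(w)
		}
	}
	if total == 0 {
		total = 1
	}
	logProb := func(w string) (float64, bool) {
		if f, ok := freq[w]; ok && f > 0 {
			return math.Log(float64(f) / total), true
		}
		if len(w) == 1 {
			return math.Log(1 / (10 * total)), true // assumed unknown penalty
		}
		return 0, false
	}
	n := len(input)
	best := make([]float64, n+1) // best[i]: log probability of input[:i]
	back := make([]int, n+1)     // back[i]: start of the last word in input[:i]
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(-1)
		start := i - maxLen
		if start < 0 {
			start = 0
		}
		for j := start; j < i; j++ {
			lp, ok := logProb(input[j:i])
			if !ok {
				continue
			}
			if s := best[j] + lp; s > best[i] {
				best[i], back[i] = s, j
			}
		}
	}
	var words []string
	for i := n; i > 0; i = back[i] {
		words = append(words, input[back[i]:i])
	}
	// Words were collected from the end, so reverse them.
	for l, r := 0, len(words)-1; l < r; l, r = l+1, r-1 {
		words[l], words[r] = words[r], words[l]
	}
	return words
}
```

The candidate window is limited to the longest known word, which is the role a method like MaxWordLength could play in such an algorithm.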

Types ¶

type ConsOpt ¶

type ConsOpt func(c *Corpus) error

ConsOpt is a construction option for manual creation of a Corpus

func WithSize ¶

func WithSize(size int) ConsOpt

WithSize preallocates the internal storage of the Corpus for the given size.

type Corpus ¶

type Corpus struct {
	// contains filtered or unexported fields
}

Corpus is a data structure holding the relevant metadata and information for a corpus of text. It serves as a vocabulary with IDs for lookup. This is very useful because neural networks rely on the IDs rather than the text itself.

func Construct ¶

func Construct(opts ...ConsOpt) (*Corpus, error)

Construct creates a Corpus from the given construction options, which allows for more flexible setup.
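ConsOpt and Construct follow the functional options pattern: each option is a function that mutates the value under construction and may return an error. The sketch below reproduces the pattern on a hypothetical miniature type, miniCorpus, since the real Corpus fields are unexported; the option bodies are assumptions and do not describe what WithSize or WithWords actually do.

```go
package main

import "errors"

// miniCorpus is a hypothetical stand-in for Corpus.
type miniCorpus struct {
	words []string
	ids   map[string]int
}

// option mirrors the shape of ConsOpt: func(c *Corpus) error.
type option func(c *miniCorpus) error

// withSize preallocates storage. In this sketch it resets any words
// already present, so it should be passed before withWords.
func withSize(size int) option {
	return func(c *miniCorpus) error {
		if size < 0 {
			return errors.New("size must be non-negative")
		}
		c.words = make([]string, 0, size)
		c.ids = make(map[string]int, size)
		return nil
	}
}

// withWords adds each distinct word and assigns it the next free ID.
func withWords(ws []string) option {
	return func(c *miniCorpus) error {
		for _, w := range ws {
			if _, ok := c.ids[w]; ok {
				continue
			}
			c.ids[w] = len(c.words)
			c.words = append(c.words, w)
		}
		return nil
	}
}

// construct applies the options in order and stops at the first error.
func construct(opts ...option) (*miniCorpus, error) {
	c := &miniCorpus{ids: map[string]int{}}
	for _, opt := range opts {
		if err := opt(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}
```

A call such as construct(withSize(8), withWords([]string{"a", "b"})) then reads as a declarative description of the value being built.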

func GenerateCorpus ¶

func GenerateCorpus(sentenceTags []treebank.SentenceTag) *Corpus

GenerateCorpus creates a Corpus given a set of SentenceTag from a training set.

func New ¶

func New() *Corpus

New creates a new *Corpus

func (*Corpus) Add ¶

func (c *Corpus) Add(word string) int

Add adds a word to the corpus and returns its ID. If a word was previously in the corpus, it merely updates the frequency count and returns the ID
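The behavior described for Add, together with the ID and frequency lookups, can be pictured as a map from word to ID plus two parallel slices. The sketch below is a hypothetical miniature, countVocab, written only to illustrate those semantics; the real Corpus layout is unexported and may differ.

```go
package main

// countVocab is a hypothetical miniature vocabulary.
type countVocab struct {
	words []string       // ID -> word
	freqs []int          // ID -> frequency
	ids   map[string]int // word -> ID
}

func newCountVocab() *countVocab {
	return &countVocab{ids: map[string]int{}}
}

// add returns the ID of word, bumping its count if it is already known.
func (v *countVocab) add(word string) int {
	if id, ok := v.ids[word]; ok {
		v.freqs[id]++
		return id
	}
	id := len(v.words)
	v.ids[word] = id
	v.words = append(v.words, word)
	v.freqs = append(v.freqs, 1)
	return id
}

// wordFreq returns 0 for unknown words, matching the documented contract.
func (v *countVocab) wordFreq(word string) int {
	if id, ok := v.ids[word]; ok {
		return v.freqs[id]
	}
	return 0
}
```

In this sketch IDs are assigned in insertion order starting at 0, which is an assumption rather than documented behavior.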

func (*Corpus) GobDecode ¶

func (c *Corpus) GobDecode(buf []byte) error

GobDecode implements GobDecoder for *Corpus

func (*Corpus) GobEncode ¶

func (c *Corpus) GobEncode() ([]byte, error)

GobEncode implements GobEncoder for *Corpus
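Implementing gob.GobEncoder and gob.GobDecoder is the usual way to serialize a type whose fields are unexported, because encoding/gob otherwise only sees exported fields. The sketch below shows the pattern on a hypothetical type, gobVocab, using an exported wire struct; it does not reflect the actual wire format of Corpus.

```go
package main

import (
	"bytes"
	"encoding/gob"
)

// gobVocab is a hypothetical type with unexported fields.
type gobVocab struct {
	words []string
	freqs []int
}

// gobVocabWire is the exported shape that actually goes over the wire.
type gobVocabWire struct {
	Words []string
	Freqs []int
}

// GobEncode implements gob.GobEncoder.
func (v *gobVocab) GobEncode() ([]byte, error) {
	var buf bytes.Buffer
	err := gob.NewEncoder(&buf).Encode(gobVocabWire{Words: v.words, Freqs: v.freqs})
	if err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// GobDecode implements gob.GobDecoder.
func (v *gobVocab) GobDecode(b []byte) error {
	var w gobVocabWire
	if err := gob.NewDecoder(bytes.NewReader(b)).Decode(&w); err != nil {
		return err
	}
	v.words, v.freqs = w.Words, w.Freqs
	return nil
}

// roundTrip encodes v with gob and decodes it into a fresh value.
func roundTrip(v *gobVocab) (*gobVocab, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return nil, err
	}
	out := new(gobVocab)
	if err := gob.NewDecoder(&buf).Decode(out); err != nil {
		return nil, err
	}
	return out, nil
}
```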

func (*Corpus) IDFreq ¶

func (c *Corpus) IDFreq(id int) int

IDFreq returns the frequency of a word given an ID. If the word isn't in the corpus it returns 0.

func (*Corpus) Id ¶

func (c *Corpus) Id(word string) (int, bool)

Id returns the ID of a word and whether or not it was found in the corpus.

func (*Corpus) LoadOneGram ¶

func (c *Corpus) LoadOneGram(r io.Reader) error

LoadOneGram loads a 1_gram.txt file, a tab-separated file that lists the frequency counts of words. Example:

the	23135851162
of	13151942776
and	12997637966
to	12136980858
a	9081174698
in	8469404971
for	5933321709

func (*Corpus) MaxWordLength ¶

func (c *Corpus) MaxWordLength() int

MaxWordLength returns the length of the longest known word in the corpus.

func (*Corpus) Merge ¶

func (c *Corpus) Merge(other *Corpus)

Merge combines two corpora. The receiver is the one that is mutated; other is left unchanged.
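One plausible reading of merging is that every word of the other corpus is added to the receiver with its frequency, so shared words have their counts summed and new words receive fresh IDs. The sketch below illustrates that reading on a hypothetical type, mergeVocab; how the package actually reconciles IDs and counts is not documented here.

```go
package main

// mergeVocab is a hypothetical miniature vocabulary.
type mergeVocab struct {
	words []string
	freqs []int
	ids   map[string]int
}

func newMergeVocab() *mergeVocab {
	return &mergeVocab{ids: map[string]int{}}
}

// addN adds word with a count of n and returns its ID.
func (v *mergeVocab) addN(word string, n int) int {
	if id, ok := v.ids[word]; ok {
		v.freqs[id] += n
		return id
	}
	id := len(v.words)
	v.ids[word] = id
	v.words = append(v.words, word)
	v.freqs = append(v.freqs, n)
	return id
}

// merge folds other into v. Only the receiver is mutated, and IDs from
// other are not preserved: words are re-keyed in the receiver.
func (v *mergeVocab) merge(other *mergeVocab) {
	for id, w := range other.words {
		v.addN(w, other.freqs[id])
	}
}
```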

func (*Corpus) Size ¶

func (c *Corpus) Size() int

Size returns the size of the corpus.

func (*Corpus) TotalFreq ¶

func (c *Corpus) TotalFreq() int

TotalFreq returns the total number of words ever seen by the corpus. This number includes the count of repeat words.

func (*Corpus) Word ¶

func (c *Corpus) Word(id int) (string, bool)

Word returns the word given the ID, and whether or not it was found in the corpus

func (*Corpus) WordFreq ¶

func (c *Corpus) WordFreq(word string) int

WordFreq returns the frequency of the word. If the word wasn't in the corpus, it returns 0.

func (*Corpus) WordProb ¶

func (c *Corpus) WordProb(word string) (float64, bool)

WordProb returns the probability of a word appearing in the corpus.
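A natural estimate for this probability is the word's frequency divided by the total frequency of all words, which ties WordFreq and TotalFreq together. The sketch below computes that ratio from a plain map; that the package uses exactly this estimate, and that the boolean reports whether the word is known, are assumptions.

```go
package main

// wordProb returns freq[word] / sum(freq) and whether the word is known.
// Assumption: a maximum-likelihood unigram estimate with no smoothing.
func wordProb(freq map[string]int, word string) (float64, bool) {
	f, ok := freq[word]
	if !ok {
		return 0, false
	}
	total := 0
	for _, n := range freq {
		total += n
	}
	if total == 0 {
		return 0, true
	}
	return float64(f) / float64(total), true
}
```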

type LDAModel ¶

type LDAModel struct {
	// params
	Alpha tensor.Tensor   // is a Row
	Eta   tensor.Tensor   // is a Col
	Kappa gorgonia.Scalar // Decay
	Tau0  gorgonia.Scalar // offset

	// parameters needed for working
	Topics      int
	ChunkSize   int
	Terms       int
	UpdateEvery int
	EvalEvery   int

	// consts
	Iterations     int
	GammaThreshold float64

	MinimumProb float64

	// track current progress
	Updates int

	// type
	Dtype tensor.Dtype
}

LDAModel ... TODO https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
