Documentation ¶
Overview ¶
Package prose is a text-processing library supporting tokenization, part-of-speech tagging, and named-entity extraction.
Index ¶
- Constants
- func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
- func ReadAndDecodeBytes(filename string) (*gob.Decoder, error)
- func ReadBytes(filename string) ([]byte, error)
- type DataSource
- type DocOpt
- type DocOpts
- type Document
- type Entity
- type EntityContext
- type LabeledEntity
- type Model
- type PerceptronTagger
- type Sentence
- type Token
- type TokenTester
- type Tokenizer
- type TokenizerOptFunc
- func UsingContractions(x []string) TokenizerOptFunc
- func UsingEmoticons(x map[string]struct{}) TokenizerOptFunc
- func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
- func UsingPrefixes(x []string) TokenizerOptFunc
- func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
- func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
- func UsingSplitCases(x []string) TokenizerOptFunc
- func UsingSuffixes(x []string) TokenizerOptFunc
- type TupleSlice
Examples ¶
Constants ¶
const NoneFeat = "None"
Variables ¶
This section is empty.
Functions ¶
func NewIterTokenizer ¶
func NewIterTokenizer(opts ...TokenizerOptFunc) *iterTokenizer
NewIterTokenizer creates a default iterTokenizer, applying any provided options.
func ReadAndDecodeBytes ¶
func ReadAndDecodeBytes(filename string) (*gob.Decoder, error)
ReadAndDecodeBytes reads an embedded file into a gob decoder.
Types ¶
type DataSource ¶
type DataSource func(model *Model)
DataSource provides training data to a Model.
func UsingEntities ¶
func UsingEntities(data []EntityContext) DataSource
UsingEntities creates a named-entity recognition (NER) model from labeled data.
func UsingEntitiesAndTokenizer ¶
func UsingEntitiesAndTokenizer(data []EntityContext, tokenizer Tokenizer) DataSource
UsingEntitiesAndTokenizer creates a NER model from labeled data and a custom tokenizer.
type DocOpt ¶
A DocOpt represents a setting that changes the document creation process.
For example, it might disable named-entity extraction:
doc := prose.NewDocument("...", prose.WithExtraction(false))
func UsingModel ¶
UsingModel specifies the Model to use for named-entity extraction.
func UsingTokenizer ¶
UsingTokenizer specifies the Tokenizer to use.
func WithExtraction ¶
WithExtraction can enable (the default) or disable named-entity extraction.
func WithSegmentation ¶
WithSegmentation can enable (the default) or disable sentence segmentation.
func WithTagging ¶
WithTagging can enable (the default) or disable POS tagging.
func WithTokenization ¶
WithTokenization can enable (the default) or disable tokenization.

Deprecated: use UsingTokenizer instead.
type DocOpts ¶
type DocOpts struct {
	Extract   bool      // If true, include named-entity extraction
	Segment   bool      // If true, include segmentation
	Tag       bool      // If true, include POS tagging
	Tokenizer Tokenizer // The Tokenizer to use
}
DocOpts controls the Document creation process.
type Document ¶
A Document represents a parsed body of text.
func NewDocument ¶
NewDocument creates a Document according to the user-specified options.
For example,
doc := prose.NewDocument("...")
type Entity ¶
type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}
An Entity represents an individual named-entity.
type EntityContext ¶
type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected"
	// by its user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}
EntityContext represents text containing named-entities.
type LabeledEntity ¶
LabeledEntity represents an externally-labeled named-entity.
type Model ¶
type Model struct {
	Name string
	// contains filtered or unexported fields
}
A Model holds the structures and data used internally by prose.
func ModelFromData ¶
func ModelFromData(name string, sources ...DataSource) (*Model, error)
ModelFromData creates a new Model from user-provided training data.
func ModelFromDisk ¶
ModelFromDisk loads a Model from the user-provided location.
func ModelFromFS ¶
ModelFromFS loads a Model from the provided file system.
type PerceptronTagger ¶
type PerceptronTagger struct {
// contains filtered or unexported fields
}
PerceptronTagger is a port of TextBlob's "fast and accurate" POS tagger. See https://github.com/sloria/textblob-aptagger for details.
func NewPerceptronTagger ¶
func NewPerceptronTagger() (*PerceptronTagger, error)
NewPerceptronTagger creates a new PerceptronTagger and loads the built-in AveragedPerceptron model.
func (*PerceptronTagger) Tag ¶
func (pt *PerceptronTagger) Tag(tokens []*Token) []*Token
Tag takes a slice of words and returns a slice of tagged tokens.
type Sentence ¶
type Sentence struct {
Text string // The sentence's text.
}
A Sentence represents a segmented portion of text.
type Token ¶
type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}
A Token represents an individual token of text such as a word or punctuation symbol.
type TokenTester ¶
type TokenTester func(string) bool
type TokenizerOptFunc ¶
type TokenizerOptFunc func(*iterTokenizer)
func UsingContractions ¶
func UsingContractions(x []string) TokenizerOptFunc
Use the provided contractions.
func UsingEmoticons ¶
func UsingEmoticons(x map[string]struct{}) TokenizerOptFunc
Use the provided map of emoticons.
func UsingIsUnsplittable ¶
func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc
UsingIsUnsplittable sets a function that tests whether a token is unsplittable.
func UsingSanitizer ¶
func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc
Use the provided sanitizer.
func UsingSpecialRE ¶
func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc
Use the provided special regex for unsplittable tokens.
func UsingSplitCases ¶
func UsingSplitCases(x []string) TokenizerOptFunc
Use the provided splitCases.
type TupleSlice ¶
type TupleSlice [][][]string
TupleSlice is a slice of tuples in the form (words, tags).
func ReadTagged ¶
func ReadTagged(text, sep string) TupleSlice
ReadTagged converts pre-tagged input into a TupleSlice suitable for training.
Example ¶
tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))
Output: [[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]
func (TupleSlice) Swap ¶
func (t TupleSlice) Swap(i, j int)
Swap switches the ith and jth elements in a TupleSlice.