package prose

v0.3.0 · Published: Apr 11, 2021 · License: MIT · Imports: 21 · Imported by: 0

Documentation

Overview

Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Asset

func Asset(name string) ([]byte, error)

Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetDir

func AssetDir(name string) ([]string, error)

AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then AssetDir("data") returns []string{"foo.txt", "img"},
AssetDir("data/img") returns []string{"a.png", "b.png"},
AssetDir("foo.txt") and AssetDir("notexist") return an error, and
AssetDir("") returns []string{"data"}.

func AssetInfo

func AssetInfo(name string) (os.FileInfo, error)

AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetNames

func AssetNames() []string

AssetNames returns the names of the assets.

func MustAsset

func MustAsset(name string) []byte

MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.

func RestoreAsset

func RestoreAsset(dir, name string) error

RestoreAsset restores an asset under the given directory.

func RestoreAssets

func RestoreAssets(dir, name string) error

RestoreAssets recursively restores an asset under the given directory.

Types

type DataSource

type DataSource func(model *Model)

DataSource provides training data to a Model.

func UsingEntities

func UsingEntities(data []EntityContext) DataSource

UsingEntities creates a DataSource of labeled named-entity data, used to train a Model's NER component.

type DocOpt

type DocOpt func(doc *Document, opts *DocOpts)

A DocOpt represents a setting that changes the document creation process.

For example, it might disable named-entity extraction:

doc := prose.NewDocument("...", prose.WithExtraction(false))

func UsingModel

func UsingModel(model *Model) DocOpt

UsingModel instructs the document creation process to use the given Model in place of the default one.

func WithExtraction

func WithExtraction(include bool) DocOpt

WithExtraction can enable (the default) or disable named-entity extraction.

func WithSegmentation

func WithSegmentation(include bool) DocOpt

WithSegmentation can enable (the default) or disable sentence segmentation.

func WithTagging

func WithTagging(include bool) DocOpt

WithTagging can enable (the default) or disable POS tagging.

func WithTokenization

func WithTokenization(include bool) DocOpt

WithTokenization can enable (the default) or disable tokenization.

type DocOpts

type DocOpts struct {
	Extract  bool // If true, include named-entity extraction
	Segment  bool // If true, include segmentation
	Tag      bool // If true, include POS tagging
	Tokenize bool // If true, include tokenization
}

DocOpts controls the Document creation process.

type Document

type Document struct {
	Model *Model
	Text  string
	// contains filtered or unexported fields
}

A Document represents a parsed body of text.

func NewDocument

func NewDocument(text string, opts ...DocOpt) (*Document, error)

NewDocument creates a Document according to the user-specified options.

For example,

doc := prose.NewDocument("...")

func (*Document) Entities

func (doc *Document) Entities() []Entity

Entities returns `doc`'s entities.

func (*Document) Sentences

func (doc *Document) Sentences() []Sentence

Sentences returns `doc`'s sentences.

func (*Document) Tokens

func (doc *Document) Tokens() []Token

Tokens returns `doc`'s tokens.

type Entity

type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}

An Entity represents an individual named-entity.

type EntityContext

type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected" by
	// the user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}

EntityContext represents text containing named-entities.

type IterTokenizer

type IterTokenizer struct {
}

IterTokenizer splits a sentence into words.

func NewIterTokenizer

func NewIterTokenizer() *IterTokenizer

NewIterTokenizer is an IterTokenizer constructor.

func (*IterTokenizer) Tokenize

func (t *IterTokenizer) Tokenize(text string) []*Token

Tokenize splits a sentence into a slice of words.

type LabeledEntity

type LabeledEntity struct {
	Start int
	End   int
	Label string
}

LabeledEntity represents an externally-labeled named-entity.

type Model

type Model struct {
	Name string

	Tagger *PerceptronTagger
	// contains filtered or unexported fields
}

A Model holds the structures and data used internally by prose.

func DefaultModel

func DefaultModel(tagging, classifying bool) *Model

DefaultModel returns the package's built-in Model, enabling POS tagging and named-entity classification according to the `tagging` and `classifying` arguments.

func ModelFromData

func ModelFromData(name string, sources ...DataSource) *Model

ModelFromData creates a new Model from user-provided training data.

func ModelFromDisk

func ModelFromDisk(path string) *Model

ModelFromDisk loads a Model from the user-provided location.

func (*Model) Write

func (m *Model) Write(path string) error

Write saves a Model to the user-provided location.

type PerceptronTagger

type PerceptronTagger struct {
	// contains filtered or unexported fields
}

PerceptronTagger is a port of TextBlob's "fast and accurate" POS tagger. See https://github.com/sloria/textblob-aptagger for details.

func (*PerceptronTagger) Tag

func (pt *PerceptronTagger) Tag(tokens []*Token) []*Token

Tag takes a slice of words and returns a slice of tagged tokens.

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
	// contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Segment

func (p PunktSentenceTokenizer) Segment(text string) []Sentence

Segment splits text into sentences.

type Sentence

type Sentence struct {
	Text string // The sentence's text.
}

A Sentence represents a segmented portion of text.

type Token

type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}

A Token represents an individual token of text such as a word or punctuation symbol.

type TupleSlice

type TupleSlice [][][]string

TupleSlice is a slice of tuples in the form (words, tags).

func ReadTagged

func ReadTagged(text, sep string) TupleSlice

ReadTagged converts pre-tagged input into a TupleSlice suitable for training.

Example
tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))
Output:

[[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]

func (TupleSlice) Len

func (t TupleSlice) Len() int

Len returns the length of a TupleSlice.

func (TupleSlice) Swap

func (t TupleSlice) Swap(i, j int)

Swap switches the ith and jth elements in a TupleSlice.
