package prose

v0.3.0 · Published: Apr 11, 2021 · License: MIT · Imports: 21 · Imported by: 0

Documentation

Overview

Package prose is a repository of packages related to text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Asset

func Asset(name string) ([]byte, error)

Asset loads and returns the asset for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetDir

func AssetDir(name string) ([]string, error)

AssetDir returns the file names below a certain directory embedded in the file by go-bindata. For example if you run go-bindata on data/... and data contains the following hierarchy:

data/
  foo.txt
  img/
    a.png
    b.png

then AssetDir("data") returns []string{"foo.txt", "img"},
AssetDir("data/img") returns []string{"a.png", "b.png"},
AssetDir("foo.txt") and AssetDir("notexist") return an error, and
AssetDir("") returns []string{"data"}.

func AssetInfo

func AssetInfo(name string) (os.FileInfo, error)

AssetInfo loads and returns the asset info for the given name. It returns an error if the asset could not be found or could not be loaded.

func AssetNames

func AssetNames() []string

AssetNames returns the names of the assets.

func MustAsset

func MustAsset(name string) []byte

MustAsset is like Asset but panics when Asset would return an error. It simplifies safe initialization of global variables.

func RestoreAsset

func RestoreAsset(dir, name string) error

RestoreAsset restores an asset under the given directory.

func RestoreAssets

func RestoreAssets(dir, name string) error

RestoreAssets recursively restores an asset under the given directory.

Types

type DataSource

type DataSource func(model *Model)

DataSource provides training data to a Model.

func UsingEntities

func UsingEntities(data []EntityContext) DataSource

UsingEntities creates a DataSource of labeled named-entity data, used to train a Model's NER component.

type DocOpt

type DocOpt func(doc *Document, opts *DocOpts)

A DocOpt represents a setting that changes the document creation process.

For example, it might disable named-entity extraction:

doc := prose.NewDocument("...", prose.WithExtraction(false))

func UsingModel

func UsingModel(model *Model) DocOpt

UsingModel instructs the document creation process to use the given Model in place of the default one.

func WithExtraction

func WithExtraction(include bool) DocOpt

WithExtraction can enable (the default) or disable named-entity extraction.

func WithSegmentation

func WithSegmentation(include bool) DocOpt

WithSegmentation can enable (the default) or disable sentence segmentation.

func WithTagging

func WithTagging(include bool) DocOpt

WithTagging can enable (the default) or disable POS tagging.

func WithTokenization

func WithTokenization(include bool) DocOpt

WithTokenization can enable (the default) or disable tokenization.

type DocOpts

type DocOpts struct {
	Extract  bool // If true, include named-entity extraction
	Segment  bool // If true, include segmentation
	Tag      bool // If true, include POS tagging
	Tokenize bool // If true, include tokenization
}

DocOpts controls the Document creation process.

type Document

type Document struct {
	Model *Model
	Text  string
	// contains filtered or unexported fields
}

A Document represents a parsed body of text.

func NewDocument

func NewDocument(text string, opts ...DocOpt) (*Document, error)

NewDocument creates a Document according to the user-specified options.

For example,

doc := prose.NewDocument("...")

func (*Document) Entities

func (doc *Document) Entities() []Entity

Entities returns `doc`'s entities.

func (*Document) Sentences

func (doc *Document) Sentences() []Sentence

Sentences returns `doc`'s sentences.

func (*Document) Tokens

func (doc *Document) Tokens() []Token

Tokens returns `doc`'s tokens.

type Entity

type Entity struct {
	Text  string // The entity's actual content.
	Label string // The entity's label.
}

An Entity represents an individual named-entity.

type EntityContext

type EntityContext struct {
	// Is this a correct entity?
	//
	// Some annotation software, e.g. Prodigy, includes entities "rejected" by
	// the user. This allows us to handle those cases.
	Accept bool

	Spans []LabeledEntity // The entity locations relative to `Text`.
	Text  string          // The sentence containing the entities.
}

EntityContext represents text containing named-entities.

type IterTokenizer

type IterTokenizer struct {
}

IterTokenizer splits a sentence into words.

func NewIterTokenizer

func NewIterTokenizer() *IterTokenizer

NewIterTokenizer is an IterTokenizer constructor.

func (*IterTokenizer) Tokenize

func (t *IterTokenizer) Tokenize(text string) []*Token

Tokenize splits a sentence into a slice of words.

type LabeledEntity

type LabeledEntity struct {
	Start int
	End   int
	Label string
}

LabeledEntity represents an externally-labeled named-entity.

type Model

type Model struct {
	Name string

	Tagger *PerceptronTagger
	// contains filtered or unexported fields
}

A Model holds the structures and data used internally by prose.

func DefaultModel

func DefaultModel(tagging, classifying bool) *Model

DefaultModel returns the package's built-in Model, enabling POS tagging and named-entity classification according to the `tagging` and `classifying` arguments.

func ModelFromData

func ModelFromData(name string, sources ...DataSource) *Model

ModelFromData creates a new Model from user-provided training data.

func ModelFromDisk

func ModelFromDisk(path string) *Model

ModelFromDisk loads a Model from the user-provided location.

func (*Model) Write

func (m *Model) Write(path string) error

Write saves a Model to the user-provided location.

type PerceptronTagger

type PerceptronTagger struct {
	// contains filtered or unexported fields
}

PerceptronTagger is a port of TextBlob's "fast and accurate" POS tagger. See https://github.com/sloria/textblob-aptagger for details.

func (*PerceptronTagger) Tag

func (pt *PerceptronTagger) Tag(tokens []*Token) []*Token

Tag takes a slice of words and returns a slice of tagged tokens.

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
	// contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Segment

func (p PunktSentenceTokenizer) Segment(text string) []Sentence

Segment splits text into sentences.

type Sentence

type Sentence struct {
	Text string // The sentence's text.
}

A Sentence represents a segmented portion of text.

type Token

type Token struct {
	Tag   string // The token's part-of-speech tag.
	Text  string // The token's actual content.
	Label string // The token's IOB label.
}

A Token represents an individual token of text such as a word or punctuation symbol.

type TupleSlice

type TupleSlice [][][]string

TupleSlice is a slice of tuples in the form (words, tags).

func ReadTagged

func ReadTagged(text, sep string) TupleSlice

ReadTagged converts pre-tagged input into a TupleSlice suitable for training.

Example
tagged := "Pierre|NNP Vinken|NNP ,|, 61|CD years|NNS"
fmt.Println(ReadTagged(tagged, "|"))
Output:

[[[Pierre Vinken , 61 years] [NNP NNP , CD NNS]]]

func (TupleSlice) Len

func (t TupleSlice) Len() int

Len returns the length of a TupleSlice.

func (TupleSlice) Swap

func (t TupleSlice) Swap(i, j int)

Swap switches the ith and jth elements in a TupleSlice.
