sentences


README


Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Features

  • Supports multiple languages (English, Czech, Dutch, Estonian, Finnish, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, and Turkish)
  • Zero dependencies
  • Extendable
  • Fast

Install

arch

Install from the AUR.

mac

brew tap neurosnap/sentences
brew install sentences
other

You can also find pre-built binaries on the GitHub releases page.

using go

go get github.com/neurosnap/sentences
go install github.com/neurosnap/sentences/cmd/sentences@latest

Command line
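
A hypothetical invocation (this assumes the binary reads text on stdin and prints one sentence per line):

echo "Hi there. Does this really work?" | sentences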

Get it

go get github.com/neurosnap/sentences

Use it

package main

import (
    "fmt"
    "os"

    "github.com/neurosnap/sentences"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Download the training data from this repo (./data) and save it somewhere.
    b, err := os.ReadFile("./path/to/english.json")
    if err != nil {
        panic(err)
    }

    // Load the training data.
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }

    // Create the default sentence tokenizer.
    tokenizer := sentences.NewSentenceTokenizer(training)

    // Named sents to avoid shadowing the imported package.
    sents := tokenizer.Tokenize(text)

    for _, s := range sents {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some problems I noticed for English.

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}
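
This should print each sentence on its own line:

Hi there.
Does this really work?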

Contributing

I need help maintaining this library. If you are interested in contributing, please start by looking at the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.

Pick a particular failing test and submit an issue or PR for it.

I'm happy to help anyone willing to contribute.

Customize

sentences was built around composability; most major components of this package can be extended.

Eager to make ad-hoc changes but don't know how to start? Have a look at github.com/neurosnap/sentences/english for a solid example.
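
As a minimal sketch of one way to extend the tokenizer, the hypothetical annotation pass below appends to the tokenizer's Annotations list to force an extra abbreviation, using only the exported API described in the Documentation section (the abbreviation list itself is made up):

package main

import (
    "fmt"
    "os"
    "strings"

    "github.com/neurosnap/sentences"
)

// extraAbbrevs is a hypothetical annotation pass that keeps a few custom
// abbreviations from being treated as sentence breaks.
type extraAbbrevs struct {
    words map[string]bool
}

// Annotate satisfies the AnnotateTokens interface.
func (e *extraAbbrevs) Annotate(tokens []*sentences.Token) []*sentences.Token {
    for _, t := range tokens {
        if e.words[strings.ToLower(t.Tok)] {
            t.Abbr = true
            t.SentBreak = false
        }
    }
    return tokens
}

func main() {
    b, err := os.ReadFile("./path/to/english.json") // training data from ./data
    if err != nil {
        panic(err)
    }
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }

    tokenizer := sentences.NewSentenceTokenizer(training)
    // DefaultSentenceTokenizer exposes its Annotations slice; run the
    // custom pass after the defaults.
    tokenizer.Annotations = append(tokenizer.Annotations, &extraAbbrevs{
        words: map[string]bool{"approx.": true},
    })

    for _, s := range tokenizer.Tokenize("It costs approx. $10. Cheap.") {
        fmt.Println(s.Text)
    }
}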

Notice

I have not tested this tokenizer in any language besides English. By default, the command line utility loads English. I welcome anyone willing to test the other languages and submit updates as needed.

A primary goal for this package is to be multilingual so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this goal by training the tokenizer with text in a given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.

There are many problems that arise when tokenizing text into sentences, the primary issue being abbreviations: in the example above, the periods in "U.S. Rep." mark abbreviations, while the period ending "but Stallings won." marks a sentence boundary, and a single period can even do both. The Punkt system attempts to determine whether a word is an abbreviation, an end to a sentence, or even both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.

Original paper: Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2006).

Performance

Using the Brown Corpus, an annotated corpus of American English text, we compare this package with other libraries across multiple programming languages.

Library     Avg Speed (s, 10 runs)    Accuracy (%)
Sentences   1.96                      98.95
NLTK        5.22                      99.21

Documentation

Overview

Package sentences is a Go package that converts a blob of text into a list of sentences.

This package attempts to support a multitude of languages: Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, and Turkish.

An unsupervised multilingual sentence boundary detection library for Go. The goal of this library is to break up any text into a list of sentences, in multiple languages. The Punkt system accomplishes this goal by training the tokenizer with text in a given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.

There are many problems that arise when tokenizing text into sentences, the primary issue being abbreviations. The Punkt system attempts to determine whether a word is an abbreviation, an end to a sentence, or even both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.

Original research article: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsCjkPunct added in v1.0.7

func IsCjkPunct(r rune) bool

Types

type AnnotateTokens

type AnnotateTokens interface {
	Annotate([]*Token) []*Token
}

AnnotateTokens is an interface used for the sentence tokenizer to add properties to any given token during tokenization.

func NewAnnotations

func NewAnnotations(s *Storage, p PunctStrings, word WordTokenizer) []AnnotateTokens

NewAnnotations returns the default list of AnnotateTokens annotators used by the tokenizer.

type DefaultPunctStrings

type DefaultPunctStrings struct{}

DefaultPunctStrings are used to detect punctuation in the sentence tokenizer.

func NewPunctStrings

func NewPunctStrings() *DefaultPunctStrings

NewPunctStrings creates a default set of punctuation properties.

func (*DefaultPunctStrings) HasSentencePunct

func (p *DefaultPunctStrings) HasSentencePunct(text string) bool

HasSentencePunct reports whether the supplied text contains a known sentence punctuation character.

func (*DefaultPunctStrings) NonPunct

func (p *DefaultPunctStrings) NonPunct() string

NonPunct returns a regex string used to detect non-punctuation.

func (*DefaultPunctStrings) Punctuation

func (p *DefaultPunctStrings) Punctuation() string

Punctuation returns the punctuation characters.

type DefaultSentenceTokenizer

type DefaultSentenceTokenizer struct {
	*Storage
	WordTokenizer
	PunctStrings
	Annotations []AnnotateTokens
}

DefaultSentenceTokenizer is a sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.

func NewSentenceTokenizer

func NewSentenceTokenizer(s *Storage) *DefaultSentenceTokenizer

NewSentenceTokenizer creates a sentence tokenizer with sane defaults.

func NewTokenizer

func NewTokenizer(s *Storage, word WordTokenizer, lang PunctStrings) *DefaultSentenceTokenizer

NewTokenizer wraps DefaultSentenceTokenizer, doing the work of customizing the tokenizer.

func (*DefaultSentenceTokenizer) AnnotateTokens

func (s *DefaultSentenceTokenizer) AnnotateTokens(tokens []*Token, annotate ...AnnotateTokens) []*Token

AnnotateTokens, given a set of tokens augmented with markers for line-start and paragraph-start, returns those tokens with full annotation, including predicted sentence breaks.

func (*DefaultSentenceTokenizer) AnnotatedTokens

func (s *DefaultSentenceTokenizer) AnnotatedTokens(text string) []*Token

AnnotatedTokens returns the fully annotated word tokens, which allows for ad hoc adjustments to the tokens.

func (*DefaultSentenceTokenizer) SentencePositions

func (s *DefaultSentenceTokenizer) SentencePositions(text string) []int

SentencePositions returns an array of sentence-boundary positions instead of an array of sentences.
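
A minimal sketch of both helpers (this assumes, as the README example suggests, that english.NewSentenceTokenizer returns the default tokenizer; the sample text is arbitrary):

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    text := "Mr. Smith arrived. He left."

    // Inspect the fully annotated word tokens.
    for _, t := range tokenizer.AnnotatedTokens(text) {
        fmt.Printf("%q sentBreak=%v abbr=%v\n", t.Tok, t.SentBreak, t.Abbr)
    }

    // Boundary positions instead of Sentence structs.
    fmt.Println(tokenizer.SentencePositions(text))
}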

func (*DefaultSentenceTokenizer) Tokenize

func (s *DefaultSentenceTokenizer) Tokenize(text string) []*Sentence

Tokenize splits text input into sentence tokens.

type DefaultTokenGrouper

type DefaultTokenGrouper struct{}

DefaultTokenGrouper is the default implementation of TokenGrouper

func (*DefaultTokenGrouper) Group

func (p *DefaultTokenGrouper) Group(tokens []*Token) [][2]*Token

Group is the primary logic for implementing TokenGrouper

type DefaultWordTokenizer

type DefaultWordTokenizer struct {
	PunctStrings
}

DefaultWordTokenizer is the default implementation of the WordTokenizer

func NewWordTokenizer

func NewWordTokenizer(p PunctStrings) *DefaultWordTokenizer

NewWordTokenizer creates a new DefaultWordTokenizer

func (*DefaultWordTokenizer) FirstLower

func (p *DefaultWordTokenizer) FirstLower(t *Token) bool

FirstLower is true if the token's first character is lowercase

func (*DefaultWordTokenizer) FirstUpper

func (p *DefaultWordTokenizer) FirstUpper(t *Token) bool

FirstUpper is true if the token's first character is uppercase.

func (*DefaultWordTokenizer) HasPeriodFinal

func (p *DefaultWordTokenizer) HasPeriodFinal(t *Token) bool

HasPeriodFinal is true if the last character in the word is a period

func (*DefaultWordTokenizer) HasSentEndChars

func (p *DefaultWordTokenizer) HasSentEndChars(t *Token) bool

HasSentEndChars finds any sentence-ending punctuation, excluding a final period.

func (*DefaultWordTokenizer) HasUnreliableEndChars added in v1.1.0

func (p *DefaultWordTokenizer) HasUnreliableEndChars(t *Token) bool

HasUnreliableEndChars finds any punctuation that might mark the end of a sentence but does not have to.

func (*DefaultWordTokenizer) IsAlpha

func (p *DefaultWordTokenizer) IsAlpha(t *Token) bool

IsAlpha is true if the token text is all alphabetic.

func (*DefaultWordTokenizer) IsCoordinatePartOne added in v1.1.2

func (p *DefaultWordTokenizer) IsCoordinatePartOne(t *Token) bool

IsCoordinatePartOne is true if the token text might be the first part of a coordinate.

func (*DefaultWordTokenizer) IsCoordinatePartTwo added in v1.1.2

func (p *DefaultWordTokenizer) IsCoordinatePartTwo(t *Token) bool

IsCoordinatePartTwo is true if the token text might be the second part of a coordinate.

func (*DefaultWordTokenizer) IsEllipsis

func (p *DefaultWordTokenizer) IsEllipsis(t *Token) bool

IsEllipsis is true if the token text is that of an ellipsis.

func (*DefaultWordTokenizer) IsInitial

func (p *DefaultWordTokenizer) IsInitial(t *Token) bool

IsInitial is true if the token text is that of an initial.

func (*DefaultWordTokenizer) IsListNumber added in v1.1.0

func (p *DefaultWordTokenizer) IsListNumber(t *Token) bool

IsListNumber is true if the token text is that of a list number.

func (*DefaultWordTokenizer) IsNonPunct

func (p *DefaultWordTokenizer) IsNonPunct(t *Token) bool

IsNonPunct is true if the token is either a number or is alphabetic.

func (*DefaultWordTokenizer) IsNumber

func (p *DefaultWordTokenizer) IsNumber(t *Token) bool

IsNumber is true if the token text is that of a number.

func (*DefaultWordTokenizer) Tokenize

func (p *DefaultWordTokenizer) Tokenize(text string, onlyPeriodContext bool) []*Token

Tokenize breaks text into words while preserving each word's character position and whether it starts a new line or a new paragraph.

func (*DefaultWordTokenizer) Type

func (p *DefaultWordTokenizer) Type(t *Token) string

Type returns a case-normalized representation of the token.

func (*DefaultWordTokenizer) TypeNoPeriod

func (p *DefaultWordTokenizer) TypeNoPeriod(t *Token) string

TypeNoPeriod is the type with its final period removed if it has one.

func (*DefaultWordTokenizer) TypeNoSentPeriod

func (p *DefaultWordTokenizer) TypeNoSentPeriod(t *Token) string

TypeNoSentPeriod is the type with its final period removed if it is marked as a sentence break.

type Ortho

type Ortho interface {
	Heuristic(*Token) int
}

Ortho is an interface for structs that implement an orthographic heuristic method.

type OrthoContext

type OrthoContext struct {
	*Storage
	PunctStrings
	TokenType
	TokenFirst
}

OrthoContext determines whether a token is capitalized, a sentence starter, etc.

func (*OrthoContext) Heuristic

func (o *OrthoContext) Heuristic(token *Token) int

Heuristic decides whether the given token is the first token in a sentence.

type PunctStrings

type PunctStrings interface {
	NonPunct() string
	Punctuation() string
	HasSentencePunct(string) bool
}

PunctStrings implements all the functions necessary for punctuation strings. They are used to detect punctuation in the sentence tokenizer.

type Sentence

type Sentence struct {
	Start int    `json:"start"`
	End   int    `json:"end"`
	Text  string `json:"text"`
}

Sentence is a container for a single sentence, providing its character positions as well as its text.

func (Sentence) String

func (s Sentence) String() string

type SentenceTokenizer

type SentenceTokenizer interface {
	AnnotateTokens([]*Token, ...AnnotateTokens) []*Token
	Tokenize(string) []*Sentence
}

SentenceTokenizer is the interface used by the Tokenize function; it can be extended to correct sentence boundaries that Punkt misses.

type SetString

type SetString map[string]int

SetString is an implementation of a set of strings; probably not the best way to do this, but oh well.

func (SetString) Add

func (ss SetString) Add(str string)

Add adds a string key to the set

func (SetString) Array

func (ss SetString) Array() []string

Array returns an array of keys from the set.

func (SetString) Has

func (ss SetString) Has(str string) bool

Has checks whether a key exists in the set

func (SetString) Remove

func (ss SetString) Remove(str string)

Remove deletes a string key from the set
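
A tiny usage sketch, using only the methods listed above:

package main

import (
    "fmt"

    "github.com/neurosnap/sentences"
)

func main() {
    ss := sentences.SetString{}
    ss.Add("etc")
    fmt.Println(ss.Has("etc")) // true
    fmt.Println(ss.Array())    // [etc]
    ss.Remove("etc")
    fmt.Println(ss.Has("etc")) // false
}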

type Storage

type Storage struct {
	AbbrevTypes  SetString `json:"AbbrevTypes"`
	Collocations SetString `json:"Collocations"`
	SentStarters SetString `json:"SentStarters"`
	OrthoContext SetString `json:"OrthoContext"`
}

Storage stores data used to perform sentence boundary detection with Punkt. This is where all the training data gets stored for future use.

func LoadTraining

func LoadTraining(data []byte) (*Storage, error)

LoadTraining is the primary function to load JSON training data. By default, the sentence tokenizer loads English automatically, but other languages can be loaded into a binary file using the `make <lang>` command.

func NewStorage

func NewStorage() *Storage

NewStorage creates the default storage container

func (*Storage) IsAbbr

func (p *Storage) IsAbbr(tokens ...string) bool

IsAbbr determines whether any of the tokens are an abbreviation.
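
A short sketch (the training-data path is the same placeholder as in the README example, and the exact key format for learned abbreviations is an assumption):

package main

import (
    "fmt"
    "os"

    "github.com/neurosnap/sentences"
)

func main() {
    b, err := os.ReadFile("./path/to/english.json")
    if err != nil {
        panic(err)
    }
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }
    // Reports whether any of the supplied tokens was learned as an abbreviation.
    fmt.Println(training.IsAbbr("mr", "mrs"))
}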

type Token

type Token struct {
	Tok       string
	Position  int
	SentBreak bool
	ParaStart bool
	LineStart bool
	Abbr      bool
	// contains filtered or unexported fields
}

Token stores a token of text with annotations produced during sentence boundary detection.

func NewToken

func NewToken(token string) *Token

NewToken creates a Token with default values.

func (*Token) String

func (p *Token) String() string

String is the string representation of Token

type TokenBasedAnnotation

type TokenBasedAnnotation struct {
	*Storage
	PunctStrings
	TokenParser
	TokenGrouper
	Ortho
}

TokenBasedAnnotation performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).

func (*TokenBasedAnnotation) Annotate

func (a *TokenBasedAnnotation) Annotate(tokens []*Token) []*Token

Annotate groups tokens into pairs and then iterates over them to apply token annotations.

type TokenExistential

type TokenExistential interface {
	// True if the token text is all alphabetic.
	IsAlpha(*Token) bool
	// True if the token text is that of an ellipsis.
	IsEllipsis(*Token) bool
	// True if the token text is that of an initial.
	IsInitial(*Token) bool
	// True if the token text is that of a number that is part of a list.
	IsListNumber(*Token) bool
	// True if the token text is that of a number.
	IsNumber(*Token) bool
	// True if the token is either a number or is alphabetic.
	IsNonPunct(*Token) bool
	// True if the token is the first part of a coordinate.
	IsCoordinatePartOne(*Token) bool
	// True if the token is the second part of a coordinate.
	IsCoordinatePartTwo(*Token) bool
	// Does this token end with a period?
	HasPeriodFinal(*Token) bool
	// Does this token end with a punctuation and a quote?
	HasSentEndChars(*Token) bool
	// Does this token end with ambiguous punctuation?
	HasUnreliableEndChars(*Token) bool
}

TokenExistential are helpers to determine what type of token we are dealing with.

type TokenFirst

type TokenFirst interface {
	// True if the token's first character is lowercase
	FirstLower(*Token) bool
	// True if the token's first character is uppercase.
	FirstUpper(*Token) bool
}

TokenFirst are helpers to determine the case of the token's first letter

type TokenGrouper

type TokenGrouper interface {
	Group([]*Token) [][2]*Token
}

TokenGrouper groups two adjacent tokens together.

type TokenParser

type TokenParser interface {
	TokenType
	TokenFirst
	TokenExistential
}

TokenParser is the primary token interface that determines the context and type of a tokenized word.

type TokenType

type TokenType interface {
	Type(*Token) string
	// The type with its final period removed if it has one.
	TypeNoPeriod(*Token) string
	// The type with its final period removed if it is marked as a sentence break.
	TypeNoSentPeriod(*Token) string
}

TokenType are helpers to get the type of a token

type TypeBasedAnnotation

type TypeBasedAnnotation struct {
	*Storage
	PunctStrings
	TokenExistential
}

TypeBasedAnnotation performs the first pass of annotation, which makes decisions based purely on the word type of each word:

  • '?', '!', and '.' are marked as sentence breaks.
  • sequences of two or more periods are marked as ellipsis.
  • any word ending in '.' that's a known abbreviation is marked as an abbreviation.
  • any other word ending in '.' is marked as a sentence break.

These annotations correspond to three sets:

  • sentbreak_toks: The indices of all sentence breaks.
  • abbrev_toks: The indices of all abbreviations.
  • ellipsis_toks: The indices of all ellipsis marks.

func NewTypeBasedAnnotation

func NewTypeBasedAnnotation(s *Storage, p PunctStrings, e TokenExistential) *TypeBasedAnnotation

NewTypeBasedAnnotation creates an instance of the TypeBasedAnnotation struct

func (*TypeBasedAnnotation) Annotate

func (a *TypeBasedAnnotation) Annotate(tokens []*Token) []*Token

Annotate iterates over all tokens and applies the type annotation to them.

type WordTokenizer

type WordTokenizer interface {
	TokenParser
	Tokenize(string, bool) []*Token
}

WordTokenizer is the primary interface for tokenizing words

Directories

Path Synopsis
cmd
