sentences


README


Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Features

  • Supports multiple languages (English, Czech, Dutch, Estonian, Finnish, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, and Turkish)
  • Zero dependencies
  • Extendable
  • Fast

Install

arch

Install from the AUR.

mac

brew tap neurosnap/sentences
brew install sentences
other

You can also find pre-built binaries on the GitHub releases page.

using go

go get github.com/neurosnap/sentences
go install github.com/neurosnap/sentences/cmd/sentences@latest

Command line
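
A hypothetical invocation (this assumes the binary reads text on stdin and prints one sentence per line):

echo "Hi there. Does this really work?" | sentences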

Get it

go get github.com/neurosnap/sentences

Use it

package main

import (
    "fmt"
    "os"

    "github.com/neurosnap/sentences"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Download the training data from this repo (./data) and save it somewhere.
    b, err := os.ReadFile("./path/to/english.json")
    if err != nil {
        panic(err)
    }

    // Load the training data.
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }

    // Create the default sentence tokenizer.
    tokenizer := sentences.NewSentenceTokenizer(training)

    // Named sents to avoid shadowing the imported package.
    sents := tokenizer.Tokenize(text)

    for _, s := range sents {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some problems I noticed for English.

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}
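
This should print each sentence on its own line:

Hi there.
Does this really work?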

Contributing

I need help maintaining this library. If you are interested in contributing, please start by looking at the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.

Pick a particular failing test and submit an issue or PR for it.

I'm happy to help anyone willing to contribute.

Customize

sentences was built around composability; most major components of this package can be extended.

Eager to make ad-hoc changes but don't know how to start? Have a look at github.com/neurosnap/sentences/english for a solid example.
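
As a minimal sketch of one way to extend the tokenizer, the hypothetical annotation pass below appends to the tokenizer's Annotations list to force an extra abbreviation, using only the exported API described in the Documentation section (the abbreviation list itself is made up):

package main

import (
    "fmt"
    "os"
    "strings"

    "github.com/neurosnap/sentences"
)

// extraAbbrevs is a hypothetical annotation pass that keeps a few custom
// abbreviations from being treated as sentence breaks.
type extraAbbrevs struct {
    words map[string]bool
}

// Annotate satisfies the AnnotateTokens interface.
func (e *extraAbbrevs) Annotate(tokens []*sentences.Token) []*sentences.Token {
    for _, t := range tokens {
        if e.words[strings.ToLower(t.Tok)] {
            t.Abbr = true
            t.SentBreak = false
        }
    }
    return tokens
}

func main() {
    b, err := os.ReadFile("./path/to/english.json") // training data from ./data
    if err != nil {
        panic(err)
    }
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }

    tokenizer := sentences.NewSentenceTokenizer(training)
    // DefaultSentenceTokenizer exposes its Annotations slice; run the
    // custom pass after the defaults.
    tokenizer.Annotations = append(tokenizer.Annotations, &extraAbbrevs{
        words: map[string]bool{"approx.": true},
    })

    for _, s := range tokenizer.Tokenize("It costs approx. $10. Cheap.") {
        fmt.Println(s.Text)
    }
}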

Notice

I have not tested this tokenizer in any language besides English. By default, the command line utility loads English. I welcome anyone willing to test the other languages and submit updates as needed.

A primary goal for this package is to be multilingual so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this goal by training the tokenizer with text in a given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.

There are many problems that arise when tokenizing text into sentences, the primary issue being abbreviations: in the example above, the periods in "U.S. Rep." mark abbreviations, while the period ending "but Stallings won." marks a sentence boundary, and a single period can even do both. The Punkt system attempts to determine whether a word is an abbreviation, an end to a sentence, or even both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.

Original paper: Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2006).

Performance

Using the Brown Corpus, an annotated corpus of American English text, we compare this package with other libraries across multiple programming languages.

Library     Avg Speed (s, 10 runs)    Accuracy (%)
Sentences   1.96                      98.95
NLTK        5.22                      99.21

Documentation

Overview

Package sentences is a Go package that converts a blob of text into a list of sentences.

This package attempts to support a multitude of languages: Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, and Turkish.

An unsupervised multilingual sentence boundary detection library for Go. The goal of this library is to break up any text into a list of sentences, in multiple languages. The Punkt system accomplishes this goal by training the tokenizer with text in a given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.

There are many problems that arise when tokenizing text into sentences, the primary issue being abbreviations. The Punkt system attempts to determine whether a word is an abbreviation, an end to a sentence, or even both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.

Original research article: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsCjkPunct added in v1.0.7

func IsCjkPunct(r rune) bool

Types

type AnnotateTokens

type AnnotateTokens interface {
	Annotate([]*Token) []*Token
}

AnnotateTokens is an interface used for the sentence tokenizer to add properties to any given token during tokenization.

func NewAnnotations

func NewAnnotations(s *Storage, p PunctStrings, word WordTokenizer) []AnnotateTokens

NewAnnotations returns the default list of AnnotateTokens annotators used by the tokenizer.

type DefaultPunctStrings

type DefaultPunctStrings struct{}

DefaultPunctStrings are used to detect punctuation in the sentence tokenizer.

func NewPunctStrings

func NewPunctStrings() *DefaultPunctStrings

NewPunctStrings creates a default set of punctuation properties.

func (*DefaultPunctStrings) HasSentencePunct

func (p *DefaultPunctStrings) HasSentencePunct(text string) bool

HasSentencePunct reports whether the supplied text contains a known sentence punctuation character.

func (*DefaultPunctStrings) NonPunct

func (p *DefaultPunctStrings) NonPunct() string

NonPunct returns a regex string used to detect non-punctuation.

func (*DefaultPunctStrings) Punctuation

func (p *DefaultPunctStrings) Punctuation() string

Punctuation returns the punctuation characters.

type DefaultSentenceTokenizer

type DefaultSentenceTokenizer struct {
	*Storage
	WordTokenizer
	PunctStrings
	Annotations []AnnotateTokens
}

DefaultSentenceTokenizer is a sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.

func NewSentenceTokenizer

func NewSentenceTokenizer(s *Storage) *DefaultSentenceTokenizer

NewSentenceTokenizer creates a sentence tokenizer with sane defaults.

func NewTokenizer

func NewTokenizer(s *Storage, word WordTokenizer, lang PunctStrings) *DefaultSentenceTokenizer

NewTokenizer wraps DefaultSentenceTokenizer, doing the work of customizing the tokenizer.

func (*DefaultSentenceTokenizer) AnnotateTokens

func (s *DefaultSentenceTokenizer) AnnotateTokens(tokens []*Token, annotate ...AnnotateTokens) []*Token

AnnotateTokens, given a set of tokens augmented with markers for line-start and paragraph-start, returns those tokens with full annotation, including predicted sentence breaks.

func (*DefaultSentenceTokenizer) AnnotatedTokens

func (s *DefaultSentenceTokenizer) AnnotatedTokens(text string) []*Token

AnnotatedTokens returns the fully annotated word tokens, which allows for ad hoc adjustments to the tokens.

func (*DefaultSentenceTokenizer) SentencePositions

func (s *DefaultSentenceTokenizer) SentencePositions(text string) []int

SentencePositions returns an array of sentence-boundary positions instead of an array of sentences.
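
A minimal sketch of both helpers (this assumes, as the README example suggests, that english.NewSentenceTokenizer returns the default tokenizer; the sample text is arbitrary):

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    text := "Mr. Smith arrived. He left."

    // Inspect the fully annotated word tokens.
    for _, t := range tokenizer.AnnotatedTokens(text) {
        fmt.Printf("%q sentBreak=%v abbr=%v\n", t.Tok, t.SentBreak, t.Abbr)
    }

    // Boundary positions instead of Sentence structs.
    fmt.Println(tokenizer.SentencePositions(text))
}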

func (*DefaultSentenceTokenizer) Tokenize

func (s *DefaultSentenceTokenizer) Tokenize(text string) []*Sentence

Tokenize splits text input into sentence tokens.

type DefaultTokenGrouper

type DefaultTokenGrouper struct{}

DefaultTokenGrouper is the default implementation of TokenGrouper

func (*DefaultTokenGrouper) Group

func (p *DefaultTokenGrouper) Group(tokens []*Token) [][2]*Token

Group is the primary logic for implementing TokenGrouper

type DefaultWordTokenizer

type DefaultWordTokenizer struct {
	PunctStrings
}

DefaultWordTokenizer is the default implementation of the WordTokenizer

func NewWordTokenizer

func NewWordTokenizer(p PunctStrings) *DefaultWordTokenizer

NewWordTokenizer creates a new DefaultWordTokenizer

func (*DefaultWordTokenizer) FirstLower

func (p *DefaultWordTokenizer) FirstLower(t *Token) bool

FirstLower is true if the token's first character is lowercase

func (*DefaultWordTokenizer) FirstUpper

func (p *DefaultWordTokenizer) FirstUpper(t *Token) bool

FirstUpper is true if the token's first character is uppercase.

func (*DefaultWordTokenizer) HasPeriodFinal

func (p *DefaultWordTokenizer) HasPeriodFinal(t *Token) bool

HasPeriodFinal is true if the last character in the word is a period

func (*DefaultWordTokenizer) HasSentEndChars

func (p *DefaultWordTokenizer) HasSentEndChars(t *Token) bool

HasSentEndChars finds any sentence-ending punctuation, excluding a final period.

func (*DefaultWordTokenizer) HasUnreliableEndChars added in v1.1.0

func (p *DefaultWordTokenizer) HasUnreliableEndChars(t *Token) bool

HasUnreliableEndChars finds any punctuation that might mark the end of a sentence but does not have to.

func (*DefaultWordTokenizer) IsAlpha

func (p *DefaultWordTokenizer) IsAlpha(t *Token) bool

IsAlpha is true if the token text is all alphabetic.

func (*DefaultWordTokenizer) IsCoordinatePartOne added in v1.1.2

func (p *DefaultWordTokenizer) IsCoordinatePartOne(t *Token) bool

IsCoordinatePartOne is true if the token text might be the first part of a coordinate.

func (*DefaultWordTokenizer) IsCoordinatePartTwo added in v1.1.2

func (p *DefaultWordTokenizer) IsCoordinatePartTwo(t *Token) bool

IsCoordinatePartTwo is true if the token text might be the second part of a coordinate.

func (*DefaultWordTokenizer) IsEllipsis

func (p *DefaultWordTokenizer) IsEllipsis(t *Token) bool

IsEllipsis is true if the token text is that of an ellipsis.

func (*DefaultWordTokenizer) IsInitial

func (p *DefaultWordTokenizer) IsInitial(t *Token) bool

IsInitial is true if the token text is that of an initial.

func (*DefaultWordTokenizer) IsListNumber added in v1.1.0

func (p *DefaultWordTokenizer) IsListNumber(t *Token) bool

IsListNumber is true if the token text is that of a list number.

func (*DefaultWordTokenizer) IsNonPunct

func (p *DefaultWordTokenizer) IsNonPunct(t *Token) bool

IsNonPunct is true if the token is either a number or is alphabetic.

func (*DefaultWordTokenizer) IsNumber

func (p *DefaultWordTokenizer) IsNumber(t *Token) bool

IsNumber is true if the token text is that of a number.

func (*DefaultWordTokenizer) Tokenize

func (p *DefaultWordTokenizer) Tokenize(text string, onlyPeriodContext bool) []*Token

Tokenize breaks text into words while preserving each word's character position and whether it starts a new line or a new paragraph.

func (*DefaultWordTokenizer) Type

func (p *DefaultWordTokenizer) Type(t *Token) string

Type returns a case-normalized representation of the token.

func (*DefaultWordTokenizer) TypeNoPeriod

func (p *DefaultWordTokenizer) TypeNoPeriod(t *Token) string

TypeNoPeriod is the type with its final period removed if it has one.

func (*DefaultWordTokenizer) TypeNoSentPeriod

func (p *DefaultWordTokenizer) TypeNoSentPeriod(t *Token) string

TypeNoSentPeriod is the type with its final period removed if it is marked as a sentence break.

type Ortho

type Ortho interface {
	Heuristic(*Token) int
}

Ortho is an interface for structs that implement an orthographic heuristic method.

type OrthoContext

type OrthoContext struct {
	*Storage
	PunctStrings
	TokenType
	TokenFirst
}

OrthoContext determines whether a token is capitalized, a sentence starter, etc.

func (*OrthoContext) Heuristic

func (o *OrthoContext) Heuristic(token *Token) int

Heuristic decides whether the given token is the first token in a sentence.

type PunctStrings

type PunctStrings interface {
	NonPunct() string
	Punctuation() string
	HasSentencePunct(string) bool
}

PunctStrings implements all the functions necessary for punctuation strings. They are used to detect punctuation in the sentence tokenizer.

type Sentence

type Sentence struct {
	Start int    `json:"start"`
	End   int    `json:"end"`
	Text  string `json:"text"`
}

Sentence is a container for a single sentence, providing its character positions as well as its text.

func (Sentence) String

func (s Sentence) String() string

type SentenceTokenizer

type SentenceTokenizer interface {
	AnnotateTokens([]*Token, ...AnnotateTokens) []*Token
	Tokenize(string) []*Sentence
}

SentenceTokenizer is the interface used by the Tokenize function; it can be extended to correct sentence boundaries that Punkt misses.

type SetString

type SetString map[string]int

SetString is an implementation of a set of strings; probably not the best way to do this, but oh well.

func (SetString) Add

func (ss SetString) Add(str string)

Add adds a string key to the set

func (SetString) Array

func (ss SetString) Array() []string

Array returns an array of keys from the set.

func (SetString) Has

func (ss SetString) Has(str string) bool

Has checks whether a key exists in the set

func (SetString) Remove

func (ss SetString) Remove(str string)

Remove deletes a string key from the set
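
A tiny usage sketch, using only the methods listed above:

package main

import (
    "fmt"

    "github.com/neurosnap/sentences"
)

func main() {
    ss := sentences.SetString{}
    ss.Add("etc")
    fmt.Println(ss.Has("etc")) // true
    fmt.Println(ss.Array())    // [etc]
    ss.Remove("etc")
    fmt.Println(ss.Has("etc")) // false
}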

type Storage

type Storage struct {
	AbbrevTypes  SetString `json:"AbbrevTypes"`
	Collocations SetString `json:"Collocations"`
	SentStarters SetString `json:"SentStarters"`
	OrthoContext SetString `json:"OrthoContext"`
}

Storage stores data used to perform sentence boundary detection with Punkt. This is where all the training data gets stored for future use.

func LoadTraining

func LoadTraining(data []byte) (*Storage, error)

LoadTraining is the primary function to load JSON training data. By default, the sentence tokenizer loads English automatically, but other languages can be loaded into a binary file using the `make <lang>` command.

func NewStorage

func NewStorage() *Storage

NewStorage creates the default storage container

func (*Storage) IsAbbr

func (p *Storage) IsAbbr(tokens ...string) bool

IsAbbr determines whether any of the tokens are an abbreviation.
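
A short sketch (the training-data path is the same placeholder as in the README example, and the exact key format for learned abbreviations is an assumption):

package main

import (
    "fmt"
    "os"

    "github.com/neurosnap/sentences"
)

func main() {
    b, err := os.ReadFile("./path/to/english.json")
    if err != nil {
        panic(err)
    }
    training, err := sentences.LoadTraining(b)
    if err != nil {
        panic(err)
    }
    // Reports whether any of the supplied tokens was learned as an abbreviation.
    fmt.Println(training.IsAbbr("mr", "mrs"))
}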

type Token

type Token struct {
	Tok       string
	Position  int
	SentBreak bool
	ParaStart bool
	LineStart bool
	Abbr      bool
	// contains filtered or unexported fields
}

Token stores a token of text with annotations produced during sentence boundary detection.

func NewToken

func NewToken(token string) *Token

NewToken creates a Token with default values.

func (*Token) String

func (p *Token) String() string

String is the string representation of Token

type TokenBasedAnnotation

type TokenBasedAnnotation struct {
	*Storage
	PunctStrings
	TokenParser
	TokenGrouper
	Ortho
}

TokenBasedAnnotation performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).

func (*TokenBasedAnnotation) Annotate

func (a *TokenBasedAnnotation) Annotate(tokens []*Token) []*Token

Annotate groups tokens into pairs and then iterates over them to apply token annotations.

type TokenExistential

type TokenExistential interface {
	// True if the token text is all alphabetic.
	IsAlpha(*Token) bool
	// True if the token text is that of an ellipsis.
	IsEllipsis(*Token) bool
	// True if the token text is that of an initial.
	IsInitial(*Token) bool
	// True if the token text is that of a number that is part of a list.
	IsListNumber(*Token) bool
	// True if the token text is that of a number.
	IsNumber(*Token) bool
	// True if the token is either a number or is alphabetic.
	IsNonPunct(*Token) bool
	// True if the token is the first part of a coordinate.
	IsCoordinatePartOne(*Token) bool
	// True if the token is the second part of a coordinate.
	IsCoordinatePartTwo(*Token) bool
	// Does this token end with a period?
	HasPeriodFinal(*Token) bool
	// Does this token end with a punctuation and a quote?
	HasSentEndChars(*Token) bool
	// Does this token end with ambiguous punctuation?
	HasUnreliableEndChars(*Token) bool
}

TokenExistential are helpers to determine what type of token we are dealing with.

type TokenFirst

type TokenFirst interface {
	// True if the token's first character is lowercase
	FirstLower(*Token) bool
	// True if the token's first character is uppercase.
	FirstUpper(*Token) bool
}

TokenFirst are helpers to determine the case of the token's first letter

type TokenGrouper

type TokenGrouper interface {
	Group([]*Token) [][2]*Token
}

TokenGrouper groups two adjacent tokens together.

type TokenParser

type TokenParser interface {
	TokenType
	TokenFirst
	TokenExistential
}

TokenParser is the primary token interface that determines the context and type of a tokenized word.

type TokenType

type TokenType interface {
	Type(*Token) string
	// The type with its final period removed if it has one.
	TypeNoPeriod(*Token) string
	// The type with its final period removed if it is marked as a sentence break.
	TypeNoSentPeriod(*Token) string
}

TokenType are helpers to get the type of a token

type TypeBasedAnnotation

type TypeBasedAnnotation struct {
	*Storage
	PunctStrings
	TokenExistential
}

TypeBasedAnnotation performs the first pass of annotation, which makes decisions based purely on the word type of each word:

  • '?', '!', and '.' are marked as sentence breaks.
  • sequences of two or more periods are marked as ellipsis.
  • any word ending in '.' that's a known abbreviation is marked as an abbreviation.
  • any other word ending in '.' is marked as a sentence break.

These annotations correspond to three sets:

  • sentbreak_toks: The indices of all sentence breaks.
  • abbrev_toks: The indices of all abbreviations.
  • ellipsis_toks: The indices of all ellipsis marks.

func NewTypeBasedAnnotation

func NewTypeBasedAnnotation(s *Storage, p PunctStrings, e TokenExistential) *TypeBasedAnnotation

NewTypeBasedAnnotation creates an instance of the TypeBasedAnnotation struct

func (*TypeBasedAnnotation) Annotate

func (a *TypeBasedAnnotation) Annotate(tokens []*Token) []*Token

Annotate iterates over all tokens and applies the type annotation to them.

type WordTokenizer

type WordTokenizer interface {
	TokenParser
	Tokenize(string, bool) []*Token
}

WordTokenizer is the primary interface for tokenizing words

Directories

Path Synopsis
cmd
