split

Documentation

Overview

Package split provides an extensible interface for splitting strings in meaningful ways: words, sentences, paragraphs, and more.

The MIT License (MIT)

Copyright (c) 2015 Kevin S. Dias

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewIterTokenizer

func NewIterTokenizer(opts ...TokenizerOptFunc) (*iterTokenizer, error)

NewIterTokenizer constructs the default iterTokenizer, applying any provided TokenizerOptFunc options.
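
A minimal sketch of intended usage (the suffix list is illustrative, not the package default; the returned tokenizer's type is unexported, so only construction is shown):

t, err := NewIterTokenizer(
	UsingSuffixes([]string{")", "]"}),
)
if err != nil {
	panic(err)
}
_ = t // use t to tokenize text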

Types

type PragmaticSegmenter

type PragmaticSegmenter struct {
	// contains filtered or unexported fields
}

PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.

This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).

func NewPragmaticSegmenter

func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)

NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.

Languages are specified by their two-character ISO 639-1 codes. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)

func (*PragmaticSegmenter) Tokenize

func (p *PragmaticSegmenter) Tokenize(text string) []string

Tokenize splits text into sentences.
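
A minimal usage sketch, assuming English ("en") is among the supported languages listed above; the sample text is illustrative:

p, err := NewPragmaticSegmenter("en")
if err != nil {
	panic(err)
}
// Each element of the returned slice is one sentence.
for _, s := range p.Tokenize("He left at 3 p.m. She stayed behind.") {
	fmt.Println(s)
}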

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
	// contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() (*PunktSentenceTokenizer, error)

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Split

func (p PunktSentenceTokenizer) Split(text string) []string

Split splits text into sentences.
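
A minimal sketch of end-to-end use with the bundled English model (the input text is illustrative):

t, err := NewPunktSentenceTokenizer()
if err != nil {
	panic(err)
}
// Prints one string per detected sentence boundary.
fmt.Println(t.Split("They'll save and invest more. Thanks!"))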

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() (*RegexpTokenizer, error)

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

Example
t, err := NewBlanklineTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more.\n\nThanks!"))
Output:

[They'll save and invest more. Thanks!]

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) (*RegexpTokenizer, error)

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern to base the tokenizer on, a boolean indicating whether the pattern matches the gaps (separators) between tokens rather than the tokens themselves, and a boolean indicating whether to discard empty tokens.
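
For instance, this sketch builds a whitespace tokenizer by treating the pattern as the gaps between tokens and discarding any empty tokens (the pattern and flag values are illustrative):

// The pattern matches separators, not the tokens themselves.
t, err := NewRegexpTokenizer(`\s+`, true, true)
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))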

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() (*RegexpTokenizer, error)

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

Example
t, err := NewWordBoundaryTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output:

[They'll save and invest more]

func NewWordPunctTokenizer

func NewWordPunctTokenizer() (*RegexpTokenizer, error)

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.

Example
t, err := NewWordPunctTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output:

[They ' ll save and invest more .]

func (RegexpTokenizer) Split

func (r RegexpTokenizer) Split(text string) []string

Split splits text into a slice of tokens according to its regexp pattern.

type Splitter

type Splitter interface {
	Split(s string) []string
}

Splitter splits a string into substrings according to specially-defined rules.
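
Any tokenizer in this package whose Split method matches this signature (e.g., RegexpTokenizer, PunktSentenceTokenizer, TreebankWordTokenizer) can therefore be used interchangeably; a sketch (the helper name countTokens is hypothetical):

// countTokens works with any Splitter implementation.
func countTokens(s Splitter, text string) int {
	return len(s.Split(text))
}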

type TokenTester

type TokenTester func(string) bool

type TokenizerOptFunc

type TokenizerOptFunc func(*iterTokenizer)

func UsingContractions

func UsingContractions(x []string) TokenizerOptFunc

UsingContractions uses the provided list of contractions.

func UsingEmoticons

func UsingEmoticons(x map[string]int) TokenizerOptFunc

UsingEmoticons uses the provided map of emoticons.

func UsingIsUnsplittable

func UsingIsUnsplittable(x TokenTester) TokenizerOptFunc

UsingIsUnsplittable sets a function that tests whether a token is unsplittable.
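
For example, a sketch that marks URL-like tokens as unsplittable (the predicate is illustrative):

isURL := func(tok string) bool {
	return strings.HasPrefix(tok, "http://") || strings.HasPrefix(tok, "https://")
}
t, err := NewIterTokenizer(UsingIsUnsplittable(isURL))
if err != nil {
	panic(err)
}
_ = t // tokens matching isURL will not be split further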

func UsingPrefixes

func UsingPrefixes(x []string) TokenizerOptFunc

UsingPrefixes uses the provided prefixes.

func UsingSanitizer

func UsingSanitizer(x *strings.Replacer) TokenizerOptFunc

UsingSanitizer uses the provided sanitizer.

func UsingSpecialRE

func UsingSpecialRE(x *regexp.Regexp) TokenizerOptFunc

UsingSpecialRE uses the provided special regular expression for unsplittable tokens.

func UsingSplitCases

func UsingSplitCases(x []string) TokenizerOptFunc

UsingSplitCases uses the provided splitCases.

func UsingSuffixes

func UsingSuffixes(x []string) TokenizerOptFunc

UsingSuffixes uses the provided suffixes.
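
These options compose; a sketch passing several at once (all values are illustrative, not the package defaults):

t, err := NewIterTokenizer(
	// Normalize curly apostrophes before tokenizing.
	UsingSanitizer(strings.NewReplacer("\u2019", "'")),
	UsingPrefixes([]string{"(", "["}),
	UsingSuffixes([]string{")", "]"}),
)
if err != nil {
	panic(err)
}
_ = t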

type TreebankWordTokenizer

type TreebankWordTokenizer struct{}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() (*TreebankWordTokenizer, error)

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

Example
t, err := NewTreebankWordTokenizer()
if err != nil {
	panic(err)
}
fmt.Println(t.Split("They'll save and invest more."))
Output:

[They 'll save and invest more .]

func (TreebankWordTokenizer) Split

func (t TreebankWordTokenizer) Split(text string) []string

Split splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
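
Raw text is therefore typically segmented into sentences first; a sketch combining this tokenizer with PunktSentenceTokenizer (the input text is illustrative):

sents, err := NewPunktSentenceTokenizer()
if err != nil {
	panic(err)
}
words, err := NewTreebankWordTokenizer()
if err != nil {
	panic(err)
}
for _, s := range sents.Split("They'll save and invest more. Thanks!") {
	fmt.Println(words.Split(s))
}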
