analysis

package
v0.1.3
Warning

This package is not in the latest version of its module.

Published: Oct 6, 2020 License: Apache-2.0 Imports: 9 Imported by: 50

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func BuildTermFromRunes

func BuildTermFromRunes(runes []rune) []byte

func BuildTermFromRunesOptimistic

func BuildTermFromRunesOptimistic(buf []byte, runes []rune) []byte

BuildTermFromRunesOptimistic builds a term from the provided runes and optimistically attempts to encode it into the provided buffer. If at any point the buffer appears to be too small, a new buffer is allocated and used instead. Use this function when the new term is frequently the same length as, or shorter than, the original term (measured in bytes).
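The optimistic strategy described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the name buildTermOptimistic and the growth policy (worst-case size for the remaining runes) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// buildTermOptimistic is a hypothetical stand-in that illustrates the
// behavior described for BuildTermFromRunesOptimistic. It encodes runes
// into buf and only allocates when buf turns out to be too small.
func buildTermOptimistic(buf []byte, runes []rune) []byte {
	out := buf
	used := 0
	for i, r := range runes {
		need := utf8.RuneLen(r)
		if need < 0 {
			// Invalid runes are encoded as U+FFFD, which takes 3 bytes.
			need = utf8.RuneLen(utf8.RuneError)
		}
		if used+need > len(out) {
			// Assumed growth policy: allocate the worst case for the
			// runes that remain, then copy what was already written.
			grown := make([]byte, used+(len(runes)-i)*utf8.UTFMax)
			copy(grown, out[:used])
			out = grown
		}
		used += utf8.EncodeRune(out[used:], r)
	}
	return out[:used]
}

func main() {
	buf := make([]byte, 5)

	short := buildTermOptimistic(buf, []rune("cat"))
	fmt.Printf("%s reusedBuffer=%v\n", short, &short[0] == &buf[0])

	long := buildTermOptimistic(buf, []rune("naïveté"))
	fmt.Printf("%s reusedBuffer=%v\n", long, &long[0] == &buf[0])
}
```

When the term fits, the returned slice aliases the caller's buffer, so the caller should not assume the buffer's previous contents survive the call.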

func DeleteRune

func DeleteRune(in []rune, pos int) []rune

func InsertRune

func InsertRune(in []rune, pos int, r rune) []rune
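The page gives only the signatures of DeleteRune and InsertRune. The sketch below shows one plausible reading of them using hypothetical lowercase stand-ins. Whether the package copies the slice or modifies it in place, and how it treats out-of-range positions, are not stated here, so the copying behavior and bounds handling are assumptions.

```go
package main

import "fmt"

// insertRune returns a new slice with r inserted at index pos.
// Assumption: the caller guarantees 0 <= pos <= len(in).
func insertRune(in []rune, pos int, r rune) []rune {
	out := make([]rune, 0, len(in)+1)
	out = append(out, in[:pos]...)
	out = append(out, r)
	return append(out, in[pos:]...)
}

// deleteRune returns a new slice without the rune at index pos.
// Assumption: an out-of-range pos returns the input unchanged.
func deleteRune(in []rune, pos int) []rune {
	if pos < 0 || pos >= len(in) {
		return in
	}
	out := make([]rune, 0, len(in)-1)
	out = append(out, in[:pos]...)
	return append(out, in[pos+1:]...)
}

func main() {
	word := []rune("cat")
	longer := insertRune(word, 1, 'h')
	fmt.Println(string(longer))
	fmt.Println(string(deleteRune(longer, 1)))
}
```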

func RunesEndsWith

func RunesEndsWith(input []rune, suffix string) bool

func TruncateRunes

func TruncateRunes(input []byte, num int) []byte

Types

type Analyzer

type Analyzer struct {
	CharFilters  []CharFilter
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

func (*Analyzer) Analyze

func (a *Analyzer) Analyze(input []byte) TokenStream
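The Analyzer struct groups three stages, and the sketch below shows how they would typically be chained: character filters first, then the tokenizer, then token filters. It uses local copies of the types listed on this page so that it runs without the package. The ordering inside analyze and the three small stage implementations (dashToSpace, fieldsTokenizer, lowercase) are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// Local stand-ins mirroring the declarations on this page.
type Token struct {
	Start, End   int
	Term         []byte
	PositionIncr int
}
type TokenStream []*Token

type CharFilter interface{ Filter([]byte) []byte }
type Tokenizer interface{ Tokenize([]byte) TokenStream }
type TokenFilter interface{ Filter(TokenStream) TokenStream }

type Analyzer struct {
	CharFilters  []CharFilter
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

// analyze applies the stages in the assumed order:
// char filters, then the tokenizer, then token filters.
func analyze(a *Analyzer, input []byte) TokenStream {
	for _, cf := range a.CharFilters {
		input = cf.Filter(input)
	}
	tokens := a.Tokenizer.Tokenize(input)
	for _, tf := range a.TokenFilters {
		tokens = tf.Filter(tokens)
	}
	return tokens
}

// dashToSpace is a hypothetical CharFilter that turns '-' into ' '.
type dashToSpace struct{}

func (dashToSpace) Filter(in []byte) []byte {
	return bytes.ReplaceAll(in, []byte("-"), []byte(" "))
}

// fieldsTokenizer is a hypothetical Tokenizer that splits on whitespace.
// Offsets are left at zero to keep the example short.
type fieldsTokenizer struct{}

func (fieldsTokenizer) Tokenize(in []byte) TokenStream {
	var ts TokenStream
	for _, f := range bytes.Fields(in) {
		ts = append(ts, &Token{Term: f, PositionIncr: 1})
	}
	return ts
}

// lowercase is a hypothetical TokenFilter that lowercases each term.
type lowercase struct{}

func (lowercase) Filter(ts TokenStream) TokenStream {
	for _, t := range ts {
		t.Term = bytes.ToLower(t.Term)
	}
	return ts
}

// termString joins the terms of a stream with single spaces.
func termString(ts TokenStream) string {
	parts := make([][]byte, 0, len(ts))
	for _, t := range ts {
		parts = append(parts, t.Term)
	}
	return string(bytes.Join(parts, []byte(" ")))
}

func main() {
	a := &Analyzer{
		CharFilters:  []CharFilter{dashToSpace{}},
		Tokenizer:    fieldsTokenizer{},
		TokenFilters: []TokenFilter{lowercase{}},
	}
	fmt.Println(termString(analyze(a, []byte("Full-Text Search"))))
}
```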

type CharFilter

type CharFilter interface {
	Filter([]byte) []byte
}

type Token

type Token struct {
	// Start specifies the byte offset of the beginning of the term in the
	// field.
	Start int

	// End specifies the byte offset of the end of the term in the field.
	End  int
	Term []byte

	// PositionIncr specifies the position of this token relative to the previous.
	PositionIncr int
	Type         TokenType
	KeyWord      bool
}

Token represents one occurrence of a term at a particular location in a field.

func (*Token) String

func (t *Token) String() string

type TokenFilter

type TokenFilter interface {
	Filter(TokenStream) TokenStream
}

A TokenFilter adds, transforms or removes tokens from a token stream.
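As an illustration of a filter that removes tokens, the sketch below drops stop words from a stream. The types are local stand-ins for the ones on this page, and stopFilter is a hypothetical name. Adding the removed token's PositionIncr to the next surviving token is one possible design choice for preserving positions; the page does not say how the package's own filters handle this.

```go
package main

import "fmt"

// Local stand-ins mirroring the declarations on this page.
type Token struct {
	Term         []byte
	PositionIncr int
}
type TokenStream []*Token

// stopFilter is a hypothetical TokenFilter that removes listed terms.
type stopFilter struct {
	stop map[string]bool
}

// Filter drops stop words and carries their position increments forward,
// so the surviving tokens keep their original relative positions.
func (f stopFilter) Filter(in TokenStream) TokenStream {
	out := make(TokenStream, 0, len(in))
	skipped := 0
	for _, t := range in {
		if f.stop[string(t.Term)] {
			skipped += t.PositionIncr
			continue
		}
		t.PositionIncr += skipped
		skipped = 0
		out = append(out, t)
	}
	return out
}

func main() {
	in := TokenStream{
		{Term: []byte("the"), PositionIncr: 1},
		{Term: []byte("quick"), PositionIncr: 1},
		{Term: []byte("fox"), PositionIncr: 1},
	}
	f := stopFilter{stop: map[string]bool{"the": true}}
	for _, t := range f.Filter(in) {
		fmt.Printf("%s incr=%d\n", t.Term, t.PositionIncr)
	}
}
```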

type TokenFreq

type TokenFreq struct {
	TermVal   []byte
	Locations []*TokenLocation
	// contains filtered or unexported fields
}

TokenFreq represents all the occurrences of a term in all fields of a document.

func (*TokenFreq) EachLocation

func (tf *TokenFreq) EachLocation(location segment.VisitLocation)

func (*TokenFreq) Frequency

func (tf *TokenFreq) Frequency() int

func (*TokenFreq) Size

func (tf *TokenFreq) Size() int

func (*TokenFreq) Term

func (tf *TokenFreq) Term() []byte

type TokenFrequencies

type TokenFrequencies map[string]*TokenFreq

TokenFrequencies maps document terms to their combined frequencies from all fields.

func TokenFrequency

func TokenFrequency(tokens TokenStream, includeTermVectors bool, startOffset int) (
	tokenFreqs TokenFrequencies, position int)
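The signature suggests that TokenFrequency groups a token stream by term, tracks a running position that starts at startOffset, and returns the final position. The sketch below illustrates that reading with simplified local types. countTerms and termFreq are hypothetical names, and the accumulation rule (add each token's PositionIncr before recording it) is an assumption rather than documented behavior.

```go
package main

import "fmt"

// Token is a local stand-in with only the fields this example needs.
type Token struct {
	Term         []byte
	PositionIncr int
}

// termFreq is a simplified, hypothetical counterpart of TokenFreq.
type termFreq struct {
	count     int
	positions []int
}

// countTerms groups tokens by term. Positions are recorded only when
// includeLocations is true, mirroring the includeTermVectors flag.
func countTerms(tokens []*Token, includeLocations bool, startOffset int) (map[string]*termFreq, int) {
	freqs := make(map[string]*termFreq, len(tokens))
	position := startOffset
	for _, t := range tokens {
		position += t.PositionIncr
		key := string(t.Term)
		tf, ok := freqs[key]
		if !ok {
			tf = &termFreq{}
			freqs[key] = tf
		}
		tf.count++
		if includeLocations {
			tf.positions = append(tf.positions, position)
		}
	}
	return freqs, position
}

func main() {
	var tokens []*Token
	for _, w := range []string{"to", "be", "or", "not", "to", "be"} {
		tokens = append(tokens, &Token{Term: []byte(w), PositionIncr: 1})
	}
	freqs, last := countTerms(tokens, true, 0)
	fmt.Println("to:", freqs["to"].count, freqs["to"].positions)
	fmt.Println("last position:", last)
}
```

Returning the final position lets a caller pass it back in as startOffset for the next stream, which would keep positions increasing across several values of the same field.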

func (TokenFrequencies) MergeAll

func (tfs TokenFrequencies) MergeAll(remoteField string, other TokenFrequencies)

func (TokenFrequencies) MergeOneBytes

func (tfs TokenFrequencies) MergeOneBytes(remoteField string, tfk []byte, tf *TokenFreq)

func (TokenFrequencies) Size

func (tfs TokenFrequencies) Size() int

type TokenLocation

type TokenLocation struct {
	FieldVal    string
	StartVal    int
	EndVal      int
	PositionVal int
}

TokenLocation represents one occurrence of a term at a particular location in a field. Start, End and Position have the same meaning as in analysis.Token. Field and ArrayPositions identify the field value in the source document. See document.Field for details.

func (*TokenLocation) End

func (tl *TokenLocation) End() int

func (*TokenLocation) Field

func (tl *TokenLocation) Field() string

func (*TokenLocation) Pos

func (tl *TokenLocation) Pos() int

func (*TokenLocation) Size

func (tl *TokenLocation) Size() int

func (*TokenLocation) Start

func (tl *TokenLocation) Start() int

type TokenMap

type TokenMap map[string]bool

func NewTokenMap

func NewTokenMap() TokenMap

func (TokenMap) AddToken

func (t TokenMap) AddToken(token string)

func (TokenMap) LoadBytes

func (t TokenMap) LoadBytes(data []byte)

LoadBytes reads in a list of tokens from memory, one per line. Comments are supported using `#` or `|`.
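The token-list format can be sketched as follows. This example does not call the package; loadTokens is a hypothetical helper, and it assumes that a comment marker may appear anywhere on a line and runs to the end of that line, which the description above does not spell out.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// loadTokens parses one token per line into a set.
// Assumption: text from the first '#' or '|' to the end of the line
// is a comment, and blank lines are ignored.
func loadTokens(data []byte) map[string]bool {
	tokens := make(map[string]bool)
	sc := bufio.NewScanner(bytes.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexAny(line, "#|"); i >= 0 {
			line = line[:i]
		}
		if tok := strings.TrimSpace(line); tok != "" {
			tokens[tok] = true
		}
	}
	return tokens
}

func main() {
	data := []byte("# English stop words\nthe\nand | conjunction\n\n  of  \n")
	fmt.Println(loadTokens(data))
}
```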

func (TokenMap) LoadFile

func (t TokenMap) LoadFile(filename string) error

LoadFile reads in a list of tokens from a text file, one per line. Comments are supported using `#` or `|`.

func (TokenMap) LoadLine

func (t TokenMap) LoadLine(line string)

type TokenStream

type TokenStream []*Token

type TokenType

type TokenType int

const (
	AlphaNumeric TokenType = iota
	Ideographic
	Numeric
	DateTime
	Shingle
	Single
	Double
	Boolean
)

type Tokenizer

type Tokenizer interface {
	Tokenize([]byte) TokenStream
}

A Tokenizer splits an input string into tokens, the usual behavior being to map words to tokens.
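A minimal word-per-token Tokenizer might look like the sketch below. It uses local copies of Token, TokenStream, and Tokenizer, and whitespaceTokenizer is a hypothetical implementation written for this example. It fills Start and End with byte offsets, as the Token field comments describe, and assumes a PositionIncr of 1 for every token.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Local stand-ins mirroring the declarations on this page.
type Token struct {
	Start, End   int
	Term         []byte
	PositionIncr int
}
type TokenStream []*Token
type Tokenizer interface{ Tokenize([]byte) TokenStream }

// whitespaceTokenizer is a hypothetical Tokenizer that emits one token
// per run of non-space runes.
type whitespaceTokenizer struct{}

func (whitespaceTokenizer) Tokenize(input []byte) TokenStream {
	var out TokenStream
	emit := func(start, end int) {
		out = append(out, &Token{
			Start:        start,
			End:          end,
			Term:         input[start:end],
			PositionIncr: 1,
		})
	}
	start := -1 // byte offset of the current token, or -1 between tokens
	for i := 0; i < len(input); {
		r, size := utf8.DecodeRune(input[i:])
		if unicode.IsSpace(r) {
			if start >= 0 {
				emit(start, i)
				start = -1
			}
		} else if start < 0 {
			start = i
		}
		i += size
	}
	if start >= 0 {
		emit(start, len(input))
	}
	return out
}

func main() {
	var tk Tokenizer = whitespaceTokenizer{}
	for _, t := range tk.Tokenize([]byte("héllo wörld")) {
		fmt.Printf("%q start=%d end=%d incr=%d\n", t.Term, t.Start, t.End, t.PositionIncr)
	}
}
```

Because the offsets count bytes rather than runes, a term containing multi-byte characters spans more offsets than it has characters.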

Directories

Path Synopsis
lang
ar
bg
ca
cjk
ckb
cs
da
de
el
en
Package en implements an analyzer with reasonable defaults for processing English text.
es
eu
fa
fi
fr
ga
gl
hi
hu
hy
id
in
it
nl
no
pt
ro
ru
sv
tr
Package lowercase implements a TokenFilter which converts tokens to lower case according to unicode rules.
