tokenizer

package
Version: v2.10.0
Published: Aug 3, 2024 | License: MIT | Imports: 7 | Imported by: 84

Documentation

Overview

Package tokenizer is a Japanese morphological analyzer library.

Example (Tokenize_mode)
d, err := prepareTestDict()
if err != nil {
	panic(err)
}
t, err := New(d)
if err != nil {
	panic(err)
}
for _, mode := range []TokenizeMode{Normal, Search, Extended} {
	// Analyze the same input in each mode so the segmentations can be compared.
	tokens := t.Analyze("関西国際空港", mode)
	fmt.Printf("---%s---\n", mode)
	for _, token := range tokens {
		if token.Class == DUMMY {
			// BOS: Begin Of Sentence, EOS: End Of Sentence.
			fmt.Printf("%s\n", token.Surface)
			continue
		}
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
Output:

Constants

const (
	// DUMMY represents the dummy token.
	DUMMY = TokenClass(lattice.DUMMY)
	// KNOWN represents the token in the dictionary.
	KNOWN = TokenClass(lattice.KNOWN)
	// UNKNOWN represents the token which is not in the dictionary.
	UNKNOWN = TokenClass(lattice.UNKNOWN)
	// USER represents the token in the user dictionary.
	USER = TokenClass(lattice.USER)
)

Functions

func EqualFeatures (added in v2.8.0)

func EqualFeatures(lhs, rhs []string) bool

EqualFeatures returns true if the features are equal.
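
The comparison described above can be sketched as an element-wise check of two string slices. The function equalFeatures below is a hypothetical stand-in written for illustration; it is not the library's code, and the feature values are made up rather than taken from a real dictionary.

```go
package main

import "fmt"

// equalFeatures is a hypothetical stand-in, not the library's code.
// It reports whether two feature slices have the same length and the
// same elements in the same order.
func equalFeatures(lhs, rhs []string) bool {
	if len(lhs) != len(rhs) {
		return false
	}
	for i := range lhs {
		if lhs[i] != rhs[i] {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative feature values only.
	a := []string{"名詞", "一般"}
	b := []string{"名詞", "一般"}
	c := []string{"名詞", "固有名詞"}
	fmt.Println(equalFeatures(a, b)) // true
	fmt.Println(equalFeatures(a, c)) // false
}
```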

Types

type Option

type Option func(*Tokenizer) error

Option represents an option for the tokenizer.

func Nop

func Nop() Option

Nop represents a no operation option.

func OmitBosEos (added in v2.1.1)

func OmitBosEos() Option

OmitBosEos is a tokenizer option to omit BOS/EOS from output tokens.

func UserDict

func UserDict(d *dict.UserDict) Option

UserDict is a tokenizer option to set a user dictionary.

type Token

type Token struct {
	Index    int
	ID       int
	Class    TokenClass
	Position int // byte position
	Start    int
	End      int
	Surface  string
	// contains filtered or unexported fields
}

Token represents a morph of a sentence.
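
Position is documented as a byte position, which differs from a character offset for Japanese text because each kanji or kana occupies several bytes in UTF-8. The sketch below shows the two ways of counting for a pre-segmented input. It assumes that Start and End count runes, which the source does not state, and the span type and spans function are hypothetical helpers rather than part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// span is a hypothetical record used only for this illustration.
type span struct {
	surface   string
	bytePos   int // offset in bytes from the start of the input
	runeStart int // offset in runes (assumed meaning of Start)
	runeEnd   int // exclusive end offset in runes (assumed meaning of End)
}

// spans walks already segmented words and records both kinds of offset.
func spans(words []string) []span {
	var out []span
	bytePos, runePos := 0, 0
	for _, w := range words {
		n := utf8.RuneCountInString(w)
		out = append(out, span{w, bytePos, runePos, runePos + n})
		bytePos += len(w) // len counts bytes, not characters
		runePos += n
	}
	return out
}

func main() {
	// The segmentation is supplied by hand, not produced by the analyzer.
	for _, s := range spans([]string{"関西", "国際", "空港"}) {
		fmt.Printf("%s\tbyte=%d\trunes=[%d,%d)\n", s.surface, s.bytePos, s.runeStart, s.runeEnd)
	}
}
```

Each character in this input is three bytes long in UTF-8, so the byte offsets advance by six per two-character word while the rune offsets advance by two.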

func (Token) BaseForm

func (t Token) BaseForm() (string, bool)

BaseForm returns the base form feature if it exists.

func (Token) Equal (added in v2.1.1)

func (t Token) Equal(v Token) bool

Equal returns true if tokens are equal.

func (Token) EqualFeatures (added in v2.8.0)

func (t Token) EqualFeatures(tt Token) bool

EqualFeatures returns true if the features of the tokens are equal.

func (Token) EqualPOS (added in v2.8.0)

func (t Token) EqualPOS(tt Token) bool

EqualPOS returns true if the POS elements of the tokens are equal.

func (Token) FeatureAt

func (t Token) FeatureAt(i int) (string, bool)

FeatureAt returns the i-th feature if it exists.

func (Token) Features

func (t Token) Features() []string

Features returns contents of a token.

func (Token) InflectionalForm

func (t Token) InflectionalForm() (string, bool)

InflectionalForm returns the inflectional form feature if it exists.

func (Token) InflectionalType

func (t Token) InflectionalType() (string, bool)

InflectionalType returns the inflectional type feature if it exists.

func (Token) POS

func (t Token) POS() []string

POS returns POS elements of features.

func (Token) Pronunciation

func (t Token) Pronunciation() (string, bool)

Pronunciation returns the pronunciation feature if it exists.

func (Token) Reading

func (t Token) Reading() (string, bool)

Reading returns the reading feature if it exists.

func (Token) String

func (t Token) String() string

String returns a string representation of a token.

func (Token) UserExtra (added in v2.9.0)

func (t Token) UserExtra() *UserExtra

UserExtra returns extra data if the token comes from a user dictionary.

type TokenClass

type TokenClass lattice.NodeClass

TokenClass represents the token class.

func (TokenClass) String

func (c TokenClass) String() string

String returns a string representation of a token class.

type TokenData (added in v2.6.4)

type TokenData struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

TokenData is a data format with all the contents of the token.

func NewTokenData (added in v2.6.4)

func NewTokenData(t Token) TokenData

NewTokenData returns a TokenData holding all the contents of the token.

type TokenizeMode

type TokenizeMode int

TokenizeMode represents a mode of tokenize.

Kagome offers segmentation modes for search, similar to Kuromoji:

Normal: regular segmentation.
Search: uses a heuristic to perform additional segmentation that is useful for search.
Extended: similar to Search mode, but also splits unknown words into unigrams.
const (
	// Normal is the normal tokenize mode.
	Normal TokenizeMode = iota + 1
	// Search is the tokenize mode for search.
	Search
	// Extended is the experimental tokenize mode.
	Extended
	// BosEosID means the beginning of a sentence (BOS) or the end of a sentence (EOS).
	BosEosID = lattice.BosEosID
)

func (TokenizeMode) String

func (m TokenizeMode) String() string

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a morphological analyzer.

func New

func New(d *dict.Dict, opts ...Option) (*Tokenizer, error)

New creates a tokenizer.

func (Tokenizer) Analyze

func (t Tokenizer) Analyze(input string, mode TokenizeMode) []Token

Analyze tokenizes a sentence in the specified mode.

func (Tokenizer) AnalyzeGraph

func (t Tokenizer) AnalyzeGraph(w io.Writer, input string, mode TokenizeMode) []Token

AnalyzeGraph returns morphs of a sentence and exports a lattice graph to dot format.

func (Tokenizer) Dot

func (t Tokenizer) Dot(w io.Writer, input string) (tokens []Token)

Dot returns morphs of a sentence and exports a lattice graph to dot format in standard tokenize mode.

func (Tokenizer) Tokenize

func (t Tokenizer) Tokenize(input string) []Token

Tokenize analyzes a sentence in standard tokenize mode.

func (Tokenizer) Wakati

func (t Tokenizer) Wakati(input string) []string

Wakati tokenizes a sentence and returns its divided surface strings.

type UserExtra (added in v2.9.0)

type UserExtra struct {
	Tokens   []string
	Readings []string
}

UserExtra represents custom segmentation and custom reading for user entries.

Directories

Path	Synopsis
lattice	Package lattice implements the core of the morph analyzer.
mem	Package mem implements the memory utility such as memory pool.