Documentation ¶
Overview ¶
Package tokenizer is a Japanese morphological analyzer library.
Example (Tokenize_mode) ¶
d, err := prepareTestDict()
if err != nil {
	panic(err)
}
t, err := New(d)
if err != nil {
	panic(err)
}
for _, mode := range []TokenizeMode{Normal, Search, Extended} {
	tokens := t.Analyze("関西国際空港", mode)
	fmt.Printf("---%s---\n", mode)
	for _, token := range tokens {
		if token.Class == DUMMY {
			// BOS: Begin Of Sentence, EOS: End Of Sentence.
			fmt.Printf("%s\n", token.Surface)
			continue
		}
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
Output:
Index ¶
- Constants
- func EqualFeatures(lhs, rhs []string) bool
- type Option
- type Token
- func (t Token) BaseForm() (string, bool)
- func (t Token) Equal(v Token) bool
- func (t Token) EqualFeatures(tt Token) bool
- func (t Token) EqualPOS(tt Token) bool
- func (t Token) FeatureAt(i int) (string, bool)
- func (t Token) Features() []string
- func (t Token) InflectionalForm() (string, bool)
- func (t Token) InflectionalType() (string, bool)
- func (t Token) POS() []string
- func (t Token) Pronunciation() (string, bool)
- func (t Token) Reading() (string, bool)
- func (t Token) String() string
- func (t Token) UserExtra() *UserExtra
- type TokenClass
- type TokenData
- type TokenizeMode
- type Tokenizer
- func (t Tokenizer) Analyze(input string, mode TokenizeMode) []Token
- func (t Tokenizer) AnalyzeGraph(w io.Writer, input string, mode TokenizeMode) []Token
- func (t Tokenizer) Dot(w io.Writer, input string) (tokens []Token)
- func (t Tokenizer) Tokenize(input string) []Token
- func (t Tokenizer) Wakati(input string) []string
- type UserExtra
Examples ¶
Constants ¶
const (
	// DUMMY represents the dummy token.
	DUMMY = TokenClass(lattice.DUMMY)
	// KNOWN represents the token in the dictionary.
	KNOWN = TokenClass(lattice.KNOWN)
	// UNKNOWN represents the token which is not in the dictionary.
	UNKNOWN = TokenClass(lattice.UNKNOWN)
	// USER represents the token in the user dictionary.
	USER = TokenClass(lattice.USER)
)
Variables ¶
This section is empty.
Functions ¶
func EqualFeatures ¶ added in v2.8.0
EqualFeatures returns true if the features are equal.
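The documented behavior can be sketched with a self-contained, element-wise comparison; the lowercase `equalFeatures` below is an illustrative stand-in, assumed to match the library's semantics (same length, same string at every index), not the library's own code.

```go
package main

import "fmt"

// equalFeatures is a sketch of an element-wise feature comparison:
// two feature lists are equal when they have the same length and the
// same string at every index.
func equalFeatures(lhs, rhs []string) bool {
	if len(lhs) != len(rhs) {
		return false
	}
	for i, v := range lhs {
		if v != rhs[i] {
			return false
		}
	}
	return true
}

func main() {
	a := []string{"名詞", "固有名詞", "組織"}
	b := []string{"名詞", "固有名詞", "組織"}
	c := []string{"名詞", "一般"}
	fmt.Println(equalFeatures(a, b)) // true
	fmt.Println(equalFeatures(a, c)) // false
}
```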
Types ¶
type Option ¶
Option represents an option for the tokenizer.
func OmitBosEos ¶ added in v2.1.1
func OmitBosEos() Option
OmitBosEos is a tokenizer option to omit BOS/EOS from output tokens.
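`Option` values like `OmitBosEos` follow Go's functional-option pattern: each option is a function applied to the tokenizer at construction time. The sketch below uses local lowercase stand-in types (not the library's) to show the shape of that pattern.

```go
package main

import "fmt"

// tokenizer and option are illustrative stand-ins for the library's
// Tokenizer and Option types.
type tokenizer struct{ omitBosEos bool }

type option func(*tokenizer)

// omitBosEos mirrors the idea of OmitBosEos: it returns an option that
// tells the tokenizer to drop BOS/EOS dummy tokens from its output.
func omitBosEos() option {
	return func(t *tokenizer) { t.omitBosEos = true }
}

// newTokenizer applies each option to the freshly built tokenizer.
func newTokenizer(opts ...option) *tokenizer {
	t := &tokenizer{}
	for _, o := range opts {
		o(t)
	}
	return t
}

func main() {
	fmt.Println(newTokenizer().omitBosEos)             // false
	fmt.Println(newTokenizer(omitBosEos()).omitBosEos) // true
}
```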
type Token ¶
type Token struct {
	Index    int
	ID       int
	Class    TokenClass
	Position int // byte position
	Start    int
	End      int
	Surface  string
	// contains filtered or unexported fields
}
Token represents a morph of a sentence.
func (Token) EqualFeatures ¶ added in v2.8.0
EqualFeatures returns true if the features of the tokens are equal.
func (Token) InflectionalForm ¶
InflectionalForm returns the inflectional form feature if it exists.
func (Token) InflectionalType ¶
InflectionalType returns the inflectional type feature if it exists.
func (Token) Pronunciation ¶
Pronunciation returns the pronunciation feature if it exists.
type TokenClass ¶
TokenClass represents the token class.
func (TokenClass) String ¶
func (c TokenClass) String() string
String returns the string representation of a token class.
type TokenData ¶ added in v2.6.4
type TokenData struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}
TokenData is a data format with all the contents of the token.
func NewTokenData ¶ added in v2.6.4
NewTokenData returns a TokenData populated with all the contents of the token.
type TokenizeMode ¶
type TokenizeMode int
TokenizeMode represents a tokenization mode.
Kagome provides segmentation modes for search, similar to Kuromoji:
Normal: regular segmentation
Search: use a heuristic to do additional segmentation useful for search
Extended: similar to search mode, but also unigrams unknown words
const (
	// Normal is the normal tokenize mode.
	Normal TokenizeMode = iota + 1
	// Search is the tokenize mode for search.
	Search
	// Extended is the experimental tokenize mode.
	Extended

	// BosEosID means the beginning of a sentence (BOS) or the end of a sentence (EOS).
	BosEosID = lattice.BosEosID
)
func (TokenizeMode) String ¶
func (m TokenizeMode) String() string
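The constants above start at 1 via `iota + 1`, so the zero value is not a valid mode. A self-contained sketch of that declaration with an illustrative String method follows; the exact labels the library's String returns are an assumption here, not taken from its source.

```go
package main

import "fmt"

// TokenizeMode mirrors the library's declaration: Normal starts at 1,
// so an uninitialized TokenizeMode (0) is invalid.
type TokenizeMode int

const (
	Normal TokenizeMode = iota + 1
	Search
	Extended
)

// String uses illustrative labels; the library's actual strings may differ.
func (m TokenizeMode) String() string {
	switch m {
	case Normal:
		return "normal"
	case Search:
		return "search"
	case Extended:
		return "extended"
	}
	return fmt.Sprintf("unknown tokenize mode (%d)", int(m))
}

func main() {
	for _, m := range []TokenizeMode{Normal, Search, Extended} {
		fmt.Printf("%d=%s\n", int(m), m)
	}
}
```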
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer represents a morphological analyzer.
func (Tokenizer) Analyze ¶
func (t Tokenizer) Analyze(input string, mode TokenizeMode) []Token
Analyze tokenizes a sentence in the specified mode.
func (Tokenizer) AnalyzeGraph ¶
AnalyzeGraph returns morphs of a sentence and exports a lattice graph to dot format.
func (Tokenizer) Dot ¶
Dot returns morphs of a sentence and exports a lattice graph to dot format in standard tokenize mode.