tokenizer

package

v2.10.0 (latest) | Published: Aug 3, 2024 | License: MIT | Imports: 7 | Imported by: 84

Repository

github.com/ikawaha/kagome

Documentation ¶

Overview ¶

Package tokenizer is a Japanese morphological analyzer library.

Example (Tokenize_mode) ¶

d, err := prepareTestDict()
if err != nil {
	panic(err)
}
t, err := New(d)
if err != nil {
	panic(err)
}
for _, mode := range []TokenizeMode{Normal, Search, Extended} {
	// Pass the loop variable so that each mode is actually exercised.
	tokens := t.Analyze("関西国際空港", mode)
	fmt.Printf("---%s---\n", mode)
	for _, token := range tokens {
		if token.Class == DUMMY {
			// BOS: Begin Of Sentence, EOS: End Of Sentence.
			fmt.Printf("%s\n", token.Surface)
			continue
		}
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

Output:

Index ¶

Constants
func EqualFeatures(lhs, rhs []string) bool
type Option
type Token
type TokenClass
- func (c TokenClass) String() string
type TokenData
- func NewTokenData(t Token) TokenData
type TokenizeMode
- func (m TokenizeMode) String() string
type Tokenizer
- func New(d *dict.Dict, opts ...Option) (*Tokenizer, error)
type UserExtra

Examples ¶

Package (Tokenize_mode)

Constants ¶

const (
	// DUMMY represents the dummy token.
	DUMMY = TokenClass(lattice.DUMMY)
	// KNOWN represents the token in the dictionary.
	KNOWN = TokenClass(lattice.KNOWN)
	// UNKNOWN represents the token which is not in the dictionary.
	UNKNOWN = TokenClass(lattice.UNKNOWN)
	// USER represents the token in the user dictionary.
	USER = TokenClass(lattice.USER)
)

Variables ¶

This section is empty.

Functions ¶

func EqualFeatures ¶ added in v2.8.0

func EqualFeatures(lhs, rhs []string) bool

EqualFeatures returns true if the two feature slices are equal.
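
The implementation is not shown on this page. As a minimal sketch, assuming that "equal" means the same length and the same string at every index, the comparison could look like the following. The name equalFeatures is a hypothetical stand-in, not the library function.

```go
package main

import "fmt"

// equalFeatures is a hypothetical stand-in for tokenizer.EqualFeatures.
// Assumption: two feature slices are equal when they have the same length
// and the same string at every index.
func equalFeatures(lhs, rhs []string) bool {
	if len(lhs) != len(rhs) {
		return false
	}
	for i := range lhs {
		if lhs[i] != rhs[i] {
			return false
		}
	}
	return true
}

func main() {
	a := []string{"名詞", "固有名詞", "地域"}
	b := []string{"名詞", "固有名詞", "地域"}
	c := []string{"名詞", "一般"}
	fmt.Println(equalFeatures(a, b)) // true
	fmt.Println(equalFeatures(a, c)) // false
}
```

Under this assumption a nil slice and an empty slice compare as equal, because both have length zero.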

Types ¶

type Option ¶

type Option func(*Tokenizer) error

Option represents an option for the tokenizer.

func Nop ¶

func Nop() Option

Nop represents a no operation option.

func OmitBosEos ¶ added in v2.1.1

func OmitBosEos() Option

OmitBosEos is a tokenizer option to omit BOS/EOS from output tokens.

func UserDict ¶

func UserDict(d *dict.UserDict) Option

UserDict is a tokenizer option that sets a user dictionary.
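
Option has the shape of the functional-options pattern: a function that receives the tokenizer and may return an error. The sketch below shows how a constructor typically applies such options. All names in it (miniTokenizer, option, nop, omitBosEos, failing, newMini) are hypothetical stand-ins, since the real Tokenizer fields are unexported and how New handles option errors is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// miniTokenizer is a hypothetical stand-in for Tokenizer.
type miniTokenizer struct {
	omitBosEos bool
}

// option mirrors the shape of tokenizer.Option.
type option func(*miniTokenizer) error

// nop returns an option that changes nothing, in the spirit of Nop.
func nop() option {
	return func(*miniTokenizer) error { return nil }
}

// omitBosEos returns an option that sets a flag, in the spirit of OmitBosEos.
func omitBosEos() option {
	return func(t *miniTokenizer) error {
		t.omitBosEos = true
		return nil
	}
}

// failing returns an option that always reports an error (illustration only).
func failing() option {
	return func(*miniTokenizer) error { return errors.New("invalid option") }
}

// newMini applies the options in order and stops at the first error.
func newMini(opts ...option) (*miniTokenizer, error) {
	t := &miniTokenizer{}
	for _, opt := range opts {
		if err := opt(t); err != nil {
			return nil, err
		}
	}
	return t, nil
}

func main() {
	tk, err := newMini(nop(), omitBosEos())
	fmt.Println(tk.omitBosEos, err) // true <nil>

	_, err = newMini(failing())
	fmt.Println(err) // invalid option
}
```

A no-op option is handy when an option is chosen conditionally: the caller can always pass a value instead of building the argument list with branches.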

type Token ¶

type Token struct {
	Index    int
	ID       int
	Class    TokenClass
	Position int // byte position
	Start    int
	End      int
	Surface  string
	// contains filtered or unexported fields
}

Token represents a morph of a sentence.

func (Token) BaseForm ¶

func (t Token) BaseForm() (string, bool)

BaseForm returns the base form feature if it exists.

func (Token) Equal ¶ added in v2.1.1

func (t Token) Equal(v Token) bool

Equal returns true if tokens are equal.

func (Token) EqualFeatures ¶ added in v2.8.0

func (t Token) EqualFeatures(tt Token) bool

EqualFeatures returns true if the features of the two tokens are equal.

func (Token) EqualPOS ¶ added in v2.8.0

func (t Token) EqualPOS(tt Token) bool

EqualPOS returns true if the POS elements of the two tokens are equal.

func (Token) FeatureAt ¶

func (t Token) FeatureAt(i int) (string, bool)

FeatureAt returns the i-th feature if it exists.
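
The (string, bool) result follows Go's "comma ok" convention: the boolean reports whether the value is present. As a minimal sketch of that contract over a plain slice, assuming an out-of-range index simply yields false, a hypothetical helper featureAt (not the library method) might look like this:

```go
package main

import "fmt"

// featureAt is a hypothetical stand-in for Token.FeatureAt that works on a
// plain slice. Assumption: any index outside the slice yields ("", false).
func featureAt(features []string, i int) (string, bool) {
	if i < 0 || i >= len(features) {
		return "", false
	}
	return features[i], true
}

func main() {
	// Hand-written features, not real analyzer output.
	features := []string{"名詞", "固有名詞", "組織"}

	if f, ok := featureAt(features, 1); ok {
		fmt.Println(f) // 固有名詞
	}
	if _, ok := featureAt(features, 10); !ok {
		fmt.Println("no such feature")
	}
}
```

The other accessors that return (string, bool), such as BaseForm and Reading, can be consumed the same way: check the boolean before using the string.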

func (Token) Features ¶

func (t Token) Features() []string

Features returns contents of a token.

func (Token) InflectionalForm ¶

func (t Token) InflectionalForm() (string, bool)

InflectionalForm returns the inflectional form feature if it exists.

func (Token) InflectionalType ¶

func (t Token) InflectionalType() (string, bool)

InflectionalType returns the inflectional type feature if it exists.

func (Token) POS ¶

func (t Token) POS() []string

POS returns POS elements of features.

func (Token) Pronunciation ¶

func (t Token) Pronunciation() (string, bool)

Pronunciation returns the pronunciation feature if it exists.

func (Token) Reading ¶

func (t Token) Reading() (string, bool)

Reading returns the reading feature if it exists.

func (Token) String ¶

func (t Token) String() string

String returns a string representation of a token.

func (Token) UserExtra ¶ added in v2.9.0

func (t Token) UserExtra() *UserExtra

UserExtra returns extra data if the token comes from a user dictionary.

type TokenClass ¶

type TokenClass lattice.NodeClass

TokenClass represents the token class.

func (TokenClass) String ¶

func (c TokenClass) String() string

String returns string representation of a token class.

type TokenData ¶ added in v2.6.4

type TokenData struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

TokenData is a data format with all the contents of the token.

func NewTokenData ¶ added in v2.6.4

func NewTokenData(t Token) TokenData

NewTokenData returns a TokenData holding all the contents of the token.
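
Because TokenData carries JSON struct tags, it can be passed straight to encoding/json. The sketch below uses a local copy of the struct so that it runs without the library, and fills it with hand-written values rather than real analyzer output. The helper encode is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TokenData is a local copy of the documented struct, reproduced here only
// so the sketch is self-contained.
type TokenData struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

// encode is a hypothetical helper that serializes one TokenData value.
func encode(td TokenData) (string, error) {
	b, err := json.Marshal(td)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Illustrative values only; the ID and features are made up.
	td := TokenData{
		ID:       1,
		Start:    0,
		End:      2,
		Surface:  "関西",
		Class:    "KNOWN",
		POS:      []string{"名詞", "固有名詞"},
		BaseForm: "関西",
		Features: []string{"名詞", "固有名詞", "関西"},
	}
	s, err := encode(td)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

None of the tags use omitempty, so empty strings are emitted as "" and nil slices as null.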

type TokenizeMode ¶

type TokenizeMode int

TokenizeMode represents a tokenization mode.

Like Kuromoji, Kagome provides segmentation modes intended for search.

Normal: Regular segmentation
Search: Uses a heuristic to do additional segmentation useful for search
Extended: Similar to Search mode, but also splits unknown words into unigrams

const (
	// Normal is the normal tokenize mode.
	Normal TokenizeMode = iota + 1
	// Search is the tokenize mode for search.
	Search
	// Extended is the experimental tokenize mode.
	Extended
	// BosEosID means the beginning of a sentence (BOS) or the end of a sentence (EOS).
	BosEosID = lattice.BosEosID
)
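
Because the modes are declared with iota + 1, they take the values 1, 2 and 3, and the zero value of TokenizeMode matches none of them. The sketch below reproduces that numbering with a local type. The names mode, normal, search, extended and isValid are hypothetical, and whether the package itself validates modes is not stated here.

```go
package main

import "fmt"

// mode is a local stand-in for TokenizeMode.
type mode int

const (
	normal mode = iota + 1 // 1
	search                 // 2
	extended               // 3
)

// isValid is a hypothetical helper: it reports whether m is one of the
// three declared modes.
func isValid(m mode) bool {
	return m >= normal && m <= extended
}

func main() {
	var zero mode // an uninitialized mode variable
	fmt.Println(normal, search, extended) // 1 2 3
	fmt.Println(isValid(zero))            // false
	fmt.Println(isValid(search))          // true
}
```

Starting at 1 makes a forgotten initialization detectable, since a zero-valued variable cannot be mistaken for Normal.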

func (TokenizeMode) String ¶

func (m TokenizeMode) String() string

type Tokenizer ¶

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a morphological analyzer.

func New ¶

func New(d *dict.Dict, opts ...Option) (*Tokenizer, error)

New creates a tokenizer.

func (Tokenizer) Analyze ¶

func (t Tokenizer) Analyze(input string, mode TokenizeMode) []Token

Analyze tokenizes a sentence in the specified mode.

func (Tokenizer) AnalyzeGraph ¶

func (t Tokenizer) AnalyzeGraph(w io.Writer, input string, mode TokenizeMode) []Token

AnalyzeGraph returns morphs of a sentence and exports a lattice graph to dot format.

func (Tokenizer) Dot ¶

func (t Tokenizer) Dot(w io.Writer, input string) (tokens []Token)

Dot returns morphs of a sentence and exports a lattice graph to dot format in standard tokenize mode.

func (Tokenizer) Tokenize ¶

func (t Tokenizer) Tokenize(input string) []Token

Tokenize analyzes a sentence in standard tokenize mode.

func (Tokenizer) Wakati ¶

func (t Tokenizer) Wakati(input string) []string

Wakati tokenizes a sentence and returns its divided surface strings.
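
Conceptually, the result is the list of token surfaces in sentence order. The sketch below illustrates that idea on hand-written tokens. The token type and the surfaces helper are hypothetical stand-ins, and skipping BOS/EOS markers and empty surfaces is an assumption about the behavior, not something this page states.

```go
package main

import "fmt"

// token is a hypothetical stand-in for tokenizer.Token; only the fields
// needed for this sketch are included.
type token struct {
	surface string
	dummy   bool // true for BOS/EOS markers
}

// surfaces collects the surface strings of non-dummy, non-empty tokens,
// preserving their order.
func surfaces(tokens []token) []string {
	out := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if tok.dummy || tok.surface == "" {
			continue
		}
		out = append(out, tok.surface)
	}
	return out
}

func main() {
	// Hand-written tokens, not real analyzer output.
	tokens := []token{
		{surface: "BOS", dummy: true},
		{surface: "すもも"},
		{surface: "も"},
		{surface: "もも"},
		{surface: "EOS", dummy: true},
	}
	fmt.Println(surfaces(tokens)) // [すもも も もも]
}
```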

type UserExtra ¶ added in v2.9.0

type UserExtra struct {
	Tokens   []string
	Readings []string
}

UserExtra represents custom segmentation and custom reading for user entries.

Directories ¶

Path	Synopsis
lattice	Package lattice implements the core of the morph analyzer.
mem	Package mem implements the memory utility such as memory pool.
