tokenizer

package

Version: v2.7.0 (latest) | Published: Sep 14, 2021 | License: MIT | Imports: 7 | Imported by: 77

Repository

github.com/ikawaha/kagome

Documentation ¶

Overview ¶

Package tokenizer is a Japanese morphological analyzer library.

Example (Tokenize_mode) ¶

d, err := prepareTestDict()
if err != nil {
	panic(err)
}
t, err := New(d)
if err != nil {
	panic(err)
}
for _, mode := range []TokenizeMode{Normal, Search, Extended} {
	tokens := t.Analyze("関西国際空港", mode)
	fmt.Printf("---%s---\n", mode)
	for _, token := range tokens {
		if token.Class == DUMMY {
			// BOS: Begin Of Sentence, EOS: End Of Sentence.
			fmt.Printf("%s\n", token.Surface)
			continue
		}
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

Output:

Index ¶

Constants
type Option
type Token
type TokenClass
- func (c TokenClass) String() string
type TokenData
- func NewTokenData(t Token) TokenData
type TokenizeMode
- func (m TokenizeMode) String() string
type Tokenizer
- func New(d *dict.Dict, opts ...Option) (*Tokenizer, error)

Examples ¶

Package (Tokenize_mode)

Constants ¶


const (
	// DUMMY represents the dummy token.
	DUMMY = TokenClass(lattice.DUMMY)
	// KNOWN represents the token in the dictionary.
	KNOWN = TokenClass(lattice.KNOWN)
	// UNKNOWN represents the token which is not in the dictionary.
	UNKNOWN = TokenClass(lattice.UNKNOWN)
	// USER represents the token in the user dictionary.
	USER = TokenClass(lattice.USER)
)

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Option ¶

type Option func(*Tokenizer) error

Option represents an option for the tokenizer.

func Nop ¶

func Nop() Option

Nop represents a no-operation option.

func OmitBosEos ¶ added in v2.1.1

func OmitBosEos() Option

OmitBosEos is a tokenizer option to omit BOS/EOS from output tokens.

func UserDict ¶

func UserDict(d *dict.UserDict) Option

UserDict is a tokenizer option that sets a user dictionary.
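The Option type follows Go's functional options pattern: each option is a function that mutates the tokenizer and may fail. The following is a minimal sketch of that pattern using only the standard library. The names analyzer, option, nop, omitBosEos, withUserWords, and newAnalyzer are hypothetical stand-ins, not part of this package, and the real Tokenizer's internals are unexported and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// analyzer is a hypothetical stand-in for Tokenizer.
type analyzer struct {
	skipBosEos bool
	words      []string
}

// option mirrors the documented shape func(*Tokenizer) error.
type option func(*analyzer) error

// nop returns an option that changes nothing.
func nop() option {
	return func(*analyzer) error { return nil }
}

// omitBosEos returns an option that sets a flag on the analyzer.
func omitBosEos() option {
	return func(a *analyzer) error {
		a.skipBosEos = true
		return nil
	}
}

// withUserWords returns an option that can fail, which is why
// options return an error.
func withUserWords(words []string) option {
	return func(a *analyzer) error {
		if len(words) == 0 {
			return errors.New("empty user dictionary")
		}
		a.words = words
		return nil
	}
}

// newAnalyzer applies the options in order and stops at the first error.
func newAnalyzer(opts ...option) (*analyzer, error) {
	a := &analyzer{}
	for _, opt := range opts {
		if err := opt(a); err != nil {
			return nil, err
		}
	}
	return a, nil
}

func main() {
	a, err := newAnalyzer(nop(), omitBosEos())
	fmt.Println(a.skipBosEos, err)

	_, err = newAnalyzer(withUserWords(nil))
	fmt.Println(err)
}
```

Because New accepts opts as a variadic parameter, callers that need no options can omit them entirely.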

type Token ¶

type Token struct {
	Index    int
	ID       int
	Class    TokenClass
	Position int // byte position
	Start    int
	End      int
	Surface  string
	// contains filtered or unexported fields
}

Token represents a morph of a sentence.
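Position is documented as a byte position, which differs from a character index for Japanese text because most Japanese characters take three bytes in UTF-8. The sketch below shows the difference with the standard library only. runeIndex is a hypothetical helper, not part of this package, and it assumes the byte offset falls on a character boundary.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeIndex converts a byte offset within s into a rune (character) index.
// It assumes bytePos falls on a rune boundary.
func runeIndex(s string, bytePos int) int {
	return utf8.RuneCountInString(s[:bytePos])
}

func main() {
	s := "関西国際空港"
	// Each of these characters occupies 3 bytes in UTF-8.
	fmt.Println(len(s))                    // 18 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 6 characters
	fmt.Println(runeIndex(s, 6))           // byte offset 6 is character index 2
}
```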

func (Token) BaseForm ¶

func (t Token) BaseForm() (string, bool)

BaseForm returns the base form feature if it exists.

func (Token) Equal ¶ added in v2.1.1

func (t Token) Equal(v Token) bool

Equal returns true if tokens are equal.

func (Token) FeatureAt ¶

func (t Token) FeatureAt(i int) (string, bool)

FeatureAt returns the i-th feature if it exists.
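FeatureAt and the other feature accessors return a (string, bool) pair, where the bool reports whether the feature exists. A minimal sketch of that calling convention follows. featureAt here is a hypothetical free function over a plain slice, not the package's method, and the feature values are made up for illustration.

```go
package main

import "fmt"

// featureAt returns the i-th element of features and reports whether it exists.
func featureAt(features []string, i int) (string, bool) {
	if i < 0 || i >= len(features) {
		return "", false
	}
	return features[i], true
}

func main() {
	// Hypothetical feature list; real contents depend on the dictionary.
	features := []string{"名詞", "固有名詞", "組織"}

	if v, ok := featureAt(features, 1); ok {
		fmt.Println(v)
	}
	if _, ok := featureAt(features, 9); !ok {
		fmt.Println("no such feature")
	}
}
```

Checking the bool rather than comparing against the empty string keeps a missing feature distinguishable from one that is present but empty.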

func (Token) Features ¶

func (t Token) Features() []string

Features returns contents of a token.

func (Token) InflectionalForm ¶

func (t Token) InflectionalForm() (string, bool)

InflectionalForm returns the inflectional form feature if it exists.

func (Token) InflectionalType ¶

func (t Token) InflectionalType() (string, bool)

InflectionalType returns the inflectional type feature if it exists.

func (Token) POS ¶

func (t Token) POS() []string

POS returns POS elements of features.

func (Token) Pronunciation ¶

func (t Token) Pronunciation() (string, bool)

Pronunciation returns the pronunciation feature if it exists.

func (Token) Reading ¶

func (t Token) Reading() (string, bool)

Reading returns the reading feature if it exists.

func (Token) String ¶

func (t Token) String() string

String returns a string representation of a token.

type TokenClass ¶

type TokenClass lattice.NodeClass

TokenClass represents the token class.

func (TokenClass) String ¶

func (c TokenClass) String() string

String returns string representation of a token class.

type TokenData ¶ added in v2.6.4

type TokenData struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

TokenData is a data format with all the contents of the token.

func NewTokenData ¶ added in v2.6.4

func NewTokenData(t Token) TokenData

NewTokenData returns a TokenData holding all the contents of the token.

type TokenizeMode ¶

type TokenizeMode int

TokenizeMode represents a tokenization mode.

Like Kuromoji, Kagome provides segmentation modes intended for search.

Normal: Regular segmentation
Search: Use a heuristic to do additional segmentation useful for search
Extended: Similar to search mode, but also splits unknown words into unigrams

const (
	// Normal is the normal tokenize mode.
	Normal TokenizeMode = iota + 1
	// Search is the tokenize mode for search.
	Search
	// Extended is the experimental tokenize mode.
	Extended
	// BosEosID means the beginning of a sentence (BOS) or the end of a sentence (EOS).
	BosEosID = lattice.BosEosID
)
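Because the modes are declared with iota + 1, Normal is 1 and the zero value of TokenizeMode is not one of the named modes. The sketch below reproduces that numbering with a local stand-in type. The type mode, its constants, and the strings returned by String are hypothetical; the labels the package's own String method returns are not documented here.

```go
package main

import "fmt"

// mode is a hypothetical stand-in for TokenizeMode.
type mode int

const (
	normal   mode = iota + 1 // 1
	search                   // 2
	extended                 // 3
)

// String returns an illustrative label for each mode.
func (m mode) String() string {
	switch m {
	case normal:
		return "normal"
	case search:
		return "search"
	case extended:
		return "extended"
	}
	return fmt.Sprintf("mode(%d)", int(m))
}

func main() {
	var zero mode // the zero value matches none of the named modes
	fmt.Println(zero, normal, search, extended)
}
```

Starting at 1 lets code tell an unset mode apart from a deliberately chosen one.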

func (TokenizeMode) String ¶

func (m TokenizeMode) String() string

type Tokenizer ¶

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a morphological analyzer.

func New ¶

func New(d *dict.Dict, opts ...Option) (*Tokenizer, error)

New creates a tokenizer.

func (Tokenizer) Analyze ¶

func (t Tokenizer) Analyze(input string, mode TokenizeMode) []Token

Analyze tokenizes a sentence in the specified mode.

func (Tokenizer) AnalyzeGraph ¶

func (t Tokenizer) AnalyzeGraph(w io.Writer, input string, mode TokenizeMode) []Token

AnalyzeGraph returns the morphs of a sentence and writes the lattice graph to w in dot format.

func (Tokenizer) Dot ¶

func (t Tokenizer) Dot(w io.Writer, input string) (tokens []Token)

Dot returns the morphs of a sentence and writes the lattice graph to w in dot format, using the standard tokenize mode.

func (Tokenizer) Tokenize ¶

func (t Tokenizer) Tokenize(input string) []Token

Tokenize analyzes a sentence in standard tokenize mode.

func (Tokenizer) Wakati ¶

func (t Tokenizer) Wakati(input string) []string

Wakati tokenizes a sentence and returns its divided surface strings.

Directories ¶

Path	Synopsis
lattice	Package lattice implements the core of the morph analyzer.
