package tokenizer
Version: v1.1.0 (not the latest version of its module)
Published: Aug 2, 2020 License: MIT Imports: 6 Imported by: 0

Documentation

Index

Constants

const (
	SCWS_MULTI_NONE    int = 0x00000 // none
	SCWS_MULTI_SHORT   int = 0x01000 // short words
	SCWS_MULTI_DUALITY int = 0x02000 // duality (combine adjacent single characters into two-character words)
	SCWS_MULTI_ZMAIN   int = 0x04000 // important single characters
	SCWS_MULTI_ZALL    int = 0x08000 // all single characters
	SCWS_MULTI_MASK    int = 0xff000
	SCWS_XDICT_XDB     int = 1
	SCWS_XDICT_MEM     int = 2
	SCWS_XDICT_TXT     int = 4
)
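The SCWS_MULTI_* values are bit flags, so several modes can be combined with bitwise OR, and SCWS_MULTI_MASK covers all of them. A minimal sketch (the constants are copied here so the example is self-contained; how SetMulti consumes them is described further below):

```go
package main

import "fmt"

// Copies of the package constants, reproduced so this example compiles on its own.
const (
	SCWS_MULTI_NONE    = 0x00000
	SCWS_MULTI_SHORT   = 0x01000
	SCWS_MULTI_DUALITY = 0x02000
	SCWS_MULTI_MASK    = 0xff000
)

func main() {
	// Combine two multi-segmentation modes; the result still fits the mask.
	mode := SCWS_MULTI_SHORT | SCWS_MULTI_DUALITY
	fmt.Printf("mode=%#x within mask: %v\n", mode, mode&SCWS_MULTI_MASK == mode)
}
```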

Variables

This section is empty.

Functions

func CloseScws

func CloseScws()

CloseScws closes the global scws instance.

func InitScws

func InitScws(dict string, rule ...string) error

InitScws initializes the global scws instance.

Types

type DefaultTokenizer

type DefaultTokenizer struct {
	Name string
}

DefaultTokenizer splits text on spaces.

func (DefaultTokenizer) GetTokens

func (tokenizer DefaultTokenizer) GetTokens(text string) []string

GetTokens splits text on spaces.
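A self-contained sketch of the documented behaviour. The helper name spaceTokens is invented for illustration, and the package may implement splitting differently (e.g. on single space characters rather than runs of whitespace):

```go
package main

import (
	"fmt"
	"strings"
)

// spaceTokens illustrates DefaultTokenizer.GetTokens as documented:
// split the input text on whitespace into tokens.
func spaceTokens(text string) []string {
	return strings.Fields(text)
}

func main() {
	fmt.Println(spaceTokens("hello tokenizer  world")) // [hello tokenizer world]
}
```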

type ScwsTokenizer

type ScwsTokenizer struct {
	// contains filtered or unexported fields
}

ScwsTokenizer is a tokenizer backed by the scws Chinese word segmentation library.

func GetScwsTokenizer

func GetScwsTokenizer() (*ScwsTokenizer, error)

GetScwsTokenizer returns an scws instance; it must be called after InitScws.
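The typical lifecycle is: InitScws once, obtain tokenizers with GetScwsTokenizer, Close each tokenizer when done, and CloseScws at shutdown. A hedged sketch of that workflow; the import path and the dictionary/rule file paths are placeholders, and running it requires the scws C library to be installed:

```go
package main

import (
	"fmt"
	"log"

	tokenizer "example.com/tokenizer" // hypothetical import path
)

func main() {
	// Initialize the global scws instance before requesting any tokenizer.
	if err := tokenizer.InitScws("/path/to/dict.xdb", "/path/to/rules.ini"); err != nil {
		log.Fatal(err)
	}
	defer tokenizer.CloseScws()

	tk, err := tokenizer.GetScwsTokenizer()
	if err != nil {
		log.Fatal(err)
	}
	defer tk.Close() // free the per-tokenizer memory

	fmt.Println(tk.GetTokens("我是中国人"))
}
```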

func (*ScwsTokenizer) AddTxtDict

func (scws *ScwsTokenizer) AddTxtDict(dict string) error

AddTxtDict loads an additional text dictionary file.

func (*ScwsTokenizer) AddXdbDict

func (scws *ScwsTokenizer) AddXdbDict(dict string) error

AddXdbDict loads an additional xdb dictionary file.

func (*ScwsTokenizer) Close

func (scws *ScwsTokenizer) Close()

Close frees the tokenizer's memory.

func (ScwsTokenizer) GetTokens

func (scws ScwsTokenizer) GetTokens(text string) []string

GetTokens returns the terms produced by scws segmentation.

func (*ScwsTokenizer) SetDuality

func (scws *ScwsTokenizer) SetDuality(yes int)

SetDuality sets whether scattered single characters are automatically combined into two-character words.

func (*ScwsTokenizer) SetIgnore

func (scws *ScwsTokenizer) SetIgnore(yes int)

SetIgnore sets whether punctuation and other special symbols are ignored in the segmentation result (\r and \n are never ignored).

func (*ScwsTokenizer) SetMulti

func (scws *ScwsTokenizer) SetMulti(mode int)

SetMulti sets whether compound segmentation is applied to long words (e.g. “中国人” is split into “中国”, “人”, and “中国人”).

func (*ScwsTokenizer) SetRule

func (scws *ScwsTokenizer) SetRule(ruleFile string)

SetRule sets the rule set file.

func (*ScwsTokenizer) SetTxtDict

func (scws *ScwsTokenizer) SetTxtDict(dict string) error

SetTxtDict sets a text dictionary file as the dictionary.

func (*ScwsTokenizer) SetXdbDict

func (scws *ScwsTokenizer) SetXdbDict(dict string) error

SetXdbDict sets an xdb dictionary file as the dictionary.

type Tokenizer

type Tokenizer interface {
	GetTokens(text string) []string
}

Tokenizer is the interface used by the indexer and searcher.
