Documentation ¶
Overview ¶
Package gse Go efficient multilingual NLP and text segmentation, Go 高性能多语言 NLP 和分词
Index ¶
- Constants
- Variables
- func DictPaths(dictDir, filePath string) (files []string)
- func FilterEmoji(text string) (new string)
- func FilterHtml(text string) string
- func FilterLang(text, lang string) (new string)
- func FilterSymbol(text string) (new string)
- func FindAllOccs(data []byte, searches []string) map[string][]int
- func GetVersion() string
- func IsJp(segText string) bool
- func Join(text []Text) string
- func Range(text string) (new []string)
- func RangeText(text string) (new string)
- func SplitNum(text string) []string
- func SplitNums(text string) string
- func ToSlice(segs []Segment, searchMode ...bool) (output []string)
- func ToString(segs []Segment, searchMode ...bool) (output string)
- type AnalyzeToken
- type Dictionary
- func (dict *Dictionary) AddToken(token Token) error
- func (dict *Dictionary) Find(word []byte) (float64, string, bool)
- func (dict *Dictionary) LookupTokens(words []Text, tokens []*Token) (numOfTokens int)
- func (dict *Dictionary) MaxTokenLen() int
- func (dict *Dictionary) NumTokens() int
- func (dict *Dictionary) RemoveToken(token Token) error
- func (dict *Dictionary) TotalFreq() float64
- func (dict *Dictionary) Value(word []byte) (val, id int, err error)
- type Prob
- type SegPos
- type Segment
- type Segmenter
- func (seg *Segmenter) AddStop(text string)
- func (seg *Segmenter) AddToken(text string, freq float64, pos ...string) error
- func (seg *Segmenter) AddTokenForce(text string, freq float64, pos ...string) (err error)
- func (seg *Segmenter) Analyze(text []string, t1 string, by ...bool) (az []AnalyzeToken)
- func (seg *Segmenter) CalcToken()
- func (seg *Segmenter) Cut(str string, hmm ...bool) []string
- func (seg *Segmenter) CutAll(str string) []string
- func (seg *Segmenter) CutDAG(str string, reg ...*regexp.Regexp) []string
- func (seg *Segmenter) CutDAGNoHMM(str string) []string
- func (seg *Segmenter) CutSearch(str string, hmm ...bool) []string
- func (seg *Segmenter) CutStr(str []string, separator ...string) (r string)
- func (seg *Segmenter) CutTrim(str string, hmm ...bool) []string
- func (seg *Segmenter) CutTrimHtml(str string, hmm ...bool) []string
- func (seg *Segmenter) CutTrimHtmls(str string, hmm ...bool) string
- func (seg *Segmenter) CutUrl(str string, num ...bool) []string
- func (seg *Segmenter) CutUrls(str string, num ...bool) string
- func (seg *Segmenter) Dictionary() *Dictionary
- func (seg *Segmenter) Empty() error
- func (seg *Segmenter) EmptyStop() error
- func (seg *Segmenter) Find(str string) (float64, string, bool)
- func (seg *Segmenter) GetCurrentFilePath() string
- func (seg *Segmenter) GetIdfPath(files ...string) []string
- func (seg *Segmenter) HMMCut(str string, reg ...*regexp.Regexp) []string
- func (seg *Segmenter) HMMCutMod(str string, prob ...map[rune]float64) []string
- func (seg *Segmenter) Init()
- func (seg *Segmenter) IsStop(s string) bool
- func (seg *Segmenter) LoadDict(files ...string) error
- func (seg *Segmenter) LoadDictEmbed(dict ...string) (err error)
- func (seg *Segmenter) LoadDictMap(dict []map[string]string) error
- func (seg *Segmenter) LoadDictStr(dict string) error
- func (seg *Segmenter) LoadModel(prob ...map[rune]float64)
- func (seg *Segmenter) LoadStop(files ...string) error
- func (seg *Segmenter) LoadStopArr(dict []string)
- func (seg *Segmenter) LoadStopEmbed(dict ...string) (err error)
- func (seg *Segmenter) LoadStopStr(dict string) error
- func (seg *Segmenter) ModeSegment(bytes []byte, searchMode ...bool) []Segment
- func (seg *Segmenter) Pos(s string, searchMode ...bool) []SegPos
- func (seg *Segmenter) PosStr(str []SegPos, separator ...string) (r string)
- func (seg *Segmenter) PosTrim(str string, search bool, pos ...string) []SegPos
- func (seg *Segmenter) PosTrimArr(str string, search bool, pos ...string) (re []string)
- func (seg *Segmenter) PosTrimStr(str string, search bool, pos ...string) string
- func (seg *Segmenter) ReAddToken(text string, freq float64, pos ...string) error
- func (seg *Segmenter) Read(file string) error
- func (seg *Segmenter) Reader(reader io.Reader, files ...string) error
- func (seg *Segmenter) RemoveStop(text string)
- func (seg *Segmenter) RemoveToken(text string) error
- func (seg *Segmenter) Segment(bytes []byte) []Segment
- func (seg *Segmenter) SetDataPath(dataPath string)
- func (seg *Segmenter) Size(size int, text, freqText string) (freq float64)
- func (seg *Segmenter) Slice(s string, searchMode ...bool) []string
- func (seg *Segmenter) SplitTextToWords(text Text) []Text
- func (seg *Segmenter) String(s string, searchMode ...bool) string
- func (seg *Segmenter) SuggestFreq(words ...string) float64
- func (seg *Segmenter) ToToken(text string, freq float64, pos ...string) Token
- func (seg *Segmenter) Trim(s []string) (r []string)
- func (seg *Segmenter) TrimPos(s []SegPos) (r []SegPos)
- func (seg *Segmenter) TrimPosPunct(se []SegPos) (re []SegPos)
- func (seg *Segmenter) TrimPunct(s []string) (r []string)
- func (seg *Segmenter) TrimSymbol(s []string) (r []string)
- func (seg *Segmenter) TrimWithPos(se []SegPos, pos ...string) (re []SegPos)
- func (seg *Segmenter) Value(str string) (int, int, error)
- type Text
- type Token
Constants ¶
const ( // RatioWord ratio words and letters RatioWord float32 = 1.5 // RatioWordFull full ratio words and letters RatioWordFull float32 = 1 )
const (
// Version get the gse version
Version = "v0.69.9.593, Green Lake!"
)
Variables ¶
var StopWordMap = map[string]bool{ " ": true, }
StopWordMap the default stop words.
var ( // ToLower set alpha tolower ToLower = true )
Functions ¶
func FindAllOccs ¶
FindAllOccs finds the start offsets of all occurrences of the search strings in data
func ToSlice ¶
ToSlice segments to slice 输出分词结果到一个字符串 slice
有两种输出模式,以 "山达尔星联邦共和国" 为例
普通模式(searchMode=false)输出一个分词"[山达尔星联邦共和国]" 搜索模式(searchMode=true) 输出普通模式的再细致切分: "[山达尔星 联邦 共和 国 共和国 联邦共和国 山达尔星联邦共和国]"
默认 searchMode=false 搜索模式主要用于给搜索引擎提供尽可能多的关键字,详情请见Token结构体的注释。
Types ¶
type AnalyzeToken ¶
type AnalyzeToken struct { // 分词在文本中的起始位置 Start int End int Position int Len int Type string Text string Freq float64 Pos string }
AnalyzeToken analyze the segment info structure
type Dictionary ¶
type Dictionary struct { Tokens []Token // 词典中所有的分词,方便遍历 // contains filtered or unexported fields }
Dictionary 结构体实现了一个字串双数组树, 一个分词可能出现在叶子节点也有可能出现在非叶节点
func (*Dictionary) AddToken ¶
func (dict *Dictionary) AddToken(token Token) error
AddToken 向词典中加入一个分词
func (*Dictionary) Find ¶
func (dict *Dictionary) Find(word []byte) (float64, string, bool)
Find looks up the word in the dictionary, returning its frequency, its part of speech, and whether it exists
func (*Dictionary) LookupTokens ¶
func (dict *Dictionary) LookupTokens( words []Text, tokens []*Token) (numOfTokens int)
LookupTokens 在词典中查找和字元组 words 可以前缀匹配的所有分词 返回值为找到的分词数
func (*Dictionary) RemoveToken ¶
func (dict *Dictionary) RemoveToken(token Token) error
RemoveToken removes a token from the dictionary
type Segment ¶
type Segment struct { Position int // contains filtered or unexported fields }
Segment 文本中的一个分词
type Segmenter ¶
type Segmenter struct { Dict *Dictionary Load bool // AlphaNum set splitTextToWords can add token // when words in alphanum // set up alphanum dictionary word segmentation AlphaNum bool Alpha bool Num bool // LoadNoFreq load not have freq dict word LoadNoFreq bool // MinTokenFreq load min freq token MinTokenFreq float64 // TextFreq add token frequency when not specified freq TextFreq string // SkipLog set skip log print SkipLog bool MoreLog bool // SkipPos skip PosStr pos SkipPos bool NotStop bool // StopWordMap the stop word map StopWordMap map[string]bool DataPath string }
Segmenter 分词器结构体
func (*Segmenter) AddTokenForce ¶
AddTokenForce forcibly adds new text as a token; this operation can be time-consuming
func (*Segmenter) Analyze ¶
func (seg *Segmenter) Analyze(text []string, t1 string, by ...bool) (az []AnalyzeToken)
Analyze analyze the token segment info
func (*Segmenter) Cut ¶
Cut cuts a str into words using accurate mode. Parameter hmm controls whether to use the HMM(Hidden Markov Model) or use the user's model.
seg.Cut(text):
use the shortest path
seg.Cut(text, false):
use cut dag not hmm
seg.Cut(text, true):
use cut dag and hmm mode
func (*Segmenter) CutDAGNoHMM ¶
CutDAGNoHMM cut string with DAG not use hmm
func (*Segmenter) CutTrimHtml ¶
CutTrimHtml cut string trim html and symbol return []string
func (*Segmenter) CutTrimHtmls ¶
CutTrimHtmls cut string trim html and symbol return string
func (*Segmenter) GetCurrentFilePath ¶
GetCurrentFilePath get current file path
func (*Segmenter) GetIdfPath ¶
GetIdfPath get the idf path
func (*Segmenter) LoadDict ¶
LoadDict load the dictionary from the file
The format of the dictionary is (one for each participle):
participle text, frequency, part of speech
Can load multiple dictionary files, the file name separated by "," or ", " the front of the dictionary preferentially load the participle,
such as: "user_dictionary.txt,common_dictionary.txt"
When a participle appears both in the user dictionary and in the `common dictionary`, the `user dictionary` is given priority.
从文件中载入词典
可以载入多个词典文件,文件名用 "," 或 ", " 分隔,排在前面的词典优先载入分词,比如:
"用户词典.txt,通用词典.txt"
当一个分词既出现在用户词典也出现在 `通用词典` 中,则优先使用 `用户词典`。
词典的格式为(每个分词一行):
分词文本 频率 词性
func (*Segmenter) LoadDictEmbed ¶
LoadDictEmbed load dictionary by embed file
func (*Segmenter) LoadDictMap ¶
LoadDictMap load dictionary from []map[string]string
func (*Segmenter) LoadDictStr ¶
LoadDictStr load dictionary from string
func (*Segmenter) LoadModel ¶
LoadModel load the hmm model
Use the user's model:
seg.LoadModel(B, E, M, S map[rune]float64)
func (*Segmenter) LoadStopArr ¶
LoadStopArr load stop word by []string
func (*Segmenter) LoadStopEmbed ¶
LoadStopEmbed load stop dictionary from embed file
func (*Segmenter) LoadStopStr ¶
LoadStopStr loads the stop dictionary from a string
func (*Segmenter) ModeSegment ¶
ModeSegment segment using search mode if searchMode is true
func (*Segmenter) PosTrimArr ¶
PosTrimArr cut string return pos.Text []string
func (*Segmenter) PosTrimStr ¶
PosTrimStr cut string return pos.Text string
func (*Segmenter) ReAddToken ¶
ReAddToken remove and add token again
func (*Segmenter) RemoveStop ¶
RemoveStop remove a token from the StopWord dictionary.
func (*Segmenter) RemoveToken ¶
RemoveToken removes a token from the dictionary
func (*Segmenter) SetDataPath ¶ added in v0.69.17
func (*Segmenter) Slice ¶
Slice uses ModeSegment to segment text and returns a []string, using search mode if searchMode is true
func (*Segmenter) SplitTextToWords ¶
SplitTextToWords 将文本划分成字元
func (*Segmenter) String ¶
String uses ModeSegment to segment text and returns a string, using search mode if searchMode is true
func (*Segmenter) SuggestFreq ¶
SuggestFreq suggests a word's frequency, returning the frequency needed for a word that would otherwise be cut into shorter words.
func (*Segmenter) TrimPosPunct ¶
TrimPosPunct trim SegPos not space and punct
func (*Segmenter) TrimSymbol ¶
TrimSymbol trim []string exclude symbol, space and punct
func (*Segmenter) TrimWithPos ¶
TrimWithPos trim some seg with pos
type Text ¶
type Text []byte
Text 字串类型,可以用来表达
- 一个字元,比如 "世" 又如 "界", 英文的一个字元是一个词
- 一个分词,比如 "世界" 又如 "人口"
- 一段文字,比如 "世界有七十亿人口"
Source Files ¶
Directories ¶
Path | Synopsis |
---|---|
gonn
|
|
Package hmm is the Golang HMM cut module Package hmm model data The data from https://github.com/fxsjy/jieba
|
Package hmm is the Golang HMM cut module Package hmm model data The data from https://github.com/fxsjy/jieba |
pos
Package pos model data The data from https://github.com/fxsjy/jieba
|
Package pos model data The data from https://github.com/fxsjy/jieba |
tools
|
|