jieba

package module

v0.0.0-...-36c17a1 Latest Latest Go to latest Published: Dec 3, 2022 License: AGPL-3.0 Imports: 8 Imported by: 6

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/fumiama/jieba

Links

Open Source Insights

README ¶

jieba

结巴分词是由 @fxsjy 使用 Python 编写的中文分词组件，本仓库是结巴分词的 Golang 语言实现，修改于jiebago，大幅优化了速度与性能，增加了从io.Reader加载字典等功能。

使用

go get -d github.com/fumiama/jieba

示例

package main

import (
        "fmt"

        "github.com/fumiama/jieba"
)

func main() {
	seg, err := LoadDictionaryAt("dict.txt")
	if err != nil {
		panic(err)
	}

	fmt.Print("【全模式】：")
	fmt.Println(seg.CutAll("我来到北京清华大学"))

	fmt.Print("【精确模式】：")
	fmt.Println(seg.Cut("我来到北京清华大学", false))

	fmt.Print("【新词识别】：")
	fmt.Println(seg.Cut("他来到了网易杭研大厦", true))

	fmt.Print("【搜索引擎模式】：")
	fmt.Println(seg.CutForSearch("小明硕士毕业于中国科学院计算所，后在日本京都大学深造", true))
}

输出结果：

【全模式】：[我 来到 北京 清华 清华大学 华大 大学]
【精确模式】：[我 来到 北京 清华大学]
【新词识别】：[他 来到 了 网易 杭研 大厦]
【搜索引擎模式】：[小明 硕士 毕业 于 中国 科学 学院 科学院 中国科学院 计算 计算所 ， 后 在 日本 京都 大学 日本京都大学 深造]

更多信息请参考文档。

分词速度

goos: darwin
goarch: amd64
pkg: github.com/fumiama/jieba
cpu: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
BenchmarkCutNoHMM-8            	   50101	     22889 ns/op	   4.67 MB/s	   24492 B/op	     148 allocs/op
BenchmarkCut-8                 	   47473	     25152 ns/op	   4.25 MB/s	   31310 B/op	     185 allocs/op
BenchmarkCutAll-8              	   81760	     14286 ns/op	   7.49 MB/s	   22746 B/op	      75 allocs/op
BenchmarkCutForSearchNoHMM-8   	   49009	     24371 ns/op	   4.39 MB/s	   26421 B/op	     157 allocs/op
BenchmarkCutForSearch-8        	   44643	     26597 ns/op	   4.02 MB/s	   33224 B/op	     194 allocs/op
PASS
ok  	github.com/fumiama/jieba	8.769s

对比原仓库速度

goos: darwin
goarch: amd64
pkg: 
cpu: Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
BenchmarkCutNoHMM-8            	   21237	     56105 ns/op	   1.91 MB/s	   11514 B/op	     133 allocs/op
BenchmarkCut-8                 	   17604	     68463 ns/op	   1.56 MB/s	   13480 B/op	     200 allocs/op
BenchmarkCutAll-8              	   24620	     49472 ns/op	   2.16 MB/s	    7724 B/op	     116 allocs/op
BenchmarkCutForSearchNoHMM-8   	   17803	     66158 ns/op	   1.62 MB/s	   11766 B/op	     143 allocs/op
BenchmarkCutForSearch-8        	   14895	     79056 ns/op	   1.35 MB/s	   13772 B/op	     210 allocs/op
PASS
ok  		11.911s

Documentation ¶

Overview ¶

Package jieba is the Golang implemention of [Jieba](https://github.com/fxsjy/jieba), Python Chinese text segmentation module.

Example ¶

seg, err := LoadDictionaryAt("dict.txt")
if err != nil {
	panic(err)
}

fmt.Print("【全模式】：")
fmt.Println(seg.CutAll("我来到北京清华大学"))

fmt.Print("【精确模式】：")
fmt.Println(seg.Cut("我来到北京清华大学", false))

fmt.Print("【新词识别】：")
fmt.Println(seg.Cut("他来到了网易杭研大厦", true))

fmt.Print("【搜索引擎模式】：")
fmt.Println(seg.CutForSearch("小明硕士毕业于中国科学院计算所，后在日本京都大学深造", true))

Output:

【全模式】：[我 来到 北京 清华 清华大学 华大 大学]
【精确模式】：[我 来到 北京 清华大学]
【新词识别】：[他 来到 了 网易 杭研 大厦]
【搜索引擎模式】：[小明 硕士 毕业 于 中国 科学 学院 科学院 中国科学院 计算 计算所 ， 后 在 日本 京都 大学 日本京都大学 深造]

Example (LoadUserDictionary) ¶

seg, err := LoadDictionaryAt("dict.txt")
if err != nil {
	panic(err)
}

sentence := "李小福是创新办主任也是云计算方面的专家"
fmt.Print("Before:")
fmt.Println(seg.Cut(sentence, true))

seg.LoadUserDictionaryAt("userdict.txt")

fmt.Print("After:")
fmt.Println(seg.Cut(sentence, true))

Output:

Before:[李小福 是 创新 办 主任 也 是 云 计算 方面 的 专家]
After:[李小福 是 创新办 主任 也 是 云计算 方面 的 专家]

Example (ParallelCut) ¶

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"runtime"
	"strings"
	"time"
)

type line struct {
	number int
	text   string
}

var (
	segmenter  *Segmenter
	numThreads = runtime.NumCPU()
	task       = make(chan line, numThreads)
	result     = make(chan line, numThreads)
)

func worker() {
	for l := range task {
		segments := segmenter.Cut(l.text, true)

		l.text = fmt.Sprintf("%s\n", strings.Join(segments, " / "))
		result <- l
	}
}

func main() {
	// open file for segmentation
	file, err := os.Open("README.md")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// Load dictionary
	segmenter, err = LoadDictionaryAt("dict.txt")
	if err != nil {
		log.Fatal(err)
	}

	// start worker routines
	for i := 0; i < numThreads; i++ {
		go worker()
	}

	var length, size int
	scanner := bufio.NewScanner(file)

	t0 := time.Now()

	lines := make([]string, 0)

	// Read lines
	for scanner.Scan() {
		t := scanner.Text()
		size += len(t)
		lines = append(lines, t)
	}
	length = len(lines)

	// Segmentation
	go func() {
		for i := 0; i < length; i++ {
			task <- line{number: i, text: lines[i]}
		}
		close(task)
	}()

	// Make sure the segmentation result contains same line as original file
	for i := 0; i < length; i++ {
		l := <-result
		lines[l.number] = l.text
	}

	t1 := time.Now()
	close(result)

	// Write the segments into a file for verify
	outputFile, _ := os.OpenFile("parallelCut.log", os.O_CREATE|os.O_WRONLY, 0600)
	defer outputFile.Close()
	writer := bufio.NewWriter(outputFile)
	for _, l := range lines {
		writer.WriteString(l)
	}
	writer.Flush()

	log.Printf("Time cousumed: %v", t1.Sub(t0))
	log.Printf("Segmentation speed: %f MB/s", float64(size)/t1.Sub(t0).Seconds()/(1024*1024))
}

Output:

Example (SuggestFrequency) ¶

seg, err := LoadDictionaryAt("dict.txt")
if err != nil {
	panic(err)
}

sentence := "超敏C反应蛋白是什么？"
fmt.Print("Before:")
fmt.Println(seg.Cut(sentence, false))
word := "超敏C反应蛋白"
oldFrequency, _ := seg.Frequency(word)
frequency := seg.SuggestFrequency(word)
fmt.Printf("%s current frequency: %f, suggest: %f.\n", word, oldFrequency, frequency)
seg.AddWord(word, frequency)
fmt.Print("After:")
fmt.Println(seg.Cut(sentence, false))

sentence = "如果放到post中将出错"
fmt.Print("Before:")
fmt.Println(seg.Cut(sentence, false))
word = "中将"
oldFrequency, _ = seg.Frequency(word)
frequency = seg.SuggestFrequency("中", "将")
fmt.Printf("%s current frequency: %f, suggest: %f.\n", word, oldFrequency, frequency)
seg.AddWord(word, frequency)
fmt.Print("After:")
fmt.Println(seg.Cut(sentence, false))

sentence = "今天天气不错"
fmt.Print("Before:")
fmt.Println(seg.Cut(sentence, false))
word = "今天天气"
oldFrequency, _ = seg.Frequency(word)
frequency = seg.SuggestFrequency("今天", "天气")
fmt.Printf("%s current frequency: %f, suggest: %f.\n", word, oldFrequency, frequency)
seg.AddWord(word, frequency)
fmt.Print("After:")
fmt.Println(seg.Cut(sentence, false))

Output:

Before:[超敏 C 反应 蛋白 是 什么 ？]
超敏C反应蛋白 current frequency: 0.000000, suggest: 1.000000.
After:[超敏C反应蛋白 是 什么 ？]
Before:[如果 放到 post 中将 出错]
中将 current frequency: 763.000000, suggest: 494.000000.
After:[如果 放到 post 中 将 出错]
Before:[今天天气 不错]
今天天气 current frequency: 3.000000, suggest: 0.000000.
After:[今天 天气 不错]

Index ¶

Constants
type Dictionary
type Segmenter
- func LoadDictionary(file io.Reader) (*Segmenter, error)
- func LoadDictionaryAt(file string) (*Segmenter, error)

Constants ¶

View Source

const (
	RatioLetterWord     float32 = 1.5
	RatioLetterWordFull float32 = 1
)

ratio words and letters in an article commonly

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Dictionary ¶

type Dictionary struct {
	sync.RWMutex
	// contains filtered or unexported fields
}

A Dictionary represents a thread-safe dictionary used for word segmentation.

func (*Dictionary) AddToken ¶

func (d *Dictionary) AddToken(token dictionary.Token)

AddToken adds one token

func (*Dictionary) Frequency ¶

func (d *Dictionary) Frequency(key string) (float64, bool)

Frequency returns the frequency and existence of give word

func (*Dictionary) Load ¶

func (d *Dictionary) Load(tokens ...dictionary.Token)

Load loads all tokens

type Segmenter ¶

type Segmenter Dictionary

Segmenter is a Chinese words segmentation struct.

func LoadDictionary ¶

func LoadDictionary(file io.Reader) (*Segmenter, error)

LoadDictionary loads dictionary from given file name. Everytime LoadDictionary is called, previously loaded dictionary will be cleard.

func LoadDictionaryAt ¶

func LoadDictionaryAt(file string) (*Segmenter, error)

LoadDictionaryAt loads dictionary from given file name. Everytime LoadDictionaryAt is called, previously loaded dictionary will be cleard.

func (*Segmenter) AddWord ¶

func (seg *Segmenter) AddWord(word string, frequency float64)

AddWord adds a new word with frequency to dictionary

func (*Segmenter) Cut ¶

func (seg *Segmenter) Cut(sentence string, hmm bool) []string

Cut cuts a sentence into words using accurate mode. Parameter hmm controls whether to use the Hidden Markov Model. Accurate mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.

func (*Segmenter) CutAll ¶

func (seg *Segmenter) CutAll(sentence string) []string

CutAll cuts a sentence into words using full mode. Full mode gets all the possible words from the sentence. Fast but not accurate.

func (*Segmenter) CutForSearch ¶

func (seg *Segmenter) CutForSearch(sentence string, hmm bool) []string

CutForSearch cuts sentence into words using search engine mode. Search engine mode, based on the accurate mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.

func (*Segmenter) DeleteWord ¶

func (seg *Segmenter) DeleteWord(word string)

DeleteWord removes a word from dictionary

func (*Segmenter) Frequency ¶

func (seg *Segmenter) Frequency(word string) (float64, bool)

Frequency returns a word's frequency and existence

func (*Segmenter) LoadUserDictionary ¶

func (seg *Segmenter) LoadUserDictionary(file io.Reader) error

LoadUserDictionary loads a user specified dictionary, it must be called after LoadDictionary, and it will not clear any previous loaded dictionary, instead it will override exist entries.

func (*Segmenter) LoadUserDictionaryAt ¶

func (seg *Segmenter) LoadUserDictionaryAt(file string) error

LoadUserDictionaryAt loads a user specified dictionary, it must be called after LoadDictionary, and it will not clear any previous loaded dictionary, instead it will override exist entries.

func (*Segmenter) SuggestFrequency ¶

func (seg *Segmenter) SuggestFrequency(words ...string) float64

SuggestFrequency returns a suggested frequncy of a word or a long word cutted into several short words.

This method is useful when a word in the sentence is not cutted out correctly.

If a word should not be further cutted, for example word "石墨烯" should not be cutted into "石墨" and "烯", SuggestFrequency("石墨烯") will return the maximu frequency for this word.

If a word should be further cutted, for example word "今天天气" should be further cutted into two words "今天" and "天气", SuggestFrequency("今天", "天气") should return the minimum frequency for word "今天天气".

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
analyse Package analyse is the Golang implementation of Jieba's analyse module.	Package analyse is the Golang implementation of Jieba's analyse module.
dictionary Package dictionary contains a interface and wraps all io related work.	Package dictionary contains a interface and wraps all io related work.
finalseg Package finalseg is the Golang implementation of Jieba's finalseg module.	Package finalseg is the Golang implementation of Jieba's finalseg module.
posseg Package posseg is the Golang implementation of Jieba's posseg module.	Package posseg is the Golang implementation of Jieba's posseg module.
tokenizers
util Package util contains some util functions used by jieba.	Package util contains some util functions used by jieba.
helper

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL