gotokenizer

A tokenizer for Go based on dictionary and Bigram language models. (Currently only Chinese segmentation is supported.)

Motivation

I wanted a simple tokenizer with no unnecessary overhead, built only on the standard library, following good practices, with well-tested code.

Features

  • Supports Maximum Matching
  • Supports Minimum Matching
  • Supports Reverse Maximum Matching
  • Supports Reverse Minimum Matching
  • Supports Bidirectional Maximum Matching
  • Supports Bidirectional Minimum Matching
  • Supports stop token filtering
  • Supports custom word filters

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch uses NumAndLetterWordFilter as its default word filter.
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load the dict
	if err := mm.LoadDict(); err != nil {
		panic(err)
	}

	fmt.Println(mm.Get(text)) // [gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// enable stop token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	if err := mm.StopTokens.Load(stopTokenDictPath); err != nil {
		panic(err)
	}

	fmt.Println(mm.Get(text))          // [gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) // map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>
}

For more examples, see the tests; a further sketch using BiDirectionalMaxMatch follows below.
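
As a further sketch, the bidirectional maximum matching tokenizer takes both a dict and a bigram dict. The bigram_dict.txt filename below is an assumption; point it at the actual bigram dict file in your data directory:

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	bigramDictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/bigram_dict.txt" // assumed filename

	bdmm := gotokenizer.NewBiDirectionalMaxMatch(dictPath, bigramDictPath)
	// load both the main dict and the bigram dict
	if err := bdmm.LoadDict(); err != nil {
		panic(err)
	}

	fmt.Println(bdmm.Get(text))
	fmt.Println(bdmm.GetFrequency(text))
}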

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

License

gotokenizer is open-source software licensed under the Apache-2.0 license.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word

Documentation

Index

Constants

This section is empty.

Variables

var DefaultMinTokenLen = 2

    DefaultMinTokenLen is the default minimum token length.

Functions

func CheckDictIsLoaded

func CheckDictIsLoaded(dict *Dict) error

    CheckDictIsLoaded checks whether the dict is loaded.

func GetFrequency

func GetFrequency(result []string) map[string]int

    GetFrequency returns the frequency of each token in result.

func Reverse

func Reverse(s []string) []string

    Reverse returns the reversed string slice.
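
As a quick illustration of the standalone GetFrequency helper, here is a minimal sketch based only on the signature above:

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	tokens := []string{"go", "语言", "go", "分词"}
	// Count how often each token occurs in the slice.
	fmt.Println(gotokenizer.GetFrequency(tokens)) // map[go:2 分词:1 语言:1]
}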

Types

type BiDirectionalMaxMatch

type BiDirectionalMaxMatch struct {
	MMScore  float64
	RMMScore float64
	MM       *MaxMatch
	RMM      *ReverseMaxMatch
	// contains filtered or unexported fields
}

    BiDirectionalMaxMatch records the dict and bigram dict, etc.

func NewBiDirectionalMaxMatch

func NewBiDirectionalMaxMatch(dictPath, bigramDictPath string) *BiDirectionalMaxMatch

    NewBiDirectionalMaxMatch returns a newly initialized BiDirectionalMaxMatch object.

func (*BiDirectionalMaxMatch) Get

func (bdmm *BiDirectionalMaxMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*BiDirectionalMaxMatch) GetFrequency

func (bdmm *BiDirectionalMaxMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*BiDirectionalMaxMatch) LoadDict

func (bdmm *BiDirectionalMaxMatch) LoadDict() error

    LoadDict loads the dict and bigram dict; it implements the Tokenizer interface.

type BiDirectionalMinMatch

type BiDirectionalMinMatch struct {
	MMScore  float64
	RMMScore float64
	MM       *MinMatch
	RMM      *ReverseMinMatch
	// contains filtered or unexported fields
}

    BiDirectionalMinMatch records the dict and bigram dict, etc.

func NewBiDirectionalMinMatch

func NewBiDirectionalMinMatch(dictPath, bigramDictPath string) *BiDirectionalMinMatch

    NewBiDirectionalMinMatch returns a newly initialized BiDirectionalMinMatch object.

func (*BiDirectionalMinMatch) Get

func (bdmm *BiDirectionalMinMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*BiDirectionalMinMatch) GetFrequency

func (bdmm *BiDirectionalMinMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*BiDirectionalMinMatch) LoadDict

func (bdmm *BiDirectionalMinMatch) LoadDict() error

    LoadDict loads the dict and bigram dict; it implements the Tokenizer interface.

type BigramDict

type BigramDict struct {
	// contains filtered or unexported fields
}

    BigramDict records the dictPath and bigram records.

func NewBigramDict

func NewBigramDict(dictPath string) *BigramDict

    NewBigramDict returns a newly initialized BigramDict object.

func (*BigramDict) Load

func (bd *BigramDict) Load() error

    Load loads the bigram dict records.
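
A BigramDict can also be loaded on its own; a minimal sketch (the path is a placeholder for your bigram dict file):

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	// The path below is a placeholder; point it at your bigram dict file.
	bd := gotokenizer.NewBigramDict("/path/to/gotokenizer/data/zh/bigram_dict.txt")
	if err := bd.Load(); err != nil {
		fmt.Println("load bigram dict failed:", err)
		return
	}
	fmt.Println("bigram dict loaded")
}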

type Dict

type Dict struct {
	Records map[string]DictRecord

	DictPath string
	// contains filtered or unexported fields
}

    Dict records Records, DictPath, etc.

func NewDict

func NewDict(dictPath string) *Dict

    NewDict returns a newly initialized Dict object.

func (*Dict) Load

func (dict *Dict) Load() error

    Load loads the Dict.

type DictRecord

type DictRecord struct {
	TF    string
	Token string
	POS   string // part of speech
}

    DictRecord records dict entry metadata.

type MaxMatch

type MaxMatch struct {
	WordFilter             WordFilter
	EnabledFilterStopToken bool
	StopTokens             *StopTokens
	// contains filtered or unexported fields
}

    MaxMatch records the dict and dictPath.

func NewMaxMatch

func NewMaxMatch(dictPath string) *MaxMatch

    NewMaxMatch returns a newly initialized MaxMatch object.

func (*MaxMatch) Get

func (mm *MaxMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*MaxMatch) GetFrequency

func (mm *MaxMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*MaxMatch) LoadDict

func (mm *MaxMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type MinMatch

type MinMatch struct {
	// contains filtered or unexported fields
}

    MinMatch records the dict and dictPath.

func NewMinMatch

func NewMinMatch(dictPath string) *MinMatch

    NewMinMatch returns a newly initialized MinMatch object.

func (*MinMatch) Get

func (mm *MinMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*MinMatch) GetFrequency

func (mm *MinMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*MinMatch) LoadDict

func (mm *MinMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type NumAndLetterWordFilter

type NumAndLetterWordFilter struct {
}

    NumAndLetterWordFilter implements the WordFilter interface.

func (*NumAndLetterWordFilter) Filter

func (nlFilter *NumAndLetterWordFilter) Filter(text string) bool

    Filter implements the WordFilter interface.

type ReverseMaxMatch

type ReverseMaxMatch struct {
	WordFilter             WordFilter
	EnabledFilterStopToken bool
	StopTokens             *StopTokens
	// contains filtered or unexported fields
}

    ReverseMaxMatch records the dict and dictPath.

func NewReverseMaxMatch

func NewReverseMaxMatch(dictPath string) *ReverseMaxMatch

    NewReverseMaxMatch returns a newly initialized ReverseMaxMatch object.

func (*ReverseMaxMatch) Get

func (rmm *ReverseMaxMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*ReverseMaxMatch) GetFrequency

func (rmm *ReverseMaxMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*ReverseMaxMatch) LoadDict

func (rmm *ReverseMaxMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type ReverseMinMatch

type ReverseMinMatch struct {
	// contains filtered or unexported fields
}

    ReverseMinMatch records the dict and dictPath.

func NewReverseMinMatch

func NewReverseMinMatch(dictPath string) *ReverseMinMatch

    NewReverseMinMatch returns a newly initialized ReverseMinMatch object.

func (*ReverseMinMatch) Get

func (rmm *ReverseMinMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*ReverseMinMatch) GetFrequency

func (rmm *ReverseMinMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*ReverseMinMatch) LoadDict

func (rmm *ReverseMinMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type StopTokens

type StopTokens struct {
	IsLoaded bool
	// contains filtered or unexported fields
}

    StopTokens records the stop token path and records.

func NewStopTokens

func NewStopTokens() *StopTokens

    NewStopTokens returns a newly initialized StopTokens object.

func (*StopTokens) IsStopToken

func (st *StopTokens) IsStopToken(token string) bool

    IsStopToken reports whether token is a stop token.

func (*StopTokens) Load

func (st *StopTokens) Load(path string) error

    Load loads the stop token dict from path.

type Tokenizer

type Tokenizer interface {
	GetFrequency(text string) (map[string]int, error)
	Get(text string) ([]string, error)
	LoadDict() error
}

    Tokenizer defines the tokenizer interface.
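
All of the tokenizers above satisfy this interface, so they can be used interchangeably behind a common function. A minimal sketch (the segment helper and the dict path are illustrative, not part of the package):

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

// segment is a hypothetical helper that works with any Tokenizer implementation.
func segment(t gotokenizer.Tokenizer, text string) ([]string, error) {
	if err := t.LoadDict(); err != nil {
		return nil, err
	}
	return t.Get(text)
}

func main() {
	dictPath := "/path/to/gotokenizer/data/zh/dict.txt" // adjust to your local dict
	tokens, err := segment(gotokenizer.NewMaxMatch(dictPath), "支持6种分词算法")
	fmt.Println(tokens, err)
}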

type WordFilter

type WordFilter interface {
	Filter(text string) bool
}
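
A custom filter can be plugged into a tokenizer by assigning to its WordFilter field (MaxMatch and ReverseMaxMatch expose one), replacing the default NumAndLetterWordFilter. The sketch below only shows the wiring; asciiFilter is illustrative, and how the boolean result is used is defined by the package's matching logic, not by this example:

package main

import (
	"fmt"
	"unicode"

	"github.com/xujiajun/gotokenizer"
)

// asciiFilter is a hypothetical WordFilter implementation: it reports whether
// text consists only of ASCII letters and digits.
type asciiFilter struct{}

func (f *asciiFilter) Filter(text string) bool {
	for _, r := range text {
		if r > unicode.MaxASCII || !(unicode.IsLetter(r) || unicode.IsDigit(r)) {
			return false
		}
	}
	return len(text) > 0
}

func main() {
	dictPath := "/path/to/gotokenizer/data/zh/dict.txt" // adjust to your local dict
	mm := gotokenizer.NewMaxMatch(dictPath)
	mm.WordFilter = &asciiFilter{} // replaces the default NumAndLetterWordFilter
	if err := mm.LoadDict(); err != nil {
		panic(err)
	}
	fmt.Println(mm.Get("支持自定义word过滤功能"))
}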