gotokenizer

A tokenizer for Go based on dictionary and Bigram language models. (Currently only Chinese segmentation is supported.)

Motivation

I wanted a simple tokenizer with no unnecessary overhead, built only on the standard library, following good practices, with well-tested code.

Features

  • Supports Maximum Matching
  • Supports Minimum Matching
  • Supports Reverse Maximum Matching
  • Supports Reverse Minimum Matching
  • Supports Bidirectional Maximum Matching
  • Supports Bidirectional Minimum Matching
  • Supports stop token filtering
  • Supports custom word filters

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch uses NumAndLetterWordFilter as its default word filter.
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load the dict
	if err := mm.LoadDict(); err != nil {
		panic(err)
	}

	fmt.Println(mm.Get(text)) // [gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// enable stop token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	if err := mm.StopTokens.Load(stopTokenDictPath); err != nil {
		panic(err)
	}

	fmt.Println(mm.Get(text))          // [gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) // map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>
}

For more examples, see the tests; a further sketch using BiDirectionalMaxMatch follows below.
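
As a further sketch, the bidirectional maximum matching tokenizer takes both a dict and a bigram dict. The bigram_dict.txt filename below is an assumption; point it at the actual bigram dict file in your data directory:

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器"

	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	bigramDictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/bigram_dict.txt" // assumed filename

	bdmm := gotokenizer.NewBiDirectionalMaxMatch(dictPath, bigramDictPath)
	// load both the main dict and the bigram dict
	if err := bdmm.LoadDict(); err != nil {
		panic(err)
	}

	fmt.Println(bdmm.Get(text))
	fmt.Println(bdmm.GetFrequency(text))
}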

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

License

gotokenizer is open-source software licensed under the Apache-2.0 license.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word

Documentation

Index

Constants

This section is empty.

Variables

var DefaultMinTokenLen = 2

    DefaultMinTokenLen is the default minimum token length.

Functions

func CheckDictIsLoaded

func CheckDictIsLoaded(dict *Dict) error

    CheckDictIsLoaded checks whether the dict is loaded.

func GetFrequency

func GetFrequency(result []string) map[string]int

    GetFrequency returns the frequency of each token in result.

func Reverse

func Reverse(s []string) []string

    Reverse returns the reversed string slice.
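
As a quick illustration of the standalone GetFrequency helper, here is a minimal sketch based only on the signature above:

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	tokens := []string{"go", "语言", "go", "分词"}
	// Count how often each token occurs in the slice.
	fmt.Println(gotokenizer.GetFrequency(tokens)) // map[go:2 分词:1 语言:1]
}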

Types

type BiDirectionalMaxMatch

type BiDirectionalMaxMatch struct {
	MMScore  float64
	RMMScore float64
	MM       *MaxMatch
	RMM      *ReverseMaxMatch
	// contains filtered or unexported fields
}

    BiDirectionalMaxMatch records the dict and bigram dict, etc.

func NewBiDirectionalMaxMatch

func NewBiDirectionalMaxMatch(dictPath, bigramDictPath string) *BiDirectionalMaxMatch

    NewBiDirectionalMaxMatch returns a newly initialized BiDirectionalMaxMatch object.

func (*BiDirectionalMaxMatch) Get

func (bdmm *BiDirectionalMaxMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*BiDirectionalMaxMatch) GetFrequency

func (bdmm *BiDirectionalMaxMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*BiDirectionalMaxMatch) LoadDict

func (bdmm *BiDirectionalMaxMatch) LoadDict() error

    LoadDict loads the dict and bigram dict; it implements the Tokenizer interface.

type BiDirectionalMinMatch

type BiDirectionalMinMatch struct {
	MMScore  float64
	RMMScore float64
	MM       *MinMatch
	RMM      *ReverseMinMatch
	// contains filtered or unexported fields
}

    BiDirectionalMinMatch records the dict and bigram dict, etc.

func NewBiDirectionalMinMatch

func NewBiDirectionalMinMatch(dictPath, bigramDictPath string) *BiDirectionalMinMatch

    NewBiDirectionalMinMatch returns a newly initialized BiDirectionalMinMatch object.

func (*BiDirectionalMinMatch) Get

func (bdmm *BiDirectionalMinMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*BiDirectionalMinMatch) GetFrequency

func (bdmm *BiDirectionalMinMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*BiDirectionalMinMatch) LoadDict

func (bdmm *BiDirectionalMinMatch) LoadDict() error

    LoadDict loads the dict and bigram dict; it implements the Tokenizer interface.

type BigramDict

type BigramDict struct {
	// contains filtered or unexported fields
}

    BigramDict records the dictPath and bigram records.

func NewBigramDict

func NewBigramDict(dictPath string) *BigramDict

    NewBigramDict returns a newly initialized BigramDict object.

func (*BigramDict) Load

func (bd *BigramDict) Load() error

    Load loads the bigram dict records.
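
A BigramDict can also be loaded on its own; a minimal sketch (the path is a placeholder for your bigram dict file):

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	// The path below is a placeholder; point it at your bigram dict file.
	bd := gotokenizer.NewBigramDict("/path/to/gotokenizer/data/zh/bigram_dict.txt")
	if err := bd.Load(); err != nil {
		fmt.Println("load bigram dict failed:", err)
		return
	}
	fmt.Println("bigram dict loaded")
}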

type Dict

type Dict struct {
	Records map[string]DictRecord

	DictPath string
	// contains filtered or unexported fields
}

    Dict records Records, DictPath, etc.

func NewDict

func NewDict(dictPath string) *Dict

    NewDict returns a newly initialized Dict object.

func (*Dict) Load

func (dict *Dict) Load() error

    Load loads the Dict.

type DictRecord

type DictRecord struct {
	TF    string
	Token string
	POS   string // part of speech
}

    DictRecord records dict entry metadata.

type MaxMatch

type MaxMatch struct {
	WordFilter             WordFilter
	EnabledFilterStopToken bool
	StopTokens             *StopTokens
	// contains filtered or unexported fields
}

    MaxMatch records the dict and dictPath.

func NewMaxMatch

func NewMaxMatch(dictPath string) *MaxMatch

    NewMaxMatch returns a newly initialized MaxMatch object.

func (*MaxMatch) Get

func (mm *MaxMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*MaxMatch) GetFrequency

func (mm *MaxMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*MaxMatch) LoadDict

func (mm *MaxMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type MinMatch

type MinMatch struct {
	// contains filtered or unexported fields
}

    MinMatch records the dict and dictPath.

func NewMinMatch

func NewMinMatch(dictPath string) *MinMatch

    NewMinMatch returns a newly initialized MinMatch object.

func (*MinMatch) Get

func (mm *MinMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*MinMatch) GetFrequency

func (mm *MinMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*MinMatch) LoadDict

func (mm *MinMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type NumAndLetterWordFilter

type NumAndLetterWordFilter struct {
}

    NumAndLetterWordFilter implements the WordFilter interface.

func (*NumAndLetterWordFilter) Filter

func (nlFilter *NumAndLetterWordFilter) Filter(text string) bool

    Filter implements the WordFilter interface.

type ReverseMaxMatch

type ReverseMaxMatch struct {
	WordFilter             WordFilter
	EnabledFilterStopToken bool
	StopTokens             *StopTokens
	// contains filtered or unexported fields
}

    ReverseMaxMatch records the dict and dictPath.

func NewReverseMaxMatch

func NewReverseMaxMatch(dictPath string) *ReverseMaxMatch

    NewReverseMaxMatch returns a newly initialized ReverseMaxMatch object.

func (*ReverseMaxMatch) Get

func (rmm *ReverseMaxMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*ReverseMaxMatch) GetFrequency

func (rmm *ReverseMaxMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*ReverseMaxMatch) LoadDict

func (rmm *ReverseMaxMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type ReverseMinMatch

type ReverseMinMatch struct {
	// contains filtered or unexported fields
}

    ReverseMinMatch records the dict and dictPath.

func NewReverseMinMatch

func NewReverseMinMatch(dictPath string) *ReverseMinMatch

    NewReverseMinMatch returns a newly initialized ReverseMinMatch object.

func (*ReverseMinMatch) Get

func (rmm *ReverseMinMatch) Get(text string) ([]string, error)

    Get returns the segmentation result; it implements the Tokenizer interface.

func (*ReverseMinMatch) GetFrequency

func (rmm *ReverseMinMatch) GetFrequency(text string) (map[string]int, error)

    GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*ReverseMinMatch) LoadDict

func (rmm *ReverseMinMatch) LoadDict() error

    LoadDict loads the dict; it implements the Tokenizer interface.

type StopTokens

type StopTokens struct {
	IsLoaded bool
	// contains filtered or unexported fields
}

    StopTokens records the stop token path and records.

func NewStopTokens

func NewStopTokens() *StopTokens

    NewStopTokens returns a newly initialized StopTokens object.

func (*StopTokens) IsStopToken

func (st *StopTokens) IsStopToken(token string) bool

    IsStopToken reports whether token is a stop token.

func (*StopTokens) Load

func (st *StopTokens) Load(path string) error

    Load loads the stop token dict from path.

type Tokenizer

type Tokenizer interface {
	GetFrequency(text string) (map[string]int, error)
	Get(text string) ([]string, error)
	LoadDict() error
}

    Tokenizer defines the tokenizer interface.
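
All of the tokenizers above satisfy this interface, so they can be used interchangeably behind a common function. A minimal sketch (the segment helper and the dict path are illustrative, not part of the package):

package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

// segment is a hypothetical helper that works with any Tokenizer implementation.
func segment(t gotokenizer.Tokenizer, text string) ([]string, error) {
	if err := t.LoadDict(); err != nil {
		return nil, err
	}
	return t.Get(text)
}

func main() {
	dictPath := "/path/to/gotokenizer/data/zh/dict.txt" // adjust to your local dict
	tokens, err := segment(gotokenizer.NewMaxMatch(dictPath), "支持6种分词算法")
	fmt.Println(tokens, err)
}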

type WordFilter

type WordFilter interface {
	Filter(text string) bool
}
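
A custom filter can be plugged into a tokenizer by assigning to its WordFilter field (MaxMatch and ReverseMaxMatch expose one), replacing the default NumAndLetterWordFilter. The sketch below only shows the wiring; asciiFilter is illustrative, and how the boolean result is used is defined by the package's matching logic, not by this example:

package main

import (
	"fmt"
	"unicode"

	"github.com/xujiajun/gotokenizer"
)

// asciiFilter is a hypothetical WordFilter implementation: it reports whether
// text consists only of ASCII letters and digits.
type asciiFilter struct{}

func (f *asciiFilter) Filter(text string) bool {
	for _, r := range text {
		if r > unicode.MaxASCII || !(unicode.IsLetter(r) || unicode.IsDigit(r)) {
			return false
		}
	}
	return len(text) > 0
}

func main() {
	dictPath := "/path/to/gotokenizer/data/zh/dict.txt" // adjust to your local dict
	mm := gotokenizer.NewMaxMatch(dictPath)
	mm.WordFilter = &asciiFilter{} // replaces the default NumAndLetterWordFilter
	if err := mm.LoadDict(); err != nil {
		panic(err)
	}
	fmt.Println(mm.Get("支持自定义word过滤功能"))
}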