gotokenizer

package module
v1.1.0
Published: Oct 17, 2018 License: Apache-2.0 Imports: 7 Imported by: 0

README

gotokenizer

A tokenizer for Go based on dictionary matching and the Bigram language model. (Currently only Chinese segmentation is supported.)

Motivation

I wanted a simple tokenizer with no unnecessary overhead: one that uses only the standard library, follows good practices, and is well tested.

Features

  • Maximum Matching
  • Minimum Matching
  • Reverse Maximum Matching
  • Reverse Minimum Matching
  • Bidirectional Maximum Matching
  • Bidirectional Minimum Matching
  • Stop-token filtering
  • Custom word filters
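
As background for the matching methods listed above, here is a minimal, self-contained sketch of forward maximum matching over a toy dictionary. It is illustrative only and is not this package's implementation, which also applies word filters and stop tokens:

```go
package main

import "fmt"

// maxMatch performs forward maximum matching: at each position it takes the
// longest dictionary word starting there, falling back to a single rune when
// nothing matches. maxLen bounds the word length tried.
func maxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		matched := false
		for l := maxLen; l > 1; l-- {
			if i+l <= len(runes) && dict[string(runes[i:i+l])] {
				tokens = append(tokens, string(runes[i:i+l]))
				i += l
				matched = true
				break
			}
		}
		if !matched {
			// No dictionary word starts here; emit a single rune.
			tokens = append(tokens, string(runes[i]))
			i++
		}
	}
	return tokens
}

func main() {
	dict := map[string]bool{"分词": true, "算法": true, "分词算法": true}
	fmt.Println(maxMatch("分词算法好", dict, 4)) // [分词算法 好]
}
```

The reverse variants scan from the end of the text instead, and the bidirectional variants compare the two results.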

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
	"fmt"
	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"

	// Path to the bundled Chinese dictionary; adjust to your own checkout location.
	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch's default WordFilter is NumAndLetterWordFilter.
	mm := gotokenizer.NewMaxMatch(dictPath)
	// Load the dictionary before tokenizing.
	if err := mm.LoadDict(); err != nil {
		panic(err)
	}

	fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// Enable stop-token filtering.
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	if err := mm.StopTokens.Load(stopTokenDicPath); err != nil {
		panic(err)
	}

	fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>

}

For more examples, see the tests.

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

License

gotokenizer is open-source software licensed under the Apache-2.0 license.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word

Documentation

Index

Constants

This section is empty.

Variables

var DefaultMinTokenLen = 2

DefaultMinTokenLen is the default minimum token length.

Functions

func CheckDictIsLoaded

func CheckDictIsLoaded(dict *Dict) error

CheckDictIsLoaded checks whether the dict has been loaded.

func GetFrequency

func GetFrequency(result []string) map[string]int

GetFrequency returns the frequency of each token in the result.
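
Semantically this is a simple token count over a segmentation result. A self-contained sketch with the same signature (illustrative, not the package source):

```go
package main

import "fmt"

// tokenFrequency counts occurrences of each token in a segmentation result,
// mirroring what a GetFrequency-style helper returns.
func tokenFrequency(result []string) map[string]int {
	freq := make(map[string]int, len(result))
	for _, t := range result {
		freq[t]++
	}
	return freq
}

func main() {
	fmt.Println(tokenFrequency([]string{"go", "分词", "go"})) // map[go:2 分词:1]
}
```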

func Reverse

func Reverse(s []string) []string

Reverse returns the reversed string slice.
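
Reverse-matching tokenizers need such a helper because they collect tokens from the end of the text toward the beginning. A self-contained sketch (illustrative; the package's version may differ, e.g. by reversing in place):

```go
package main

import "fmt"

// reverseStrings returns a new slice with the elements in reverse order.
func reverseStrings(s []string) []string {
	out := make([]string, len(s))
	for i, v := range s {
		out[len(s)-1-i] = v
	}
	return out
}

func main() {
	fmt.Println(reverseStrings([]string{"c", "b", "a"})) // [a b c]
}
```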

Types

type BiDirectionalMaxMatch

type BiDirectionalMaxMatch struct {
	MMScore  float64
	RMMScore float64
	MM       *MaxMatch
	RMM      *ReverseMaxMatch
	// contains filtered or unexported fields
}

BiDirectionalMaxMatch records the dict, bigram dict, and related state.
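
The MMScore/RMMScore fields suggest that the forward (MM) and reverse (RMM) segmentations are scored and the better one is returned. As a purely illustrative stand-in for such scoring (not necessarily this package's rule, which presumably uses bigram-based scores), one classic bidirectional heuristic prefers the segmentation with fewer tokens:

```go
package main

import "fmt"

// pickSegmentation is a toy bidirectional rule: given forward (mm) and
// reverse (rmm) results, prefer the one with fewer tokens. Real
// implementations typically use language-model scores instead.
func pickSegmentation(mm, rmm []string) []string {
	if len(rmm) < len(mm) {
		return rmm
	}
	return mm
}

func main() {
	fmt.Println(pickSegmentation([]string{"a", "b", "c"}, []string{"ab", "c"})) // [ab c]
}
```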

func NewBiDirectionalMaxMatch

func NewBiDirectionalMaxMatch(dictPath, bigramDictPath string) *BiDirectionalMaxMatch

NewBiDirectionalMaxMatch returns a newly initialized BiDirectionalMaxMatch object

func (*BiDirectionalMaxMatch) Get

func (bdmm *BiDirectionalMaxMatch) Get(text string) ([]string, error)

Get returns the segmentation; it implements the Tokenizer interface.

func (*BiDirectionalMaxMatch) GetFrequency

func (bdmm *BiDirectionalMaxMatch) GetFrequency(text string) (map[string]int, error)

GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*BiDirectionalMaxMatch) LoadDict

func (bdmm *BiDirectionalMaxMatch) LoadDict() error

LoadDict loads the dict and bigram dict; it implements the Tokenizer interface.

type BiDirectionalMinMatch

type BiDirectionalMinMatch struct {
	MMScore  float64
	RMMScore float64
	MM       *MinMatch
	RMM      *ReverseMinMatch
	// contains filtered or unexported fields
}

BiDirectionalMinMatch records the dict, bigram dict, and related state.

func NewBiDirectionalMinMatch

func NewBiDirectionalMinMatch(dictPath, bigramDictPath string) *BiDirectionalMinMatch

NewBiDirectionalMinMatch returns a newly initialized BiDirectionalMinMatch object

func (*BiDirectionalMinMatch) Get

func (bdmm *BiDirectionalMinMatch) Get(text string) ([]string, error)

Get returns the segmentation; it implements the Tokenizer interface.

func (*BiDirectionalMinMatch) GetFrequency

func (bdmm *BiDirectionalMinMatch) GetFrequency(text string) (map[string]int, error)

GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*BiDirectionalMinMatch) LoadDict

func (bdmm *BiDirectionalMinMatch) LoadDict() error

LoadDict loads the dict and bigram dict; it implements the Tokenizer interface.

type BigramDict

type BigramDict struct {
	// contains filtered or unexported fields
}

BigramDict records dictPath and bigram records

func NewBigramDict

func NewBigramDict(dictPath string) *BigramDict

NewBigramDict returns a newly initialized BigramDict object

func (*BigramDict) Load

func (bd *BigramDict) Load() error

Load loads the bigram dict records.

type Dict

type Dict struct {
	Records map[string]DictRecord

	DictPath string
	// contains filtered or unexported fields
}

Dict holds the Records, DictPath, and related state.

func NewDict

func NewDict(dictPath string) *Dict

NewDict returns a newly initialized Dict object

func (*Dict) Load

func (dict *Dict) Load() error

Load loads the Dict.

type DictRecord

type DictRecord struct {
	TF    string
	Token string
	POS   string //part of speech
}

DictRecord records dict meta info

type MaxMatch

type MaxMatch struct {
	WordFilter             WordFilter
	EnabledFilterStopToken bool
	StopTokens             *StopTokens
	// contains filtered or unexported fields
}

MaxMatch records dict and dictPath

func NewMaxMatch

func NewMaxMatch(dictPath string) *MaxMatch

NewMaxMatch returns a newly initialized MaxMatch object

func (*MaxMatch) Get

func (mm *MaxMatch) Get(text string) ([]string, error)

Get returns the segmentation; it implements the Tokenizer interface.

func (*MaxMatch) GetFrequency

func (mm *MaxMatch) GetFrequency(text string) (map[string]int, error)

GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*MaxMatch) LoadDict

func (mm *MaxMatch) LoadDict() error

LoadDict loads the dict; it implements the Tokenizer interface.

type MinMatch

type MinMatch struct {
	// contains filtered or unexported fields
}

MinMatch records dict and dictPath

func NewMinMatch

func NewMinMatch(dictPath string) *MinMatch

NewMinMatch returns a newly initialized MinMatch object

func (*MinMatch) Get

func (mm *MinMatch) Get(text string) ([]string, error)

Get returns the segmentation; it implements the Tokenizer interface.

func (*MinMatch) GetFrequency

func (mm *MinMatch) GetFrequency(text string) (map[string]int, error)

GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*MinMatch) LoadDict

func (mm *MinMatch) LoadDict() error

LoadDict loads the dict; it implements the Tokenizer interface.

type NumAndLetterWordFilter added in v1.1.0

type NumAndLetterWordFilter struct {
}

NumAndLetterWordFilter implements the WordFilter interface.

func (*NumAndLetterWordFilter) Filter added in v1.1.0

func (nlFilter *NumAndLetterWordFilter) Filter(text string) bool

Filter implements the WordFilter interface.
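
The exact semantics are not documented here; one plausible reading, given the usage example's output, is that the filter recognizes tokens made purely of ASCII letters and digits (e.g. "gotokenizer", "6") so they survive segmentation intact. A sketch under that assumption only:

```go
package main

import "fmt"

// isNumOrLetterWord reports whether every rune in the token is an ASCII
// letter or digit. This is a guess at NumAndLetterWordFilter's semantics,
// not the package's actual code.
func isNumOrLetterWord(token string) bool {
	if token == "" {
		return false
	}
	for _, r := range token {
		isLetter := ('a' <= r && r <= 'z') || ('A' <= r && r <= 'Z')
		isDigit := '0' <= r && r <= '9'
		if !isLetter && !isDigit {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isNumOrLetterWord("gotokenizer"), isNumOrLetterWord("分词")) // true false
}
```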

type ReverseMaxMatch

type ReverseMaxMatch struct {
	WordFilter             WordFilter
	EnabledFilterStopToken bool
	StopTokens             *StopTokens
	// contains filtered or unexported fields
}

ReverseMaxMatch records dict and dictPath

func NewReverseMaxMatch

func NewReverseMaxMatch(dictPath string) *ReverseMaxMatch

NewReverseMaxMatch returns a newly initialized ReverseMaxMatch object

func (*ReverseMaxMatch) Get

func (rmm *ReverseMaxMatch) Get(text string) ([]string, error)

Get returns the segmentation; it implements the Tokenizer interface.

func (*ReverseMaxMatch) GetFrequency

func (rmm *ReverseMaxMatch) GetFrequency(text string) (map[string]int, error)

GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*ReverseMaxMatch) LoadDict

func (rmm *ReverseMaxMatch) LoadDict() error

LoadDict loads the dict; it implements the Tokenizer interface.

type ReverseMinMatch

type ReverseMinMatch struct {
	// contains filtered or unexported fields
}

ReverseMinMatch records dict and dictPath

func NewReverseMinMatch

func NewReverseMinMatch(dictPath string) *ReverseMinMatch

NewReverseMinMatch returns a newly initialized ReverseMinMatch object

func (*ReverseMinMatch) Get

func (rmm *ReverseMinMatch) Get(text string) ([]string, error)

Get returns the segmentation; it implements the Tokenizer interface.

func (*ReverseMinMatch) GetFrequency

func (rmm *ReverseMinMatch) GetFrequency(text string) (map[string]int, error)

GetFrequency returns token frequencies; it implements the Tokenizer interface.

func (*ReverseMinMatch) LoadDict

func (rmm *ReverseMinMatch) LoadDict() error

LoadDict loads the dict; it implements the Tokenizer interface.

type StopTokens

type StopTokens struct {
	IsLoaded bool
	// contains filtered or unexported fields
}

StopTokens records paths and records

func NewStopTokens

func NewStopTokens() *StopTokens

NewStopTokens returns a newly initialized StopTokens object

func (*StopTokens) IsStopToken

func (st *StopTokens) IsStopToken(token string) bool

IsStopToken reports whether the token is a stop token.

func (*StopTokens) Load

func (st *StopTokens) Load(path string) error

Load loads the stop-token dict.

type Tokenizer

type Tokenizer interface {
	GetFrequency(text string) (map[string]int, error)
	Get(text string) ([]string, error)
	LoadDict() error
}

Tokenizer defines the common tokenizer interface.
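
Because all six tokenizers satisfy this interface, calling code can be written against it and swap implementations freely. A self-contained sketch (the interface is reproduced locally, and fakeTokenizer is a hypothetical stand-in; in real code you would assign e.g. gotokenizer.NewMaxMatch(dictPath)):

```go
package main

import "fmt"

// Tokenizer mirrors the package's interface so this example is self-contained.
type Tokenizer interface {
	GetFrequency(text string) (map[string]int, error)
	Get(text string) ([]string, error)
	LoadDict() error
}

// fakeTokenizer is an illustration-only implementation that treats the
// whole input as a single token.
type fakeTokenizer struct{}

func (f fakeTokenizer) Get(text string) ([]string, error) { return []string{text}, nil }
func (f fakeTokenizer) GetFrequency(text string) (map[string]int, error) {
	return map[string]int{text: 1}, nil
}
func (f fakeTokenizer) LoadDict() error { return nil }

// tokenize works with any Tokenizer, loading its dict before use.
func tokenize(t Tokenizer, text string) ([]string, error) {
	if err := t.LoadDict(); err != nil {
		return nil, err
	}
	return t.Get(text)
}

func main() {
	tokens, _ := tokenize(fakeTokenizer{}, "hello")
	fmt.Println(tokens) // [hello]
}
```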

type WordFilter added in v1.1.0

type WordFilter interface {
	Filter(text string) bool
}
