kagome

Version: v0.3.0
Published: Oct 19, 2014 License: Apache-2.0 Imports: 12 Imported by: 0

README


Kagome Japanese Morphological Analyzer

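Kagome (籠目) is a Japanese morphological analyzer written in pure Go. The dictionary is encoded into the source and bundled with it, so the binary runs on its own. The dictionary data comes from MeCab-IPADIC.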

% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Install

Source
% go get github.com/ikawaha/kagome/...

Usage

$ kagome -h
usage: kagome [-file input_file | --http addr] [-udic userdic_file] [-mode (search|extended)]
  -file="": input file
  -http="": HTTP service address (e.g., ':6060')
  -mode="": tokenize mode
  -udic="": user dic
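Analyzing standard input or a specified file

When an input file is specified, each line is analyzed as one sentence. The file must be encoded in UTF-8. When no file is specified, input is read from standard input and each line is analyzed as one sentence.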

$ kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
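
Each output line above appears to hold the surface form, a tab, and a comma-separated feature list, with EOS closing the sentence. A minimal sketch of reading that format from another Go program follows; parseLine is a hypothetical helper, not part of kagome, and the tab separator is inferred from the sample output.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine splits one line of the command-line output into the surface
// form and its features. It reports ok=false for the "EOS" sentence
// terminator and for lines that contain no tab.
func parseLine(line string) (surface string, features []string, ok bool) {
	if line == "EOS" {
		return "", nil, false
	}
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) != 2 {
		return "", nil, false
	}
	return parts[0], strings.Split(parts[1], ","), true
}

func main() {
	surface, features, ok := parseLine("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
	fmt.Println(surface, len(features), ok) // すもも 9 true
}
```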
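Segmentation modes for search

Segmentation equivalent to kuromoji's search mode is available.

  • Normal: standard segmentation
  • Search: applies heuristics to segment more finely, which helps search
  • Extended: in addition to search mode, splits unknown words into unigrams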
Input | Normal mode | Search mode | Extended mode
関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港
日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア
デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た
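
The last example suggests that extended mode emits an unknown word such as デジカメ one character at a time. The unigram split itself can be sketched without kagome as follows; the helper name unigrams is hypothetical, and deciding which words count as unknown is left to the analyzer.

```go
package main

import "fmt"

// unigrams splits s into one string per Unicode code point.
func unigrams(s string) []string {
	out := make([]string, 0, len(s))
	for _, r := range s {
		out = append(out, string(r))
	}
	return out
}

func main() {
	fmt.Println(unigrams("デジカメ")) // [デ ジ カ メ]
}
```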
HTTP service

When run as a server, kagome provides the following two features.

Web API

Specifying the -http option starts a web server. If the server is started on localhost port 8080, it can be accessed via REST at 'http://localhost:8080/'.

$ kagome -http=":8080" &
$ curl -XPUT localhost:8080 -d'{"sentence":"すもももももももものうち"}'
{"status":true,"tokens":[{"id":36163,"start":0,"end":3,"surface":"すもも","class":"KNOWN","features":["名詞","一般","*","*","*","*","すもも","スモモ","スモモ"]},{"id":73244,"start":3,"end":4,"surface":"も","class":"KNOWN","features":["助詞","係助詞","*","*","*","*","も","モ","モ"]},{"id":74989,"start":4,"end":6,"surface":"もも","class":"KNOWN","features":["名詞","一般","*","*","*","*","もも","モモ","モモ"]},{"id":73244,"start":6,"end":7,"surface":"も","class":"KNOWN","features":["助詞","係助詞","*","*","*","*","も","モ","モ"]},{"id":74989,"start":7,"end":9,"surface":"もも","class":"KNOWN","features":["名詞","一般","*","*","*","*","もも","モモ","モモ"]},{"id":55829,"start":9,"end":10,"surface":"の","class":"KNOWN","features":["助詞","連体化","*","*","*","*","の","ノ","ノ"]},{"id":8024,"start":10,"end":12,"surface":"うち","class":"KNOWN","features":["名詞","非自立","副詞可能","*","*","*","うち","ウチ","ウチ"]}]}
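
A client written in Go could build the request body and decode the response with encoding/json. The sketch below takes its field names from the sample exchange above. The struct and function names are hypothetical, and it does not contact a server. In the sample, start and end look like character offsets rather than byte offsets, since すもも spans 0 to 3.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// token mirrors one element of the "tokens" array in the sample response.
type token struct {
	ID       int      `json:"id"`
	Start    int      `json:"start"`
	End      int      `json:"end"`
	Surface  string   `json:"surface"`
	Class    string   `json:"class"`
	Features []string `json:"features"`
}

// response mirrors the top-level object in the sample response.
type response struct {
	Status bool    `json:"status"`
	Tokens []token `json:"tokens"`
}

// requestBody builds the JSON document sent with the PUT request.
func requestBody(sentence string) ([]byte, error) {
	return json.Marshal(map[string]string{"sentence": sentence})
}

// decodeResponse parses a response body into the structs above.
func decodeResponse(body []byte) (response, error) {
	var r response
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	body := []byte(`{"status":true,"tokens":[{"id":36163,"start":0,"end":3,"surface":"すもも","class":"KNOWN","features":["名詞","一般","*","*","*","*","すもも","スモモ","スモモ"]}]}`)
	r, err := decodeResponse(body)
	if err != nil {
		panic(err)
	}
	for _, t := range r.Tokens {
		fmt.Println(t.Surface, t.Start, t.End, t.Class)
	}
}
```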
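Morphological analysis demo

With the web server running, open /_demo in a browser to try a morphological analysis demo. With -http=:8080, the URL is http://localhost:8080/_demo. Displaying the lattice requires graphviz. (The demo times out when the lattice is so large that rendering takes too long. In that case, try the lattice tool described below.)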


User dictionary

The user dictionary uses the kuromoji format. A sample is available in _sample.

% kagome
第68代横綱朝青龍
第	接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
68	名詞,数,*,*,*,*,*
代	名詞,接尾,助数詞,*,*,*,代,ダイ,ダイ
横綱	名詞,一般,*,*,*,*,横綱,ヨコヅナ,ヨコズナ
朝青龍	カスタム人名,朝青龍,アサショウリュウ
EOS
Inspecting the analysis

The lattice tool outputs the state of the analysis in graphviz dot format. Rendering the graph requires a separate graphviz installation.

$ lattice -v すもももももももものうち  |dot -Tpng -o lattice.png
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS


License

Kagome is licensed under the Apache License v2.0 and uses the MeCab-IPADIC dictionary/statistical model. See NOTICE.txt for license details.

Documentation

Index

Constants

const BosEosId int = -1

Reserved identifier of node id.

Variables

This section is empty.

Functions

This section is empty.

Types

type ConnectionTable

type ConnectionTable struct {
	Row, Col int
	Vec      []int16
}

ConnectionTable represents a connection matrix of morphs.

func (*ConnectionTable) At

func (ct *ConnectionTable) At(row, col int) int16

At returns the connection cost of matrix[row, col].
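
Because the matrix is stored in the flat slice Vec, a lookup has to map (row, col) to a single index. The sketch below assumes row-major storage; the page does not state the layout, so this is a guess. It uses a local stand-in type rather than the package's own.

```go
package main

import "fmt"

// connectionTable is a local stand-in with the same fields as ConnectionTable.
type connectionTable struct {
	Row, Col int
	Vec      []int16
}

// at assumes Vec stores the matrix row by row (row-major order).
func (ct *connectionTable) at(row, col int) int16 {
	return ct.Vec[row*ct.Col+col]
}

func main() {
	ct := &connectionTable{Row: 2, Col: 3, Vec: []int16{10, 11, 12, 20, 21, 22}}
	fmt.Println(ct.at(1, 2)) // 22
}
```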

type Dic

type Dic struct {
	Morphs       []Morph
	Contents     [][]string
	Connection   ConnectionTable
	Index        Trie
	IndexDup     map[int]int
	CharClass    []string
	CharCategory []byte
	InvokeList   []bool
	GroupList    []bool

	UnkMorphs   []Morph
	UnkIndex    map[int]int
	UnkIndexDup map[int]int
	UnkContents [][]string
}

Dic represents a dictionary of a tokenizer.

func NewSysDic

func NewSysDic() (dic *Dic)

NewSysDic returns the kagome system dictionary.

type DoubleArray

type DoubleArray []struct {
	Base, Check int
}

DoubleArray represents the TRIE data structure.

func (*DoubleArray) Build

func (d *DoubleArray) Build(keywords []string) (err error)

Build constructs a double array from given keywords.

func (*DoubleArray) BuildWithIds

func (d *DoubleArray) BuildWithIds(keywords []string, ids []int) (err error)

BuildWithIds constructs a double array from given keywords and ids.

func (*DoubleArray) CommonPrefixSearchBytes

func (d *DoubleArray) CommonPrefixSearchBytes(input []byte) (ids, lens []int)

CommonPrefixSearchBytes finds keywords that are common prefixes of an input and returns their ids and lengths if found.

func (*DoubleArray) CommonPrefixSearchString

func (d *DoubleArray) CommonPrefixSearchString(input string) (ids, lens []int)

CommonPrefixSearchString finds keywords that are common prefixes of an input and returns their ids and lengths if found.

func (*DoubleArray) FindBytes

func (d *DoubleArray) FindBytes(input []byte) (id int, ok bool)

FindBytes searches TRIE by a given keyword and returns the id if found.

func (*DoubleArray) FindString

func (d *DoubleArray) FindString(input string) (id int, ok bool)

FindString searches TRIE by a given keyword and returns the id if found.

func (*DoubleArray) PrefixSearchBytes

func (d *DoubleArray) PrefixSearchBytes(input []byte) (id int, ok bool)

PrefixSearchBytes returns the longest common prefix keyword in an input if found.

func (*DoubleArray) PrefixSearchString

func (d *DoubleArray) PrefixSearchString(input string) (id int, ok bool)

PrefixSearchString returns the longest common prefix keyword in an input if found.

type Morph

type Morph struct {
	LeftId, RightId, Weight int16
}

Morph represents parts of speech and an occurrence cost.
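
In MeCab-style analyzers, a candidate segmentation is usually scored by adding each morph's occurrence cost to the connection cost between neighbouring morphs. The sketch below illustrates that scoring under stated assumptions and is not taken from kagome's source: pathCost and the conn callback are hypothetical, the pairing of the previous morph's RightId with the next morph's LeftId is an assumption, and the BOS/EOS connections are left out.

```go
package main

import "fmt"

// morph is a local stand-in with the same fields as Morph.
type morph struct {
	LeftId, RightId, Weight int16
}

// pathCost sums the occurrence cost of every morph on the path and the
// connection cost between each adjacent pair.
func pathCost(path []morph, conn func(prevRightID, nextLeftID int16) int16) int {
	total := 0
	for i, m := range path {
		total += int(m.Weight)
		if i > 0 {
			total += int(conn(path[i-1].RightId, m.LeftId))
		}
	}
	return total
}

func main() {
	path := []morph{
		{LeftId: 1, RightId: 2, Weight: 100},
		{LeftId: 3, RightId: 4, Weight: 200},
	}
	// A made-up connection cost function, used for illustration only.
	conn := func(prevRightID, nextLeftID int16) int16 { return prevRightID + nextLeftID }
	fmt.Println(pathCost(path, conn)) // 305
}
```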

type NodeClass

type NodeClass int

NodeClass represents a node type.

const (
	DUMMY NodeClass = iota
	KNOWN
	UNKNOWN
	USER
)

NodeClass codes.

func (NodeClass) String

func (nc NodeClass) String() string

String returns a string representation of a node class.
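
The Web API sample in the README prints a token's class as "KNOWN", which suggests that String returns the constant's name. A sketch of such a Stringer on a local stand-in type follows; the strings for the other three values and the fallback format are assumptions.

```go
package main

import "fmt"

// nodeClass is a local stand-in for NodeClass.
type nodeClass int

const (
	dummy nodeClass = iota
	known
	unknown
	user
)

// String returns the constant's name. Only "KNOWN" is confirmed by the
// README sample; the other names are assumed.
func (nc nodeClass) String() string {
	switch nc {
	case dummy:
		return "DUMMY"
	case known:
		return "KNOWN"
	case unknown:
		return "UNKNOWN"
	case user:
		return "USER"
	}
	return fmt.Sprintf("nodeClass(%d)", int(nc))
}

func main() {
	fmt.Println(known, user) // KNOWN USER
}
```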

type Token

type Token struct {
	Id      int
	Class   NodeClass
	Start   int
	End     int
	Surface string
	// contains filtered or unexported fields
}

Token represents a morph of a sentence.

func (Token) Features

func (t Token) Features() (features []string)

Features returns contents of a token.

func (Token) String

func (t Token) String() string

String returns a string representation of a token.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a morphological analyzer.

func NewThreadsafeTokenizer added in v0.3.0

func NewThreadsafeTokenizer() (t *Tokenizer)

NewThreadsafeTokenizer creates a threadsafe tokenizer.

func NewTokenizer

func NewTokenizer() (t *Tokenizer)

NewTokenizer creates a tokenizer.

func (*Tokenizer) Dot

func (t *Tokenizer) Dot(input string, w io.Writer) (tokens []Token)

Dot returns morphs of a sentence and exports a lattice graph in dot format.

func (*Tokenizer) ExtendedModeTokenize added in v0.3.0

func (t *Tokenizer) ExtendedModeTokenize(input string) (tokens []Token)

ExtendedModeTokenize returns morphs of a sentence, segmented in extended mode.

func (*Tokenizer) SearchModeTokenize added in v0.3.0

func (t *Tokenizer) SearchModeTokenize(input string) (tokens []Token)

SearchModeTokenize returns morphs of a sentence, segmented in search mode.

func (*Tokenizer) SetDic

func (t *Tokenizer) SetDic(dic *Dic)

SetDic sets dictionary to dic.

func (*Tokenizer) SetUserDic

func (t *Tokenizer) SetUserDic(udic *UserDic)

SetUserDic sets user dictionary to udic.

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(input string) (tokens []Token)

Tokenize returns morphs of a sentence.

type Trie

type Trie interface {
	FindString(string) (id int, ok bool)               // search a dictionary by a keyword.
	CommonPrefixSearchString(string) (ids, lens []int) // finds keywords sharing common prefix in a dictionary.
}

Any type that implements the Trie interface may be used as a dictionary.
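
To illustrate what satisfying the interface involves, here is a naive map-backed sketch. The type mapTrie is hypothetical and far slower than a double array. Returning byte lengths in shortest-first order is an assumption, since the page does not say whether lens counts bytes or characters.

```go
package main

import "fmt"

// trie repeats the Trie interface so that the sketch compiles on its own.
type trie interface {
	FindString(string) (id int, ok bool)
	CommonPrefixSearchString(string) (ids, lens []int)
}

// mapTrie maps each keyword to its id.
type mapTrie map[string]int

// Compile-time check that mapTrie satisfies the interface.
var _ trie = mapTrie{}

// FindString looks up an exact keyword.
func (m mapTrie) FindString(key string) (int, bool) {
	id, ok := m[key]
	return id, ok
}

// CommonPrefixSearchString reports every keyword that is a prefix of input,
// shortest first, together with its length in bytes.
func (m mapTrie) CommonPrefixSearchString(input string) (ids, lens []int) {
	if input == "" {
		return nil, nil
	}
	// Ranging over a string yields the byte offset of each rune start,
	// so input[:i] always ends on a character boundary.
	for i := range input {
		if i == 0 {
			continue
		}
		if id, ok := m[input[:i]]; ok {
			ids, lens = append(ids, id), append(lens, i)
		}
	}
	if id, ok := m[input]; ok {
		ids, lens = append(ids, id), append(lens, len(input))
	}
	return ids, lens
}

func main() {
	m := mapTrie{"す": 1, "すもも": 2, "もも": 3}
	ids, lens := m.CommonPrefixSearchString("すもももも")
	fmt.Println(ids, lens) // [1 2] [3 9]
}
```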

type UserDic

type UserDic struct {
	Index    Trie
	Contents []UserDicContent
}

UserDic represents a user dictionary.

func NewUserDic

func NewUserDic(path string) (udic *UserDic, err error)

NewUserDic builds a user dictionary from a file.

type UserDicContent

type UserDicContent struct {
	Tokens []string
	Yomi   []string
	Pos    string
}

UserDicContent represents contents of a word in a user dictionary.
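
The README states that the user dictionary uses the kuromoji format. As commonly documented, a kuromoji entry is one CSV line with four columns: surface form, space-separated segmentation, space-separated readings, and part of speech. The sketch below shows how such a line could map onto the three fields above. The parser and its name are hypothetical, and the column layout is an assumption that this page does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// userDicContent is a local stand-in with the same fields as UserDicContent.
type userDicContent struct {
	Tokens []string
	Yomi   []string
	Pos    string
}

// parseUserDicLine parses "surface,segmentation,readings,pos".
func parseUserDicLine(line string) (surface string, c userDicContent, err error) {
	fields := strings.Split(line, ",")
	if len(fields) != 4 {
		return "", c, fmt.Errorf("want 4 comma-separated fields, got %d", len(fields))
	}
	c = userDicContent{
		Tokens: strings.Fields(fields[1]),
		Yomi:   strings.Fields(fields[2]),
		Pos:    fields[3],
	}
	// Each segment needs exactly one reading.
	if len(c.Tokens) != len(c.Yomi) {
		return "", c, fmt.Errorf("%d tokens but %d readings", len(c.Tokens), len(c.Yomi))
	}
	return fields[0], c, nil
}

func main() {
	surface, c, err := parseUserDicLine("朝青龍,朝青龍,アサショウリュウ,カスタム人名")
	if err != nil {
		panic(err)
	}
	fmt.Println(surface, c.Tokens, c.Yomi, c.Pos)
}
```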

Directories

Path Synopsis
_dictool
ipa
cmd
