kagome

Version: v0.2.0 (not the latest version of this module)
Published: Oct 10, 2014 License: Apache-2.0 Imports: 11 Imported by: 0

README

Kagome Japanese Morphological Analyzer

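Kagome (籠目) is a Japanese morphological analyzer written in pure Go. The dictionary is encoded into the source and bundled with it, so the binary runs on its own. The dictionary data comes from MeCab-IPADIC.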

% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Install

Source
% go get github.com/ikawaha/kagome/...

Usage

Morphological analysis
% kagome -h
usage: kagome [-f input_file] [-u userdic_file]
  -f="": input file
  -u="": user dic

The user dictionary uses the kuromoji format. A sample is provided in _sample.

第68代横綱朝青龍
第	接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
68	名詞,数,*,*,*,*,*
代	名詞,接尾,助数詞,*,*,*,代,ダイ,ダイ
横綱	名詞,一般,*,*,*,*,横綱,ヨコヅナ,ヨコズナ
朝青龍	カスタム人名,朝青龍,アサショウリュウ
EOS

Inspecting the analysis

The lattice tool outputs the state of the analysis in graphviz dot format. Rendering the graph requires a separate graphviz installation.

% lattice -v すもももももももものうち | dot -Tpng -o lattice.png
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

[Image: lattice graph rendered from the dot output]

License

Kagome is licensed under the Apache License v2.0 and uses the MeCab-IPADIC dictionary/statistical model. See NOTICE.txt for license details.

TODO

  • Implement a search mode
  • Clean up the API

Documentation

Index

Constants

const BosEosId int = -1

BosEosId is the reserved node id.

Variables

This section is empty.

Functions

This section is empty.

Types

type ConnectionTable

type ConnectionTable struct {
	Row, Col int
	Vec      []int16
}

ConnectionTable represents a connection matrix of morphs.

func (*ConnectionTable) At

func (ct *ConnectionTable) At(row, col int) int16

At returns the connection cost of matrix[row, col].
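
The following is a minimal sketch of how a lookup like At can work over a flattened matrix. The lowercase type connectionTable is a local stand-in that mirrors the documented fields, and the row-major layout of Vec is an assumption of this sketch, not something the documentation states.

```go
package main

import "fmt"

// connectionTable mirrors the documented ConnectionTable fields.
// The row-major layout of Vec is an assumption made for this sketch.
type connectionTable struct {
	Row, Col int
	Vec      []int16
}

// at returns the cost stored at (row, col), assuming row-major storage.
func (ct *connectionTable) at(row, col int) int16 {
	return ct.Vec[row*ct.Col+col]
}

func main() {
	// A 2x3 matrix flattened row by row.
	ct := &connectionTable{
		Row: 2, Col: 3,
		Vec: []int16{
			10, 20, 30,
			40, 50, 60,
		},
	}
	fmt.Println(ct.at(0, 2), ct.at(1, 0)) // prints "30 40"
}
```

Storing the matrix in one slice keeps the costs contiguous in memory, which suits the many lookups a lattice search performs.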

type Dic

type Dic struct {
	Morphs       []Morph
	Contents     [][]string
	Connection   ConnectionTable
	Index        Trie
	IndexDup     map[int]int
	CharClass    []string
	CharCategory []byte
	InvokeList   []bool
	GroupList    []bool

	UnkMorphs   []Morph
	UnkIndex    map[int]int
	UnkIndexDup map[int]int
	UnkContents [][]string
}

Dic represents a dictionary of a tokenizer.

func NewSysDic

func NewSysDic() (dic *Dic)

NewSysDic returns the kagome system dictionary.

type DoubleArray

type DoubleArray []struct {
	Base, Check int
}

DoubleArray represents the TRIE data structure.

func (*DoubleArray) Build

func (d *DoubleArray) Build(keywords []string) (err error)

Build constructs a double array from given keywords.

func (*DoubleArray) BuildWithIds

func (d *DoubleArray) BuildWithIds(keywords []string, ids []int) (err error)

BuildWithIds constructs a double array from given keywords and ids.

func (*DoubleArray) CommonPrefixSearchBytes

func (d *DoubleArray) CommonPrefixSearchBytes(input []byte) (ids, lens []int)

CommonPrefixSearchBytes finds the keywords that are prefixes of the input and returns their ids and lengths, if any are found.

func (*DoubleArray) CommonPrefixSearchString

func (d *DoubleArray) CommonPrefixSearchString(input string) (ids, lens []int)

CommonPrefixSearchString finds the keywords that are prefixes of the input and returns their ids and lengths, if any are found.

func (*DoubleArray) FindBytes

func (d *DoubleArray) FindBytes(input []byte) (id int, ok bool)

FindBytes searches the trie for a given keyword and returns its id if found.

func (*DoubleArray) FindString

func (d *DoubleArray) FindString(input string) (id int, ok bool)

FindString searches the trie for a given keyword and returns its id if found.

func (*DoubleArray) PrefixSearchBytes

func (d *DoubleArray) PrefixSearchBytes(input []byte) (id int, ok bool)

PrefixSearchBytes returns the id of the longest keyword that is a prefix of the input, if found.

func (*DoubleArray) PrefixSearchString

func (d *DoubleArray) PrefixSearchString(input string) (id int, ok bool)

PrefixSearchString returns the id of the longest keyword that is a prefix of the input, if found.

type Morph

type Morph struct {
	LeftId, RightId, Weight int16
}

Morph represents parts of speech and an occurrence cost.

type NodeClass

type NodeClass int

NodeClass represents a node type.

const (
	DUMMY NodeClass = iota
	KNOWN
	UNKNOWN
	USER
)

NodeClass codes.

func (NodeClass) String

func (nc NodeClass) String() string

String returns a string representation of a node class.
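
As an illustration of the pattern, the sketch below defines a local nodeClass type with the same four constants and a String method. The returned labels are placeholders chosen for this example; the strings the package actually returns are not documented here.

```go
package main

import "fmt"

// nodeClass mirrors the documented NodeClass type and its constants.
type nodeClass int

const (
	dummy nodeClass = iota
	known
	unknown
	user
)

// String returns a label for the class. The labels are placeholders
// for this sketch, not necessarily the package's actual output.
func (nc nodeClass) String() string {
	switch nc {
	case dummy:
		return "DUMMY"
	case known:
		return "KNOWN"
	case unknown:
		return "UNKNOWN"
	case user:
		return "USER"
	}
	return fmt.Sprintf("nodeClass(%d)", int(nc))
}

func main() {
	// fmt picks up the String method through the fmt.Stringer interface.
	fmt.Println(known, user) // prints "KNOWN USER"
}
```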

type Token

type Token struct {
	Id      int
	Class   NodeClass
	Start   int
	End     int
	Surface string
	// contains filtered or unexported fields
}

Token represents a morph of a sentence.

func (Token) Features

func (t Token) Features() (features []string)

Features returns contents of a token.
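
The command-line output shown in the README pairs each surface form with its comma-separated features. The sketch below reproduces that layout with a local token type standing in for Token; the helper formatLine and the unexported features field are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a stand-in for the documented Token type. Only the parts
// needed for formatting are modeled here.
type token struct {
	Surface  string
	features []string
}

// Features returns the feature columns of the token.
func (t token) Features() []string { return t.features }

// formatLine renders a token as "surface<TAB>feature,feature,...".
func formatLine(t token) string {
	return t.Surface + "\t" + strings.Join(t.Features(), ",")
}

func main() {
	tokens := []token{
		{"すもも", []string{"名詞", "一般", "*", "*", "*", "*", "すもも", "スモモ", "スモモ"}},
		{"も", []string{"助詞", "係助詞", "*", "*", "*", "*", "も", "モ", "モ"}},
	}
	for _, t := range tokens {
		fmt.Println(formatLine(t))
	}
	fmt.Println("EOS")
}
```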

func (Token) String

func (t Token) String() string

String returns a string representation of a token.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer represents a morphological analyzer.

func NewTokenizer

func NewTokenizer() (t *Tokenizer)

NewTokenizer creates a tokenizer.

func (*Tokenizer) Dot

func (t *Tokenizer) Dot(input string, w io.Writer) (tokens []Token)

Dot returns the morphs of a sentence and writes the lattice graph to w in dot format.

func (*Tokenizer) SetDic

func (t *Tokenizer) SetDic(dic *Dic)

SetDic sets the tokenizer's dictionary to dic.

func (*Tokenizer) SetUserDic

func (t *Tokenizer) SetUserDic(udic *UserDic)

SetUserDic sets the tokenizer's user dictionary to udic.

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(input string) (tokens []Token)

Tokenize returns the morphs of a sentence.

type Trie

type Trie interface {
	FindString(string) (id int, ok bool)               // search a dictionary by a keyword.
	CommonPrefixSearchString(string) (ids, lens []int) // finds keywords sharing common prefix in a dictionary.
}

Any type that implements the Trie interface may be used as a dictionary.
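
To make this concrete, here is a minimal sketch of a map-backed type that satisfies the same two-method interface. The names trie and mapTrie are hypothetical, and reporting lengths in bytes is an assumption of this sketch; a production dictionary would use a structure such as the double array above rather than a linear scan over prefixes.

```go
package main

import "fmt"

// trie copies the two-method interface documented above.
type trie interface {
	FindString(string) (id int, ok bool)
	CommonPrefixSearchString(string) (ids, lens []int)
}

// mapTrie is a hypothetical map-backed implementation: keyword -> id.
type mapTrie map[string]int

// FindString looks up an exact keyword.
func (m mapTrie) FindString(key string) (int, bool) {
	id, ok := m[key]
	return id, ok
}

// CommonPrefixSearchString reports every keyword that is a prefix of
// input, shortest first. Lengths are byte lengths (an assumption here).
// A prefix cut in the middle of a rune never matches a valid UTF-8 key.
func (m mapTrie) CommonPrefixSearchString(input string) (ids, lens []int) {
	for end := 1; end <= len(input); end++ {
		if id, ok := m[input[:end]]; ok {
			ids = append(ids, id)
			lens = append(lens, end)
		}
	}
	return ids, lens
}

// Compile-time check that mapTrie satisfies the interface.
var _ trie = mapTrie{}

func main() {
	m := mapTrie{"もも": 1, "も": 2, "すもも": 3}
	ids, lens := m.CommonPrefixSearchString("ももの")
	fmt.Println(ids, lens) // prints "[2 1] [3 6]"
}
```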

type UserDic

type UserDic struct {
	Index    Trie
	Contents []UserDicContent
}

UserDic represents a user dictionary.

func NewUserDic

func NewUserDic(path string) (udic *UserDic, err error)

NewUserDic builds a user dictionary from a file.

type UserDicContent

type UserDicContent struct {
	Tokens []string
	Yomi   []string
	Pos    string
}

UserDicContent represents contents of a word in a user dictionary.
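
The sketch below shows how one record of a user dictionary could map onto these fields. It assumes the kuromoji layout of four comma-separated columns (surface, segmentation, readings, part of speech) with space-separated segments and readings. The function parseLine, the local userDicContent type, and the sample record are hypothetical and are not taken from the package or its _sample file.

```go
package main

import (
	"fmt"
	"strings"
)

// userDicContent mirrors the documented UserDicContent fields.
type userDicContent struct {
	Tokens []string
	Yomi   []string
	Pos    string
}

// parseLine parses one record, assuming the four-column layout
// "surface,segmentation,readings,part-of-speech".
func parseLine(line string) (surface string, c userDicContent, err error) {
	cols := strings.Split(line, ",")
	if len(cols) != 4 {
		return "", userDicContent{}, fmt.Errorf("want 4 columns, got %d", len(cols))
	}
	c = userDicContent{
		Tokens: strings.Fields(cols[1]), // segments of the surface form
		Yomi:   strings.Fields(cols[2]), // one reading per segment
		Pos:    cols[3],
	}
	if len(c.Tokens) != len(c.Yomi) {
		return "", userDicContent{}, fmt.Errorf("%d segments but %d readings", len(c.Tokens), len(c.Yomi))
	}
	return cols[0], c, nil
}

func main() {
	// Hypothetical record, modeled on the custom entry in the sample output.
	surface, c, err := parseLine("朝青龍,朝青龍,アサショウリュウ,カスタム人名")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s %q %q %s\n", surface, c.Tokens, c.Yomi, c.Pos)
}
```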

Directories

Path Synopsis
_dictool
ipa
cmd
