hangulize

Version: v0.3.5 (not the latest version of its module)

Published: Jan 16, 2020 License: MIT Imports: 17 Imported by: 0

README

Hangulize

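(WIP: still under development, and the API may change arbitrarily!)

For a system of writing foreign words in Hangul to take hold, ordinary people must be able to look up an established transcription quickly and easily whenever they want to write a foreign word in Hangul. Holding regular meetings to decide on transcriptions has its limits. The loanword transcription review process should be automated so that the Hangul spelling appears as soon as the foreign word is entered. Where a transcription has already been established it should be followed, and where none exists a recommended transcription should still be shown according to each language's transcription rules. If programmers and linguists work on this together, it need not remain a fantasy.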

Brian Jongseong Park (http://iceager.egloos.com/2610028)

Hangulize is a tool that transcribes loanwords into Hangul.

$ go get -u github.com/hangulize/hangulize

import "github.com/hangulize/hangulize"

hangulize.Hangulize("ita", "Cappuccino")
// output: "카푸치노"

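Remake

The Hangulize project was first implemented in Python in 2010 and has provided the hangulize.org service so that anyone can use it easily on the web. The source code and related materials for the Python implementation are available at sublee/hangulize.

In 2018, we decided to remake it in order to improve performance, code maintainability, and productivity when designing rules, while inheriting all the features of the original Hangulize. The new implementation is written in Go.

The project was initially named "Hangulize 2", but since it looked able to fully replace the original Hangulize, the number 2 was dropped and it was renamed "Hangulize".

For the full story, see the Hangulize remake write-up (한글라이즈 재제작기).

Goals
  • Inherit all features of the original Hangulize (Python implementation)
  • Use static files (.hgl) for rule design
  • A convenient rule design environment
  • Thorough documentation of how to design rules
  • Apply to hangulize.org
  • Revamp hangulize.org

Performance

The old Hangulize has O(n²) time complexity, whereas the new Hangulize has O(n) time complexity.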

IMPLEMENTATION   ENVIRONMENT        ita: Cappuccino   nld: Juliana Louise Emma Marie Wilhelmina
Old Hangulize    CPython 3.6.3      398 µs            1.46 ms
Old Hangulize    PyPy 3.5 v5.8.0    208 µs            9.79 ms
New Hangulize    Go 1.10.2          85 µs             1.05 ms

Supported languages

LANG     STAGE    ENG                      KOR
aze      draft    Azerbaijani              아제르바이잔어
bel      draft    Belarusian               벨라루스어
bul      draft    Bulgarian                불가리아어
cat      draft    Catalan                  카탈로니아어
ces      draft    Czech                    체코어
chi      draft    Chinese                  중국어
cym      draft    Welsh                    웨일스어
deu      draft    German                   독일어
ell      draft    Greek                    그리스어
epo      draft    Esperanto                에스페란토어
est      draft    Estonian                 에스토니아어
fin      draft    Finnish                  핀란드어
grc      draft    Ancient Greek            고대 그리스어
hbs      draft    Serbo-Croatian           세르보크로아트어
hun      draft    Hungarian                헝가리어
isl      draft    Icelandic                아이슬란드어
ita      draft    Italian                  이탈리아어
jpn      draft    Japanese                 일본어
jpn-ck   draft    Japanese (C.K.)          일본어(최영애-김용옥)
kat-1    draft    Georgian (1st scheme)    조지아어(제1안)
kat-2    draft    Georgian (2nd scheme)    조지아어(제2안)
lat      draft    Latin                    라틴어
lav      draft    Latvian                  라트비아어
lit      draft    Lithuanian               리투아니아어
mkd      draft    Macedonian               마케도니아어
nld      draft    Dutch                    네덜란드어
pol      draft    Polish                   폴란드어
por      draft    Portuguese               포르투갈어
por-br   draft    Brazilian Portuguese     브라질 포르투갈어
ron      draft    Romanian                 루마니아어
rus      draft    Russian                  러시아어
slk      draft    Slovak                   슬로바키아어
slv      draft    Slovenian                슬로베니아어
spa      draft    Spanish                  스페인어
sqi      draft    Albanian                 알바니아어
swe      draft    Swedish                  스웨덴어
tur      draft    Turkish                  터키어
ukr      draft    Ukrainian                우크라이나어
vie      draft    Vietnamese               베트남어
wlm      draft    Middle Welsh             웨일스어(중세)

Further reading

Authors

License

Hangulize is released under the MIT License. If you use the source code, please comply with the license terms. The full license text is available in the LICENSE file.

Documentation

Overview

Package hangulize transcribes non-Korean words into Hangul.

"Hello!" -> "헬로!"

Hangulize was inspired by Brian Jongseong Park (http://iceager.egloos.com/2610028). Based on this idea, the original Hangulize was developed in Python and released in 2010 (https://github.com/sublee/hangulize). Since then, running as a web application at https://hangulize.org/, it has been a great help to Korean translators.

This Go re-implementation is a reboot of Hangulize with feature improvements.

Pipeline

Hangulize transcribes in 5 steps: "Normalize", "Group", "Rewrite", "Transcribe", and "Syllabify". To clarify these concepts, consider an imaginary example that transcribes the English "Hello!" into "헬로!" (English is not actually supported yet).

First, Hangulize normalizes letter cases:

"Hello!" -> "hello!"

And then, it groups letters by meanings:

"hello!" -> "hello", "!"

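As a rough illustration of the two steps above, the following self-contained Go sketch lowercases a word and splits it into runs of letters and runs of non-letters. The function names normalize and group are hypothetical and are not part of the hangulize API; the real steps are driven by the spec and may behave differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalize is a toy stand-in for the Normalize step: it only lowercases.
func normalize(word string) string {
	return strings.ToLower(word)
}

// group is a toy stand-in for the Group step: it splits the word into
// consecutive runs of letters and consecutive runs of non-letters.
func group(word string) []string {
	var chunks []string
	var cur []rune
	curIsLetter := false
	for _, r := range word {
		isLetter := unicode.IsLetter(r)
		if len(cur) > 0 && isLetter != curIsLetter {
			chunks = append(chunks, string(cur))
			cur = cur[:0]
		}
		cur = append(cur, r)
		curIsLetter = isLetter
	}
	if len(cur) > 0 {
		chunks = append(chunks, string(cur))
	}
	return chunks
}

func main() {
	fmt.Printf("%q\n", group(normalize("Hello!"))) // ["hello" "!"]
}
```
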
After that, the grouped chunks are rewritten according to rules specific to the source language. This step usually minimizes the differences between pronunciation and spelling:

"hello", "!" -> "heˈlō", "!"

And it transcribes rewritten chunks into Hangul Jamo phonemes.

"heˈlō", "!" -> "ㅎㅔ-ㄹㄹㅗ", "!"

Finally, it composes Jamo phonemes into Hangul syllabic blocks and joins all groups.

"ㅎㅔ-ㄹㄹㅗ", "!" -> "헬로!"

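The composition of Jamo into syllabic blocks can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (0xAC00 + (lead*21 + vowel)*28 + tail). This is a minimal sketch and not the package's internal jamo composer: the compose function and its tables are assumptions made for illustration, and it does not handle the "-" marker or the doubled "ㄹ" shown in the example above.

```go
package main

import "fmt"

// Hangul Compatibility Jamo in the order used by the Unicode syllable formula.
const (
	leads  = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
	vowels = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
	tails  = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"
)

// indexOf returns the rune index (not the byte index) of r in s, or -1.
func indexOf(s string, r rune) int {
	for i, c := range []rune(s) {
		if c == r {
			return i
		}
	}
	return -1
}

// compose builds one precomposed syllable from a lead consonant, a vowel,
// and an optional tail consonant (pass 0 for "no tail").
func compose(lead, vowel, tail rune) (rune, bool) {
	l := indexOf(leads, lead)
	v := indexOf(vowels, vowel)
	if l < 0 || v < 0 {
		return 0, false
	}
	t := 0
	if tail != 0 {
		t = indexOf(tails, tail) + 1 // tail index 0 is reserved for "none"
		if t == 0 {
			return 0, false
		}
	}
	return rune(0xAC00 + (l*21+v)*28 + t), true
}

func main() {
	hel, _ := compose('ㅎ', 'ㅔ', 'ㄹ')
	lo, _ := compose('ㄹ', 'ㅗ', 0)
	fmt.Println(string([]rune{hel, lo}) + "!") // 헬로!
}
```
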
Extended Pipeline

Some languages, such as Japanese, may require 2 more steps: "Phonemize" and "Transliterate". The former runs before the Normalize step, and the latter runs after the Syllabify step.

Japanese uses Kanji, which are ideograms. The Kanji-to-Kana mapping is called Furigana. Getting Furigana from Kanji requires lexical analysis based on several dictionaries. The Phonemize step guesses the phonograms from a spelling based on that lexical analysis.

"日本語" -> "ニホンゴ"

Furthermore, Japanese uses full-width characters for punctuation, while Korean and European languages use half-width ones. Full-width punctuation needs to be replaced with its half-width counterpart and a space to produce a natural-looking Korean word. The Transliterate step performs this replacement.

"이마、아이니유키마스" -> "이마, 아이니유키마스"

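The replacement described above could look roughly like the following sketch. The transliterate function and its two-entry mapping table are hypothetical; the actual step may cover more punctuation marks and different spacing rules.

```go
package main

import (
	"fmt"
	"strings"
)

// punct maps full-width punctuation to a half-width mark plus a space.
// The table is an assumption for illustration, not the package's real table.
var punct = strings.NewReplacer(
	"、", ", ", // ideographic comma
	"，", ", ", // full-width comma
)

// transliterate is a toy stand-in for the Transliterate step.
func transliterate(s string) string {
	return punct.Replace(s)
}

func main() {
	fmt.Println(transliterate("이마、아이니유키마스")) // 이마, 아이니유키마스
}
```
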
Spec

A spec is written in the HGL format, which is a configuration DSL for Hangulize 2. One spec describes one language transcription system, so we first need to describe the language:

lang:
    id      = "ita"
    codes   = "it", "ita" # ISO 639-1 and 3 codes
    english = "Italian"
    korean  = "이탈리아어"
    script  = "roman"

Then write about yourself and the stage of this spec:

config:
    author = "John Doe <john@example.com>"
    stage  = "draft"

We will soon write many patterns in the rewrite/transcribe rules. Some expressions may appear annoyingly often. To avoid repeating ourselves, we can use variables and macros.

A variable is a set of letters. A variable in a pattern matches one of its letters. The variable "foo" can be referenced as "<foo>" in patterns.

vars:
    "vowels" = "a", "e", "i", "o", "u"

A macro expression is replaced with its target before the patterns are parsed. "@" is a common macro for the "<vowels>" variable:

macros:
    "@" = "<vowels>"

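To make the order of expansion concrete, here is a minimal sketch of textual macro and variable substitution. The helper names expandMacros and expandVars are hypothetical, and turning a variable into a "(?:a|e|...)" alternation is only an assumption about how such a set could be expressed; the real HRE parser may represent variables differently. Braces are HRE syntax and are left untouched here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortedKeys returns the map keys in a stable order so expansion is deterministic.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// expandMacros replaces each macro with its target text, before any parsing.
func expandMacros(pattern string, macros map[string]string) string {
	for _, k := range sortedKeys(macros) {
		pattern = strings.ReplaceAll(pattern, k, macros[k])
	}
	return pattern
}

// expandVars replaces "<name>" with an alternation of the variable's letters.
func expandVars(pattern string, vars map[string][]string) string {
	for _, name := range sortedKeys(vars) {
		alt := "(?:" + strings.Join(vars[name], "|") + ")"
		pattern = strings.ReplaceAll(pattern, "<"+name+">", alt)
	}
	return pattern
}

func main() {
	macros := map[string]string{"@": "<vowels>"}
	vars := map[string][]string{"vowels": {"a", "e", "i", "o", "u"}}

	p := expandMacros("gn{@}", macros)
	fmt.Println(p)                   // gn{<vowels>}
	fmt.Println(expandVars(p, vars)) // gn{(?:a|e|i|o|u)}
}
```
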
Now we can write "rewrite" rules. A rule consists of a Pattern and an RPattern. The Pattern matches letters in a word. The RPattern describes how the matched letters should be replaced. The word replaced by one rule becomes the input for the next rule:

rewrite:
    "^gli$"   -> "li"
    "^gli{@}" -> "li"
    "{@}gli"  -> "li"
    "gn{@}"   -> "nJ"

Pattern is based on regular expressions but has its own custom syntax. We call it "HRE", which stands for "Hangulize-specific Regular Expression". For details, see the documentation of "github.com/hangulize/hre".

"transcribe" rules have exactly the same form as "rewrite" rules, but their RPatterns represent Hangul Jamo phonemes. In contrast to "rewrite", a replaced word does not become the input for the next rules:

transcribe:
    "b" -> "ㅂ"
    "d" -> "ㄷ"
    "f" -> "ㅍ"
    "g" -> "ㄱ"

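The difference between the two rule sections can be sketched as chained replacement versus single-pass replacement. The code below uses Go's regexp and strings.Replacer as stand-ins for HRE, so the pattern syntax, the toyRules list, and the "J" -> "y" rule are all hypothetical; only the four "transcribe" mappings are taken from the spec excerpt above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rule is a toy stand-in for a Pattern/RPattern pair.
type rule struct {
	from *regexp.Regexp
	to   string
}

// toyRules are hypothetical; real specs use HRE patterns, not Go regexps.
var toyRules = []rule{
	{regexp.MustCompile(`gn([aeiou])`), "nJ${1}"},
	{regexp.MustCompile(`J`), "y"},
}

// rewrite applies rules in order; each rule sees the previous rule's output.
func rewrite(word string, rules []rule) string {
	for _, r := range rules {
		word = r.from.ReplaceAllString(word, r.to)
	}
	return word
}

// jamo replaces in a single pass; replaced text is never matched again.
var jamo = strings.NewReplacer("b", "ㅂ", "d", "ㄷ", "f", "ㅍ", "g", "ㄱ")

// transcribe is a toy stand-in for the "transcribe" section.
func transcribe(word string) string {
	return jamo.Replace(word)
}

func main() {
	fmt.Println(rewrite("gnocchi", toyRules)) // nyocchi
	fmt.Println(transcribe("bdfg"))           // ㅂㄷㅍㄱ

	// Same two mappings, chained versus single-pass.
	chained := []rule{
		{regexp.MustCompile(`a`), "b"},
		{regexp.MustCompile(`b`), "c"},
	}
	fmt.Println(rewrite("ab", chained))                                // cc
	fmt.Println(strings.NewReplacer("a", "b", "b", "c").Replace("ab")) // bc
}
```
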
Finally, we should write expected transcription examples. They are used for unit testing. Verify your spec yourself:

test:
    "allegretto" -> "알레그레토"
    "gita"       -> "지타"
    "bisnonno"   -> "비스논노"
    "Pinocchio"  -> "피노키오"
Example
// Person names from http://iceager.egloos.com/2610028
fmt.Println(Hangulize("ron", "Cătălin Moroşanu"))
fmt.Println(Hangulize("nld", "Jerrel Venetiaan"))
fmt.Println(Hangulize("por", "Vítor Constâncio"))
Output:

커털린 모로샤누
예럴 페네티안
비토르 콘스탄시우

Constants

const Version = "0.3.5"

Version is the version number of the Hangulize package. The version follows Semantic Versioning 2.0.0.

Variables

AllSteps is the array of all steps.

Functions

func Hangulize

func Hangulize(lang string, word string) string

Hangulize transcribes a non-Korean word into Hangul, which is the Korean alphabet.

For example, it will transcribe "Владивосто́к" in Russian into "블라디보스토크".

It is the simplest and most useful API of this package.

Example (Cappuccino)
fmt.Println(Hangulize("ita", "Cappuccino"))
Output:

카푸치노
Example (Nietzsche)
fmt.Println(Hangulize("deu", "Friedrich Wilhelm Nietzsche"))
Output:

프리드리히 빌헬름 니체
Example (ShinkaiMakoto)
// import "github.com/hangulize/hangulize/phonemize/furigana"
// UsePhonemizer(&furigana.P)

fmt.Println(Hangulize("jpn", "新海誠"))
Output:

신카이 마코토

func ListLangs

func ListLangs() []string

ListLangs returns the language name list of bundled specs. The bundled spec can be loaded by LoadSpec.

Example

Here are all supported languages.

for _, lang := range ListLangs() {
	fmt.Println(lang)
}
Output:

aze
bel
bul
cat
ces
chi
cym
deu
ell
epo
est
fin
grc
hbs
hun
isl
ita
jpn
jpn-ck
kat-1
kat-2
lat
lav
lit
mkd
nld
pol
por
por-br
ron
rus
slk
slv
spa
sqi
swe
tur
ukr
vie
wlm

func UnloadSpec added in v0.2.11

func UnloadSpec(lang string)

UnloadSpec flushes a cached spec to free memory.

func UnusePhonemizer added in v0.2.5

func UnusePhonemizer(id string) bool

UnusePhonemizer discards a global phonemizer.

func UsePhonemizer added in v0.2.5

func UsePhonemizer(p Phonemizer) bool

UsePhonemizer keeps a phonemizer ready for global use.

Types

type Config

type Config struct {
	Authors []string
	Stage   string
}

Config keeps some configurations for a transcription specification.

type Hangulizer

type Hangulizer struct {
	// contains filtered or unexported fields
}

Hangulizer provides the transcription logic for the underlying spec.

func NewHangulizer

func NewHangulizer(spec *Spec) *Hangulizer

NewHangulizer creates a Hangulizer for a spec.

Example
spec, _ := LoadSpec("nld")
h := NewHangulizer(spec)

fmt.Println(h.Hangulize("Vincent van Gogh"))
Output:

빈센트 반고흐

func (*Hangulizer) GetPhonemizer added in v0.2.5

func (h *Hangulizer) GetPhonemizer(id string) (Phonemizer, bool)

GetPhonemizer returns a phonemizer by the ID.

func (*Hangulizer) Hangulize

func (h *Hangulizer) Hangulize(word string) string

Hangulize transcribes a loanword into Hangul.

func (*Hangulizer) HangulizeTrace

func (h *Hangulizer) HangulizeTrace(word string) (string, Traces)

HangulizeTrace transcribes a loanword into Hangul and returns the traced internal events too.

func (*Hangulizer) Spec

func (h *Hangulizer) Spec() *Spec

Spec returns the underlying spec.

func (*Hangulizer) UnusePhonemizer added in v0.2.5

func (h *Hangulizer) UnusePhonemizer(id string) bool

UnusePhonemizer discards a phonemizer.

func (*Hangulizer) UsePhonemizer added in v0.2.5

func (h *Hangulizer) UsePhonemizer(p Phonemizer) bool

UsePhonemizer keeps a phonemizer ready for use.

type Language

type Language struct {
	ID         string    // Arbitrary, but identifiable language ID.
	Codes      [2]string // [0]: ISO 639-1 code, [1]: ISO 639-3 code
	English    string    // The language name in English.
	Korean     string    // The language name in Korean.
	Script     string
	Phonemizer string
}

Language identifies a natural language.

func (Language) String

func (l Language) String() string

type Phonemizer added in v0.2.5

type Phonemizer interface {
	ID() string
	Phonemize(string) string
}

Phonemizer is an interface to guess phonograms from a spelling based on lexical analysis.

Lexical analysis may require a large amount of dictionary data. To keep Hangulize lightweight, phonemizers are implemented outside this package.

For example, there is a phonemizer for Furigana of Japanese in a separate package.

import "github.com/hangulize/hangulize"
import "github.com/hangulize/hangulize/phonemize/furigana"

hangulize.UsePhonemizer(&furigana.P)
fmt.Println(hangulize.Hangulize("jpn", "日本語"))
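
To show what satisfying the interface involves, here is a minimal sketch of a dictionary-backed phonemizer. The Phonemizer interface is re-declared locally so that the example compiles on its own; tableP, its ID "toy-table", and its one-entry dictionary are hypothetical and far simpler than a real lexical analyzer.

```go
package main

import "fmt"

// Phonemizer is a local copy of the interface documented above,
// re-declared so this sketch does not depend on the hangulize module.
type Phonemizer interface {
	ID() string
	Phonemize(string) string
}

// tableP is a hypothetical phonemizer backed by a fixed lookup table.
type tableP struct {
	readings map[string]string
}

// ID returns the identifier under which the phonemizer would be registered.
func (p *tableP) ID() string { return "toy-table" }

// Phonemize returns the known reading, or the word unchanged if unknown.
func (p *tableP) Phonemize(word string) string {
	if reading, ok := p.readings[word]; ok {
		return reading
	}
	return word
}

// Compile-time check that *tableP satisfies the interface.
var _ Phonemizer = (*tableP)(nil)

func main() {
	var p Phonemizer = &tableP{readings: map[string]string{"日本語": "ニホンゴ"}}
	fmt.Println(p.ID(), p.Phonemize("日本語")) // toy-table ニホンゴ
}
```

A value like this could presumably be passed to UsePhonemizer, since that function accepts any Phonemizer, but how a spec selects a phonemizer by ID is not shown here.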

func GetPhonemizer added in v0.2.5

func GetPhonemizer(id string) (Phonemizer, bool)

GetPhonemizer returns a global phonemizer by the ID.

type Rule

type Rule struct {
	ID   int
	From *hre.Pattern
	To   *hre.RPattern
}

Rule is a pair of Pattern and RPattern.

func (Rule) Replace added in v0.3.0

func (r Rule) Replace(word string) string

Replace matches the word with the Pattern and replaces with the RPattern.

func (Rule) String

func (r Rule) String() string

type Spec

type Spec struct {
	// Meta information sections
	Lang   Language
	Config Config

	// Helper setting sections
	Macros    map[string]string
	Vars      map[string][]string
	Normalize map[string][]string

	// Rewrite/Transcribe
	Rewrite    []Rule
	Transcribe []Rule

	// Test examples
	Test [][2]string

	// Source code
	Source string
	// contains filtered or unexported fields
}

Spec represents a transcription specification for a language.
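
The Test field holds (input, expected) pairs, which suggests a simple table-driven check. The sketch below is one way such a check could be written; checkExamples and the fake transcriber are hypothetical and do not use the hangulize module, so any function of type func(string) string can be plugged in.

```go
package main

import "fmt"

// checkExamples runs f over every (input, expected) pair and returns one
// message per mismatch. The [][2]string shape matches the Test field above.
func checkExamples(examples [][2]string, f func(string) string) []string {
	var failures []string
	for _, ex := range examples {
		if got := f(ex[0]); got != ex[1] {
			failures = append(failures,
				fmt.Sprintf("%q: got %q, want %q", ex[0], got, ex[1]))
		}
	}
	return failures
}

func main() {
	examples := [][2]string{
		{"gita", "지타"},
		{"bisnonno", "비스논노"},
	}
	// fake is a stand-in transcriber that only knows one word.
	fake := func(word string) string {
		if word == "gita" {
			return "지타"
		}
		return "?"
	}
	for _, msg := range checkExamples(examples, fake) {
		fmt.Println(msg)
	}
}
```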

func LoadSpec

func LoadSpec(lang string) (*Spec, bool)

LoadSpec finds a bundled spec by the given language name. Once it loads a spec, it will cache the spec.

func ParseSpec

func ParseSpec(r io.Reader) (*Spec, error)

ParseSpec parses a Spec from an HGL source.

func (Spec) GoString added in v0.3.1

func (s Spec) GoString() string

GoString implements GoStringer for Spec.

func (Spec) String

func (s Spec) String() string

type Step added in v0.3.0

type Step int

Step is an identifier for each pipeline step.

const (

	// Input step just records the beginning.
	Input Step

	// Phonemize step converts the spelling to the phonograms.
	Phonemize

	// Normalize step eliminates letter case to make the next steps work easier.
	Normalize

	// Group step associates meaningful letters.
	Group

	// Rewrite step minimizes the gap between pronunciation and spelling.
	Rewrite

	// Transcribe step determines Hangul spelling for the pronunciation.
	Transcribe

	// Syllabify step composes Jamo phonemes into Hangul syllabic blocks.
	Syllabify

	// Transliterate step converts foreign punctuations to fit in Korean.
	Transliterate
)

func (Step) String added in v0.3.0

func (s Step) String() string

type Trace

type Trace struct {
	Step Step
	Word string

	Why string

	Rule    Rule
	HasRule bool
}

Trace is emitted when a replacement occurs. It is used for tracing the internals of the Hangulize pipeline.

func (Trace) String

func (t Trace) String() string

type Traces added in v0.3.0

type Traces []Trace

Traces is a slice of Trace.

func (Traces) Render added in v0.3.0

func (ts Traces) Render(w io.Writer)

Render generates a report text.
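
For readers unfamiliar with this pattern, the following sketch shows how a step enum, a trace record, and a Render method writing to an io.Writer can fit together. All types here are local stand-ins with reduced fields; the iota-based values, the step names, and the "[Step] word" report format are assumptions and need not match the package's actual output.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Step is a local stand-in; the real constant values are not documented here.
type Step int

const (
	Input Step = iota
	Phonemize
	Normalize
	Group
	Rewrite
	Transcribe
	Syllabify
	Transliterate
)

var stepNames = [...]string{
	"Input", "Phonemize", "Normalize", "Group",
	"Rewrite", "Transcribe", "Syllabify", "Transliterate",
}

// String returns the step name, or a numeric fallback for unknown values.
func (s Step) String() string {
	if s < 0 || int(s) >= len(stepNames) {
		return fmt.Sprintf("Step(%d)", int(s))
	}
	return stepNames[s]
}

// Trace is a reduced stand-in with only the Step and Word fields.
type Trace struct {
	Step Step
	Word string
}

// Traces is a slice of Trace.
type Traces []Trace

// Render writes one "[Step] word" line per trace (hypothetical format).
func (ts Traces) Render(w io.Writer) {
	for _, t := range ts {
		fmt.Fprintf(w, "[%s] %s\n", t.Step, t.Word)
	}
}

// renderString collects the rendered report into a string.
func renderString(ts Traces) string {
	var b strings.Builder
	ts.Render(&b)
	return b.String()
}

func main() {
	ts := Traces{
		{Step: Input, Word: "Hello!"},
		{Step: Normalize, Word: "hello!"},
		{Step: Syllabify, Word: "헬로!"},
	}
	fmt.Print(renderString(ts))
}
```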

Directories

Path                  Synopsis
internal/jamo         Package jamo implements a Hangul composer.
internal/subword      Package subword implements a word replacement with a level.
phonemize/furigana    Package furigana implements the hangulize.Phonemizer interface for Japanese Kanji.
phonemize/pinyin      Package pinyin implements the hangulize.Phonemizer interface for Chinese Hanzi.
