Documentation ¶
Overview ¶
Package hangulize transcribes non-Korean words into Hangul.
"Hello!" -> "헬로!"
Hangulize was inspired by Brian Jongseong Park (http://iceager.egloos.com/2610028). Based on this idea, the original Hangulize was developed in Python and released in 2010 (https://github.com/sublee/hangulize). Since then, serving as a web application on https://hangulize.org/, it has been of great help to Korean translators.
This Go re-implementation is a reboot of Hangulize with feature improvements.
Pipeline ¶
Hangulize transcribes in five steps: "Normalize", "Group", "Rewrite", "Transcribe", and "Syllabify". To clarify these concepts, consider an imaginary example that transcribes the English "Hello!" into "헬로!" (English is not actually supported yet).
First, Hangulize normalizes letter cases:
"Hello!" -> "hello!"
Then it groups letters by meaning:
"hello!" -> "hello", "!"
After that, the grouped chunks are rewritten according to rules specific to the source language. This step usually minimizes the differences between pronunciation and spelling:
"hello", "!" -> "heˈlō", "!"
And it transcribes rewritten chunks into Hangul Jamo phonemes.
"heˈlō", "!" -> "ㅎㅔ-ㄹㄹㅗ", "!"
Finally, it composes Jamo phonemes into Hangul syllabic blocks and joins all groups.
"ㅎㅔ-ㄹㄹㅗ", "!" -> "헬로!"
Extended Pipeline ¶
Some languages, such as Japanese, may require two more steps: "Phonemize" and "Transliterate". The former runs before the Normalize step, and the latter runs after the Syllabify step.
Japanese uses Kanji, which are ideograms. The Kanji-to-Kana mapping is called Furigana. To get Furigana from Kanji, we need a lexical analysis based on several dictionaries. The Phonemize step guesses the phonograms from a spelling based on lexical analysis.
"日本語" -> "ニホンゴ"
Furthermore, Japanese uses full-width characters for punctuation, while Korean and European languages use half-width ones. Full-width punctuation needs to be replaced with its half-width counterpart and a space to produce a natural-looking Korean word. The Transliterate step performs this replacement.
"이마、아이니유키마스" -> "이마, 아이니유키마스"
Spec ¶
A spec is written in the HGL format, which is a configuration DSL for Hangulize 2. One spec describes one language transcription system. So we need to describe the language first:
lang:
    id      = "ita"
    codes   = "it", "ita" # ISO 639-1 and 3 codes
    english = "Italian"
    korean  = "이탈리아어"
    script  = "roman"
Then write about yourself and the stage of this spec:
config:
    author = "John Doe <john@example.com>"
    stage  = "draft"
We will soon write many patterns in rewrite/transcribe rules. Some expressions may appear many times. To avoid repeating ourselves, we can use variables and macros.
A variable is a set of letters. A variable in a pattern matches any one of its letters. The variable "foo" can be referenced as "<foo>" in patterns.
vars:
    "vowels" = "a", "e", "i", "o", "u"
A macro expression is replaced with its target before the patterns are parsed. "@" is the common macro for the "<vowels>" variable:
macros:
    "@" = "<vowels>"
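The expansion order described above (macros first, then variables) can be sketched as plain string substitution. This is not how the "hre" package is implemented; the expand helper is hypothetical, and turning a variable into a regular-expression alternation is an assumption made so that the result is easy to inspect.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// varRef matches a variable reference such as "<vowels>".
var varRef = regexp.MustCompile(`<[^<>]+>`)

// expand first substitutes macros and then turns every known variable
// reference into an alternation of its letters.
func expand(pattern string, macros map[string]string, vars map[string][]string) string {
	for src, dst := range macros {
		pattern = strings.ReplaceAll(pattern, src, dst)
	}
	return varRef.ReplaceAllStringFunc(pattern, func(ref string) string {
		letters, ok := vars[strings.Trim(ref, "<>")]
		if !ok {
			return ref // unknown variable: leave the reference untouched
		}
		quoted := make([]string, len(letters))
		for i, l := range letters {
			quoted[i] = regexp.QuoteMeta(l)
		}
		return "(?:" + strings.Join(quoted, "|") + ")"
	})
}

func main() {
	macros := map[string]string{"@": "<vowels>"}
	vars := map[string][]string{"vowels": {"a", "e", "i", "o", "u"}}
	fmt.Println(expand("gn@", macros, vars)) // gn(?:a|e|i|o|u)
}
```

With more than one macro, Go's map iteration order is not fixed, so a real implementation would need a defined order if one macro's target could contain another macro.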
Now we can write "rewrite" rules. A rule consists of a Pattern and an RPattern. The Pattern matches letters in a word. The RPattern describes how the matched letters should be replaced. The word rewritten by one rule becomes the input for the next rule:
rewrite:
    "^gli$"   -> "li"
    "^gli{@}" -> "li"
    "{@}gli"  -> "li"
    "gn{@}"   -> "nJ"
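The chaining behavior, where each rule receives the output of the previous one, can be sketched with Go's standard regexp package. This is only an approximation: HRE is not Go regexp syntax, and the sketch assumes that "{@}" stands for a neighboring vowel that must be kept, which it emulates with capture groups. The rule type and the rewrite helper are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a compiled pattern with its replacement template.
type rule struct {
	from *regexp.Regexp
	to   string
}

// rewriteRules approximates the four rules above. The neighboring vowel is
// captured and re-emitted because Go's regexp has no lookaround.
var rewriteRules = []rule{
	{regexp.MustCompile(`^gli$`), "li"},
	{regexp.MustCompile(`^gli([aeiou])`), "li${1}"},
	{regexp.MustCompile(`([aeiou])gli`), "${1}li"},
	{regexp.MustCompile(`gn([aeiou])`), "nJ${1}"},
}

// rewrite applies the rules in order; each rule sees the previous result.
func rewrite(word string) string {
	for _, r := range rewriteRules {
		word = r.from.ReplaceAllString(word, r.to)
	}
	return word
}

func main() {
	fmt.Println(rewrite("gli"))     // li
	fmt.Println(rewrite("figlio"))  // filio
	fmt.Println(rewrite("gnocchi")) // nJocchi
}
```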
Pattern is based on regular expressions but has its own custom syntax. We call it "HRE", which means "Hangulize-specific Regular Expression". For details, see the documentation of "github.com/hangulize/hre".
"transcribe" rules have exactly the same form as "rewrite" rules, but their RPatterns represent Hangul Jamo phonemes. In contrast to "rewrite", a replaced word does not become the input for the next rules:
transcribe:
    "b" -> "ㅂ"
    "d" -> "ㄷ"
    "f" -> "ㅍ"
    "g" -> "ㄱ"
Finally, we should write expected transcription examples. They are used for unit testing. Verify your spec yourself:
test:
    "allegretto" -> "알레그레토"
    "gita"       -> "지타"
    "bisnonno"   -> "비스논노"
    "Pinocchio"  -> "피노키오"
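Checking such examples amounts to a simple loop over word pairs. The sketch below assumes the examples are available as [][2]string pairs and takes the transcriber as a plain function, so it runs without the package; the failures helper and the stub transcriber are hypothetical.

```go
package main

import "fmt"

// failures runs each example through transcribe and reports mismatches.
func failures(transcribe func(string) string, examples [][2]string) []string {
	var out []string
	for _, ex := range examples {
		if got := transcribe(ex[0]); got != ex[1] {
			out = append(out, fmt.Sprintf("%q: got %q, want %q", ex[0], got, ex[1]))
		}
	}
	return out
}

func main() {
	// stub stands in for a real transcriber and only knows one word.
	stub := func(word string) string {
		known := map[string]string{"gita": "지타"}
		return known[word]
	}
	examples := [][2]string{
		{"gita", "지타"},
		{"bisnonno", "비스논노"},
	}
	for _, f := range failures(stub, examples) {
		fmt.Println(f)
	}
}
```

In real use, the stub would be replaced by a transcriber built from the spec under test.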
Example ¶
// Person names from http://iceager.egloos.com/2610028
fmt.Println(Hangulize("ron", "Cătălin Moroşanu"))
fmt.Println(Hangulize("nld", "Jerrel Venetiaan"))
fmt.Println(Hangulize("por", "Vítor Constâncio"))
Output:
커털린 모로샤누
예럴 페네티안
비토르 콘스탄시우
Index ¶
- Constants
- Variables
- func Hangulize(lang string, word string) string
- func ListLangs() []string
- func UnloadSpec(lang string)
- func UnusePhonemizer(id string) bool
- func UsePhonemizer(p Phonemizer) bool
- type Config
- type Hangulizer
- func (h *Hangulizer) GetPhonemizer(id string) (Phonemizer, bool)
- func (h *Hangulizer) Hangulize(word string) string
- func (h *Hangulizer) HangulizeTrace(word string) (string, Traces)
- func (h *Hangulizer) Spec() *Spec
- func (h *Hangulizer) UnusePhonemizer(id string) bool
- func (h *Hangulizer) UsePhonemizer(p Phonemizer) bool
- type Language
- type Phonemizer
- type Rule
- type Spec
- type Step
- type Trace
- type Traces
Constants ¶
const Version = "0.3.5"
Version is the version number of the Hangulize package. The version follows Semantic Versioning 2.0.0.
Variables ¶
var AllSteps = []Step{
	Input,
	Phonemize,
	Normalize,
	Group,
	Rewrite,
	Transcribe,
	Syllabify,
	Transliterate,
}
AllSteps is the slice of all steps, in pipeline order.
Functions ¶
func Hangulize ¶
func Hangulize(lang string, word string) string
Hangulize transcribes a non-Korean word into Hangul, which is the Korean alphabet.
For example, it will transcribe "Владивосто́к" in Russian into "블라디보스토크".
It is the simplest and most useful API of this package.
Example (Cappuccino) ¶
fmt.Println(Hangulize("ita", "Cappuccino"))
Output: 카푸치노
Example (Nietzsche) ¶
fmt.Println(Hangulize("deu", "Friedrich Wilhelm Nietzsche"))
Output: 프리드리히 빌헬름 니체
Example (ShinkaiMakoto) ¶
// import "github.com/hangulize/hangulize/phonemize/furigana"
// UsePhonemizer(&furigana.P)
fmt.Println(Hangulize("jpn", "新海誠"))
Output: 신카이 마코토
func ListLangs ¶
func ListLangs() []string
ListLangs returns the list of language names of the bundled specs. A bundled spec can be loaded with LoadSpec.
Example ¶
Here are all the supported languages.
for _, lang := range ListLangs() {
	fmt.Println(lang)
}
Output: aze bel bul cat ces chi cym deu ell epo est fin grc hbs hun isl ita jpn jpn-ck kat-1 kat-2 lat lav lit mkd nld pol por por-br ron rus slk slv spa sqi swe tur ukr vie wlm
func UnloadSpec ¶ added in v0.2.11
func UnloadSpec(lang string)
UnloadSpec flushes a cached spec to free memory.
func UnusePhonemizer ¶ added in v0.2.5
func UnusePhonemizer(id string) bool
UnusePhonemizer discards a global phonemizer.
func UsePhonemizer ¶ added in v0.2.5
func UsePhonemizer(p Phonemizer) bool
UsePhonemizer keeps a phonemizer ready for global use.
Types ¶
type Hangulizer ¶
type Hangulizer struct {
// contains filtered or unexported fields
}
Hangulizer provides the transcription logic for the underlying spec.
func NewHangulizer ¶
func NewHangulizer(spec *Spec) *Hangulizer
NewHangulizer creates a Hangulizer for a spec.
Example ¶
spec, _ := LoadSpec("nld")
h := NewHangulizer(spec)
fmt.Println(h.Hangulize("Vincent van Gogh"))
Output: 빈센트 반고흐
func (*Hangulizer) GetPhonemizer ¶ added in v0.2.5
func (h *Hangulizer) GetPhonemizer(id string) (Phonemizer, bool)
GetPhonemizer returns a phonemizer by the ID.
func (*Hangulizer) Hangulize ¶
func (h *Hangulizer) Hangulize(word string) string
Hangulize transcribes a loanword into Hangul.
func (*Hangulizer) HangulizeTrace ¶
func (h *Hangulizer) HangulizeTrace(word string) (string, Traces)
HangulizeTrace transcribes a loanword into Hangul and returns the traced internal events too.
func (*Hangulizer) UnusePhonemizer ¶ added in v0.2.5
func (h *Hangulizer) UnusePhonemizer(id string) bool
UnusePhonemizer discards a phonemizer.
func (*Hangulizer) UsePhonemizer ¶ added in v0.2.5
func (h *Hangulizer) UsePhonemizer(p Phonemizer) bool
UsePhonemizer keeps a phonemizer ready for use by this Hangulizer.
type Language ¶
type Language struct {
	ID         string    // Arbitrary, but identifiable language ID.
	Codes      [2]string // [0]: ISO 639-1 code, [1]: ISO 639-3 code
	English    string    // The language name in English.
	Korean     string    // The language name in Korean.
	Script     string
	Phonemizer string
}
Language identifies a natural language.
type Phonemizer ¶ added in v0.2.5
Phonemizer is an interface to guess phonograms from a spelling based on lexical analysis.
The lexical analysis may require a large amount of dictionary data. To keep Hangulize lightweight, phonemizers are implemented outside this package.
For example, there is a phonemizer for Furigana of Japanese in a separate package.
import "github.com/hangulize/hangulize"
import "github.com/hangulize/hangulize/phonemize/furigana"

hangulize.UsePhonemizer(&furigana.P)
fmt.Println(hangulize.Hangulize("jpn", "日本語"))
func GetPhonemizer ¶ added in v0.2.5
func GetPhonemizer(id string) (Phonemizer, bool)
GetPhonemizer returns a global phonemizer by the ID.
type Rule ¶
Rule is a pair of Pattern and RPattern.
type Spec ¶
type Spec struct {
	// Meta information sections
	Lang   Language
	Config Config

	// Helper setting sections
	Macros    map[string]string
	Vars      map[string][]string
	Normalize map[string][]string

	// Rewrite/Transcribe
	Rewrite    []Rule
	Transcribe []Rule

	// Test examples
	Test [][2]string

	// Source code
	Source string
	// contains filtered or unexported fields
}
Spec represents a transcription specification for a language.
func LoadSpec ¶
LoadSpec finds a bundled spec by the given language name. Once it loads a spec, it will cache the spec.
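The load-once-then-cache behavior, together with the flush that UnloadSpec performs, can be pictured as a small map guarded by a mutex. This is a hypothetical sketch and not the package's implementation: the specCache type, its methods, and the use of a plain string in place of a parsed spec are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// specCache is a hypothetical cache keyed by language ID.
type specCache struct {
	mu    sync.Mutex
	specs map[string]string // language ID -> spec source
	reads int               // how many times the loader actually ran
}

// load returns the cached spec, reading it only on the first request.
func (c *specCache) load(lang string, read func(string) string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if src, ok := c.specs[lang]; ok {
		return src
	}
	if c.specs == nil {
		c.specs = make(map[string]string)
	}
	c.reads++
	src := read(lang)
	c.specs[lang] = src
	return src
}

// unload drops a cached spec so that its memory can be reclaimed.
func (c *specCache) unload(lang string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.specs, lang)
}

func main() {
	var c specCache
	read := func(lang string) string { return "lang: id = \"" + lang + "\"" }
	c.load("ita", read)
	c.load("ita", read) // served from the cache
	c.unload("ita")
	c.load("ita", read) // read again after the flush
	fmt.Println(c.reads) // 2
}
```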
type Step ¶ added in v0.3.0
type Step int
Step is an identifier for each pipeline step.
const (
	// Input step just records the beginning.
	Input Step

	// Phonemize step converts the spelling to the phonograms.
	Phonemize

	// Normalize step eliminates letter case to make the next steps work easier.
	Normalize

	// Group step associates meaningful letters.
	Group

	// Rewrite step minimizes the gap between pronunciation and spelling.
	Rewrite

	// Transcribe step determines Hangul spelling for the pronunciation.
	Transcribe

	// Syllabify step composes Jamo phonemes into Hangul syllabic blocks.
	Syllabify

	// Transliterate step converts foreign punctuations to fit in Korean.
	Transliterate
)
Directories ¶
Path | Synopsis |
---|---|
internal | |
internal/jamo | Package jamo implements a Hangul composer. |
internal/subword | Package subword implements a word replacement with a level. |
phonemize | |
phonemize/furigana | Package furigana implements the hangulize.Phonemizer interface for Japanese Kanji. |
phonemize/pinyin | Package pinyin implements the hangulize.Phonemizer interface for Chinese Hanzi. |