cedict

Version: v1.0.2 (Latest)

Warning: This package is not in the latest version of its module.

Published: Jan 3, 2024 License: MIT Imports: 17 Imported by: 0

README

cedict: Chinese-English dictionary Go package (漢英詞典 Go 軟件包)

Overview

Go library for the community-maintained Chinese-English dictionary (CC-CEDICT), published by MDBG.

https://www.mdbg.net/chinese/dictionary?page=cedict

The basic format of a CEDICT entry is:

Traditional Simplified [pin1 yin1] /American English equivalent 1/equivalent 2/
漢字 汉字 [han4 zi4] /Chinese character/CL:個|个/
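
The line format above can be split apart with the standard library alone. The following is a minimal sketch, not the package's own parser: the entry type and the parseLine function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a hypothetical struct for this sketch; it mirrors the parts of a
// CC-CEDICT line but is not the package's own Entry type.
type entry struct {
	Traditional string
	Simplified  string
	Pinyin      string
	Meanings    []string
}

// parseLine splits one CC-CEDICT line into its four parts.
func parseLine(line string) (entry, error) {
	line = strings.TrimSpace(line)
	open := strings.Index(line, "[")
	end := strings.Index(line, "]")
	if open < 0 || end < open {
		return entry{}, fmt.Errorf("missing pinyin brackets: %q", line)
	}
	// Everything before "[" is the traditional and simplified forms.
	hanzi := strings.Fields(line[:open])
	if len(hanzi) != 2 {
		return entry{}, fmt.Errorf("expected traditional and simplified forms: %q", line)
	}
	// Definitions are slash-delimited, with a leading and trailing slash.
	defs := strings.Trim(strings.TrimSpace(line[end+1:]), "/")
	if defs == "" {
		return entry{}, fmt.Errorf("no definitions: %q", line)
	}
	return entry{
		Traditional: hanzi[0],
		Simplified:  hanzi[1],
		Pinyin:      line[open+1 : end],
		Meanings:    strings.Split(defs, "/"),
	}, nil
}

func main() {
	e, err := parseLine("漢字 汉字 [han4 zi4] /Chinese character/CL:個|个/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s %s %s %q\n", e.Traditional, e.Simplified, e.Pinyin, e.Meanings)
}
```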

Install

First, fetch the latest version of the package:

go get -u github.com/Destaq/cedict

Next, include it in your application:

import "github.com/Destaq/cedict"

Getting Started

d := cedict.New()
fmt.Printf("%s\n", cedict.PinyinTones(d.HanziToPinyin("你好,世界!")))

Contributing

  1. Fork the repo
  2. Clone your fork (git clone https://github.com/<username>/cedict && cd cedict)
  3. Create your own branch (git checkout -b my-patch)
  4. Make changes and add them (git add .)
  5. Commit your changes (git commit -m 'Fixed #123')
  6. Push to the branch (git push origin my-patch)
  7. Create a new pull request

License

Copyright 2020 John Cramb. All rights reserved.

Licensed under the MIT License. See LICENSE in the project root for license information.

Documentation

Constants

const (

	// URL of the latest CC-CEDICT data as a gzip-compressed (.txt.gz) text file.
	URL = "https://www.mdbg.net/chinese/export/cedict/cedict_1_0_ts_utf-8_mdbg.txt.gz"

	// LineEnding used by Save(), defaults to "\r\n" to match original content.
	LineEnding = "\r\n"

	// MaxResults is the maximum number of entries returned by any Dict method.
	MaxResults = 50

	// MaxLD controls the maximum Levenshtein distance allowed for matches.
	MaxLD = 10
)
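
For readers unfamiliar with the metric behind MaxLD, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, counted in runes so that hanzi and accented pinyin each count as one character. The source does not say how the package computes the distance, so the levenshtein and min3 functions are illustrative assumptions only.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b, counted in runes.
// It keeps only two rows of the dynamic-programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))
	fmt.Println(levenshtein("美國人", "美国人"))
}
```

A caller could then accept a candidate only when its distance to the query does not exceed a threshold such as MaxLD.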

Variables

This section is empty.

Functions

func ConvertSymbols

func ConvertSymbols(s string) string

ConvertSymbols replaces common hanzi symbols with Latin symbols.

func Download

func Download() (io.ReadCloser, error)

Download returns an io.ReadCloser for the latest CC-CEDICT archive from MDBG. The file is updated regularly but is relatively small, at approximately 4 MB.

func FixSymbolSpaces

func FixSymbolSpaces(s string) string

FixSymbolSpaces removes spaces added by HanziToPinyin conversion and makes the string look more natural.

func IsHanzi

func IsHanzi(s string) bool

IsHanzi returns true if the string contains only Han characters. See Han unification: http://www.unicode.org/reports/tr38/tr38-27.html
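
A check of this kind can be written with the standard library's unicode.Han range table. This is a minimal sketch under assumptions: the isHanzi name is hypothetical, and treating the empty string as false is a choice made here, not documented behaviour of the package.

```go
package main

import (
	"fmt"
	"unicode"
)

// isHanzi reports whether s is non-empty and every rune in it belongs to
// the Han script. Returning false for "" is an assumption of this sketch.
func isHanzi(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.Is(unicode.Han, r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"你好", "你好!", "hello"} {
		fmt.Printf("%q: %v\n", s, isHanzi(s))
	}
}
```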

func PinyinPlaintext

func PinyinPlaintext(s string) string

PinyinPlaintext returns the pinyin string without tone marks or tone numbers.

func PinyinToneNums

func PinyinToneNums(s string) string

PinyinToneNums returns the pinyin string with tone marks converted to tone numbers.

func PinyinTones

func PinyinTones(s string) string

PinyinTones returns the pinyin string with tone numbers converted to tone marks. It supports both the CC-CEDICT format, with tone numbers at the end of each syllable (e.g. Zhong1 wen2), and the inline format, with tone numbers after their respective character (e.g. Zho1ng we2n).
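
To illustrate the conversion, here is a minimal sketch that handles only the CC-CEDICT format (tone number at the end of each syllable), not the inline format. It applies the usual pinyin placement rule: the mark goes on a or e if present, on o in "ou", and otherwise on the last vowel. The pinyinTones, markSyllable and markIndex names are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toneMarks maps each lowercase vowel to its forms for tones 1 to 4.
var toneMarks = map[rune][]rune{
	'a': []rune("āáǎà"),
	'e': []rune("ēéěè"),
	'i': []rune("īíǐì"),
	'o': []rune("ōóǒò"),
	'u': []rune("ūúǔù"),
	'ü': []rune("ǖǘǚǜ"),
}

// markIndex returns the index of the vowel that carries the tone mark,
// or -1 if the syllable has no vowel.
func markIndex(runes []rune) int {
	lastVowel := -1
	for i, r := range runes {
		switch unicode.ToLower(r) {
		case 'a', 'e':
			return i
		case 'o':
			if i+1 < len(runes) && unicode.ToLower(runes[i+1]) == 'u' {
				return i
			}
			lastVowel = i
		case 'i', 'u', 'ü':
			lastVowel = i
		}
	}
	return lastVowel
}

// markSyllable converts one syllable such as "guo2" to "guó".
// Syllables without a trailing digit 1-5 are returned unchanged.
func markSyllable(syl string) string {
	if syl == "" {
		return syl
	}
	last := syl[len(syl)-1]
	if last < '1' || last > '5' {
		return syl
	}
	tone := int(last - '0')
	// CC-CEDICT writes ü as "u:".
	runes := []rune(strings.ReplaceAll(syl[:len(syl)-1], "u:", "ü"))
	idx := markIndex(runes)
	if tone == 5 || idx < 0 {
		return string(runes) // neutral tone carries no mark
	}
	marked := toneMarks[unicode.ToLower(runes[idx])][tone-1]
	if unicode.IsUpper(runes[idx]) {
		marked = unicode.ToUpper(marked)
	}
	runes[idx] = marked
	return string(runes)
}

// pinyinTones converts every space-separated syllable in s.
func pinyinTones(s string) string {
	syllables := strings.Fields(s)
	for i, syl := range syllables {
		syllables[i] = markSyllable(syl)
	}
	return strings.Join(syllables, " ")
}

func main() {
	fmt.Println(pinyinTones("Mei3 guo2 ren2"))
}
```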

func StripDigits

func StripDigits(s string) string

StripDigits returns the string with all unicode digits removed.
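
One way to express this with the standard library is strings.Map, which drops a rune whenever the mapping function returns a negative value. The stripDigits function below is a hypothetical sketch of that idea, not the package's source.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripDigits removes every Unicode decimal digit from s.
func stripDigits(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsDigit(r) {
			return -1 // a negative rune tells strings.Map to drop it
		}
		return r
	}, s)
}

func main() {
	fmt.Println(stripDigits("Ni3 hao3"))
}
```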

func StripTones

func StripTones(s string) string

StripTones returns the string with all nonspacing marks (Unicode category Mn) removed.

Types

type Dict

type Dict struct {
	// contains filtered or unexported fields
}

Dict represents an instance of the CC-CEDICT entries. By default, the latest version will be downloaded on creation.

Example (GetByPinyin)
d := New()
elements := d.GetByPinyin("mei guo ren")
for _, e := range elements {
	fmt.Printf("%s - %s\n", e.Traditional, FixSymbolSpaces(PinyinTones(e.Pinyin)))
}
Output:

美國人 - Měi guó rén

Example (HanziToPinyin)
d := New()
hans := "你喜歡學中文嗎?"
fmt.Printf("%s (plaintext) '%s'\n", hans, PinyinPlaintext(d.HanziToPinyin(hans)))
fmt.Printf("%s (tonenums)  '%s'\n", hans, d.HanziToPinyin(hans))
fmt.Printf("%s (tones)     '%s'\n", hans, FixSymbolSpaces(PinyinTones(d.HanziToPinyin(hans))))
Output:

你喜歡學中文嗎? (plaintext) 'Ni xi huan xue zhong wen ma ?'
你喜歡學中文嗎? (tonenums)  'Ni3 xi3 huan5 xue2 zhong1 wen2 ma2 ?'
你喜歡學中文嗎? (tones)     'Nǐ xǐ huan xué zhōng wén má?'

func Load

func Load(filename string) (*Dict, error)

Load returns a Dict loaded from a CC-CEDICT formatted file. This is provided for completeness, but I encourage you to use the default behaviour of downloading the latest dictionary each time.

func New

func New() *Dict

New returns a Dict immediately but downloads the latest CC-CEDICT data in the background. Dict methods can be safely called, but will block until parsing is complete.
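
The "return immediately, block on first use" behaviour described here is commonly built around a channel that is closed when loading finishes. The sketch below shows that general pattern; lazyDict, newLazyDict and Lookup are hypothetical names, and nothing in the source confirms that the package is implemented this way.

```go
package main

import (
	"fmt"
	"time"
)

// lazyDict is a hypothetical stand-in for a dictionary that loads in the
// background while the constructor returns at once.
type lazyDict struct {
	ready   chan struct{} // closed once loading has finished
	entries map[string]string
	err     error
}

// newLazyDict starts load in a goroutine and returns immediately.
func newLazyDict(load func() (map[string]string, error)) *lazyDict {
	d := &lazyDict{ready: make(chan struct{})}
	go func() {
		defer close(d.ready) // runs after entries and err are set
		d.entries, d.err = load()
	}()
	return d
}

// Err blocks until loading is done, then reports any load error.
func (d *lazyDict) Err() error {
	<-d.ready
	return d.err
}

// Lookup blocks until loading is done, then queries the entries.
func (d *lazyDict) Lookup(hanzi string) (string, bool) {
	<-d.ready
	v, ok := d.entries[hanzi]
	return v, ok
}

func main() {
	d := newLazyDict(func() (map[string]string, error) {
		time.Sleep(50 * time.Millisecond) // simulate a slow download
		return map[string]string{"你好": "ni3 hao3"}, nil
	})
	fmt.Println("constructor returned")
	if err := d.Err(); err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println(d.Lookup("你好"))
}
```

Receiving from a closed channel never blocks, so every call after loading completes returns without waiting.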

func Parse

func Parse(r io.Reader) (*Dict, error)

Parse creates a Dict instance from an io.Reader. It expects text input in the format described at https://cc-cedict.org/wiki/format:syntax

func (*Dict) DefaultFilename

func (d *Dict) DefaultFilename() string

DefaultFilename returns a filename in the CC-CEDICT format, constructed from the Dict's parsed metadata.

func (*Dict) Err

func (d *Dict) Err() error

Err blocks until the Dict is finished parsing and then returns any errors encountered during loading/download.

func (*Dict) GetAllByHanzi

func (d *Dict) GetAllByHanzi(s string) []*Entry

GetAllByHanzi returns all the Dict entries that match the hanzi. Supports input using traditional or simplified characters.

func (*Dict) GetByHanzi

func (d *Dict) GetByHanzi(s string) *Entry

GetByHanzi returns the Dict entry for the hanzi, if found. Supports input using traditional or simplified characters.

func (*Dict) GetByMeaning

func (d *Dict) GetByMeaning(s string) []*Entry

GetByMeaning returns entries containing the specified meaning. Matching is case-insensitive and can be exact or inexact.

func (*Dict) GetByPinyin

func (d *Dict) GetByPinyin(s string) []*Entry

GetByPinyin returns hanzi matching the given pinyin string. Supports pinyin in plaintext or with tones/tone numbers. With plaintext, all tone variations are considered matching.

func (*Dict) HanziToPinyin

func (d *Dict) HanziToPinyin(s string) string

HanziToPinyin converts hanzi to their pinyin representation. It implements greedy matching for longest character combos.
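
Greedy longest matching can be sketched as follows: at each position, try the longest candidate substring first and shrink until a dictionary hit is found. The toPinyin function, the tiny lexicon and the maxLen parameter are all hypothetical; the real method works against the full CC-CEDICT data and may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// toPinyin converts s using greedy longest-match lookups in lexicon.
// maxLen is the longest key length (in runes) worth trying.
// Runes with no match are passed through unchanged.
func toPinyin(lexicon map[string]string, maxLen int, s string) string {
	runes := []rune(s)
	var parts []string
	for i := 0; i < len(runes); {
		n := maxLen
		if rest := len(runes) - i; rest < n {
			n = rest
		}
		matched := false
		for ; n > 0; n-- {
			if py, ok := lexicon[string(runes[i:i+n])]; ok {
				parts = append(parts, py)
				i += n
				matched = true
				break
			}
		}
		if !matched {
			parts = append(parts, string(runes[i]))
			i++
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	// A toy lexicon for illustration only.
	lexicon := map[string]string{
		"你":  "ni3",
		"好":  "hao3",
		"你好": "ni3 hao3",
		"中":  "zhong1",
		"文":  "wen2",
		"中文": "Zhong1 wen2",
	}
	fmt.Println(toPinyin(lexicon, 2, "你好中文!"))
}
```

Preferring the longest match matters because a multi-character word can have a different reading or capitalisation than its characters taken one at a time, as with 中文 in the toy lexicon.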

func (*Dict) Metadata

func (d *Dict) Metadata() Metadata

Metadata returns the Dict's metadata parsed from header comments.

func (*Dict) Save

func (d *Dict) Save(filename string) error

Save writes the Dict to a file using the format at https://cc-cedict.org/wiki/format:syntax, and should be identical to the unpacked CC-CEDICT file download. Saved as gzip archive if filename ends in '.gz'.
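
Choosing the output encoding from the filename suffix can be done with compress/gzip from the standard library. The sketch below works on in-memory buffers rather than files so that it is easy to run; encode, decode and lineEnding are hypothetical names, with lineEnding mirroring the LineEnding constant.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// lineEnding mirrors the package's LineEnding constant.
const lineEnding = "\r\n"

// encode joins lines with lineEnding and gzips the result when filename
// ends in ".gz".
func encode(filename string, lines []string) ([]byte, error) {
	var buf bytes.Buffer
	var w io.Writer = &buf
	var zw *gzip.Writer
	if strings.HasSuffix(filename, ".gz") {
		zw = gzip.NewWriter(&buf)
		w = zw
	}
	for _, line := range lines {
		if _, err := io.WriteString(w, line+lineEnding); err != nil {
			return nil, err
		}
	}
	if zw != nil {
		// Close flushes the remaining compressed data and the gzip footer.
		if err := zw.Close(); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// decode reverses encode, again keyed on the filename suffix.
func decode(filename string, data []byte) (string, error) {
	if !strings.HasSuffix(filename, ".gz") {
		return string(data), nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	lines := []string{"# CC-CEDICT", "漢字 汉字 [han4 zi4] /Chinese character/"}
	for _, name := range []string{"cedict.txt", "cedict.txt.gz"} {
		data, err := encode(name, lines)
		if err != nil {
			fmt.Println("encode failed:", err)
			return
		}
		fmt.Printf("%s: %d bytes\n", name, len(data))
	}
}
```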

type Entry

type Entry struct {
	Traditional string
	Simplified  string
	Pinyin      string
	Meanings    []string
}

Entry represents a single entry in the CC-CEDICT dictionary.

func (*Entry) Marshal

func (e *Entry) Marshal() string

Marshal returns the entry formatted according to https://cc-cedict.org/wiki/format:syntax

func (*Entry) Unmarshal

func (e *Entry) Unmarshal(s string) error

Unmarshal populates the entry from input text formatted according to https://cc-cedict.org/wiki/format:syntax

type Metadata

type Metadata struct {
	Version    int
	Subversion int
	Format     string
	Charset    string
	Entries    int
	Publisher  string
	License    string
	Timestamp  time.Time
}

Metadata represents information embedded in the CC-CEDICT header.
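
As an illustration of how header fields could be read, the sketch below assumes the header uses comment lines of the form "#! key=value" and that the first non-comment line ends the header. The header struct, parseHeader function and sampleHeader text are hypothetical and cover only a subset of the fields listed above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// header is a hypothetical subset of the Metadata fields.
type header struct {
	Version    int
	Subversion int
	Format     string
	Charset    string
	Entries    int
}

// sampleHeader is illustrative input, not a real CC-CEDICT release.
const sampleHeader = `# CC-CEDICT
#! version=1
#! subversion=0
#! format=ts
#! charset=UTF-8
#! entries=3
漢字 汉字 [han4 zi4] /Chinese character/
`

// parseHeader reads "#! key=value" lines and stops at the first line that
// is not a comment.
func parseHeader(text string) (header, error) {
	var h header
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "#") {
			break // first non-comment line: the header is over
		}
		if !strings.HasPrefix(line, "#!") {
			continue // plain comment without metadata
		}
		kv := strings.SplitN(strings.TrimSpace(strings.TrimPrefix(line, "#!")), "=", 2)
		if len(kv) != 2 {
			continue
		}
		var err error
		switch kv[0] {
		case "version":
			h.Version, err = strconv.Atoi(kv[1])
		case "subversion":
			h.Subversion, err = strconv.Atoi(kv[1])
		case "entries":
			h.Entries, err = strconv.Atoi(kv[1])
		case "format":
			h.Format = kv[1]
		case "charset":
			h.Charset = kv[1]
		}
		if err != nil {
			return h, fmt.Errorf("bad %s value %q: %v", kv[0], kv[1], err)
		}
	}
	return h, sc.Err()
}

func main() {
	h, err := parseHeader(sampleHeader)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", h)
}
```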

Directories

Path Synopsis
cmd
