README

Whatlanggo

Natural language detection for Go.

Features

  • Supports 84 languages
  • 100% written in Go
  • No external dependencies
  • Fast
  • Recognizes the script (Latin, Cyrillic, etc.) as well as the language

Getting started

Installation:

    go get -u github.com/abadojack/whatlanggo

Simple usage example:

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Detect returns the language, the script, and a confidence score.
	info := whatlanggo.Detect("Foje funkcias kaj foje ne funkcias")
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script], "Confidence:", info.Confidence)
}

Blacklisting and whitelisting

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Blacklist: Yiddish is excluded, so this Hebrew-script text cannot be reported as Yiddish.
	options := whatlanggo.Options{
		Blacklist: map[whatlanggo.Lang]bool{
			whatlanggo.Ydd: true,
		},
	}

	info := whatlanggo.DetectWithOptions("האקדמיה ללשון העברית", options)

	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])

	// Whitelist: only Esperanto and Ukrainian are considered.
	options1 := whatlanggo.Options{
		Whitelist: map[whatlanggo.Lang]bool{
			whatlanggo.Epo: true,
			whatlanggo.Ukr: true,
		},
	}

	info = whatlanggo.DetectWithOptions("Mi ne scias", options1)
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])
}

For more details, please check the documentation.

Requirements

Go 1.8 or higher

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, a particular case of n-gram models. To understand the idea, see the original paper by Cavnar and Trenkle (1994), "N-Gram-Based Text Categorization".
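The idea can be sketched with the standard library alone. The block below is a minimal illustration of trigram profiles and the "out-of-place" rank distance described by Cavnar and Trenkle, not the code used by whatlanggo. The helper names (trigramCounts, rankedTrigrams, outOfPlace), the space padding, and the penalty value are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramCounts returns how often each 3-rune window occurs in text.
// The text is lowercased and padded with one space on each side so that
// word boundaries contribute trigrams too (an assumption of this sketch).
func trigramCounts(text string) map[string]int {
	runes := []rune(" " + strings.ToLower(text) + " ")
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// rankedTrigrams orders trigrams from most to least frequent,
// breaking ties alphabetically so the result is deterministic.
func rankedTrigrams(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	return keys
}

// outOfPlace sums, for every trigram of the text profile, the distance
// between its rank in the text and its rank in the language profile.
// Trigrams missing from the language profile cost maxPenalty.
func outOfPlace(text, lang []string, maxPenalty int) int {
	rank := make(map[string]int, len(lang))
	for i, t := range lang {
		rank[t] = i
	}
	total := 0
	for i, t := range text {
		r, ok := rank[t]
		if !ok {
			total += maxPenalty
			continue
		}
		d := i - r
		if d < 0 {
			d = -d
		}
		total += d
	}
	return total
}

func main() {
	// A toy "language profile" built from a single sentence.
	profile := rankedTrigrams(trigramCounts("the cat and the hat"))
	sample := rankedTrigrams(trigramCounts("the cat"))
	fmt.Println("distance:", outOfPlace(sample, profile, 300))
}
```

A detector built this way would compute the distance against one profile per language and pick the language with the smallest distance.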

How is IsReliable calculated?

It is based on the following factors:

  • The number of unique trigrams in the given text.
  • How large the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

It can therefore be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola and looks like this:

[Figure: Language recognition whatlang rust]

For more details, see the blog article Introduction to Rust Whatlang Library and Natural Language Identification Algorithms.
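A hyperbolic threshold of this kind could look like the sketch below. It only illustrates the shape of the idea: the function name confidence and the constants a and b are assumptions chosen for the example, not values confirmed by this README.

```go
package main

import "fmt"

// confidence maps the two factors (unique trigram count and rate) to a
// value in [0, 1]. The threshold a/uniqueTrigrams + b is a hyperbola in
// the trigram count: short texts need a larger rate to be trusted.
// a and b are illustrative constants supplied by the caller.
func confidence(uniqueTrigrams int, rate, a, b float64) float64 {
	if uniqueTrigrams <= 0 || rate <= 0 {
		return 0
	}
	threshold := a/float64(uniqueTrigrams) + b
	if rate >= threshold {
		// On or above the curve: fully confident.
		return 1
	}
	// Below the curve: scale proportionally.
	return rate / threshold
}

func main() {
	// Same rate, different text lengths (hypothetical numbers).
	fmt.Printf("%.2f\n", confidence(10, 0.1, 3, 0.015))
	fmt.Printf("%.2f\n", confidence(100, 0.1, 3, 0.015))
}
```

With a curve like this, the same rate is treated as more trustworthy when the text contains more unique trigrams.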

License

MIT

Derivation

whatlanggo is a derivative of Franc (JavaScript, MIT) by Titus Wormer.

Acknowledgements

Thanks to greyblake (Potapov Sergey) for creating whatlang-rs, from which I got the idea and algorithms.

Documentation

Overview

Package whatlanggo detects natural languages and scripts (writing systems). Languages are represented by a fixed list of constants, while scripts are represented by *unicode.RangeTable.

Index

Constants

const ReliableConfidenceThreshold = 0.8

ReliableConfidenceThreshold is the confidence rating that has to be exceeded for the language detection to be considered reliable.

Variables

var Langs = map[Lang]string{
	Afr: "Afrikaans",
	Aka: "Akan",
	Amh: "Amharic",
	Arb: "Arabic",
	Azj: "Azerbaijani",
	Bel: "Belarusian",
	Ben: "Bengali",
	Bho: "Bhojpuri",
	Bul: "Bulgarian",
	Ceb: "Cebuano",
	Ces: "Czech",
	Cmn: "Mandarin",
	Dan: "Danish",
	Deu: "German",
	Ell: "Greek",
	Eng: "English",
	Epo: "Esperanto",
	Est: "Estonian",
	Fin: "Finnish",
	Fra: "French",
	Guj: "Gujarati",
	Hat: "Haitian Creole",
	Hau: "Hausa",
	Heb: "Hebrew",
	Hin: "Hindi",
	Hrv: "Croatian",
	Hun: "Hungarian",
	Ibo: "Igbo",
	Ilo: "Ilocano",
	Ind: "Indonesian",
	Ita: "Italian",
	Jav: "Javanese",
	Jpn: "Japanese",
	Kan: "Kannada",
	Kat: "Georgian",
	Khm: "Khmer",
	Kin: "Kinyarwanda",
	Kor: "Korean",
	Kur: "Kurdish",
	Lav: "Latvian",
	Lit: "Lithuanian",
	Mai: "Maithili",
	Mal: "Malayalam",
	Mar: "Marathi",
	Mkd: "Macedonian",
	Mlg: "Malagasy",
	Mya: "Burmese",
	Nep: "Nepali",
	Nld: "Dutch",
	Nno: "Nynorsk",
	Nob: "Bokmal",
	Nya: "Chewa",
	Ori: "Oriya",
	Orm: "Oromo",
	Pan: "Punjabi",
	Pes: "Persian",
	Pol: "Polish",
	Por: "Portuguese",
	Ron: "Romanian",
	Run: "Rundi",
	Rus: "Russian",
	Sin: "Sinhalese",
	Skr: "Saraiki",
	Slv: "Slovene",
	Sna: "Shona",
	Som: "Somali",
	Spa: "Spanish",
	Srp: "Serbian",
	Swe: "Swedish",
	Tam: "Tamil",
	Tel: "Telugu",
	Tgl: "Tagalog",
	Tha: "Thai",
	Tir: "Tigrinya",
	Tuk: "Turkmen",
	Tur: "Turkish",
	Uig: "Uyghur",
	Ukr: "Ukrainian",
	Urd: "Urdu",
	Uzb: "Uzbek",
	Vie: "Vietnamese",
	Ydd: "Yiddish",
	Yor: "Yoruba",
	Zul: "Zulu",
}

Langs represents a map of Lang to language name.

var Scripts = map[*unicode.RangeTable]string{
	unicode.Arabic:     "Arabic",
	unicode.Bengali:    "Bengali",
	unicode.Cyrillic:   "Cyrillic",
	unicode.Ethiopic:   "Ethiopic",
	unicode.Devanagari: "Devanagari",
	unicode.Han:        "Han",
	unicode.Georgian:   "Georgian",
	unicode.Greek:      "Greek",
	unicode.Gujarati:   "Gujarati",
	unicode.Gurmukhi:   "Gurmukhi",
	unicode.Hangul:     "Hangul",
	unicode.Hebrew:     "Hebrew",
	unicode.Hiragana:   "Hiragana",
	unicode.Kannada:    "Kannada",
	unicode.Katakana:   "Katakana",
	unicode.Khmer:      "Khmer",
	unicode.Latin:      "Latin",
	unicode.Malayalam:  "Malayalam",
	unicode.Myanmar:    "Myanmar",
	unicode.Oriya:      "Oriya",
	unicode.Sinhala:    "Sinhala",
	unicode.Tamil:      "Tamil",
	unicode.Telugu:     "Telugu",
	unicode.Thai:       "Thai",
}

Scripts maps each supported Unicode script table to its name.

Functions

func DetectScript

func DetectScript(text string) *unicode.RangeTable

DetectScript returns only the script of the given text.
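Because scripts are plain *unicode.RangeTable values, the standard unicode package is enough to experiment with the concept. The following is a rough sketch of picking a script by counting runes per table. It is not the package's implementation, and dominantScript and the short candidates list are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// candidates lists a few script tables in a fixed order so that ties
// resolve deterministically. Only four scripts are included here.
var candidates = []struct {
	name  string
	table *unicode.RangeTable
}{
	{"Latin", unicode.Latin},
	{"Cyrillic", unicode.Cyrillic},
	{"Hebrew", unicode.Hebrew},
	{"Greek", unicode.Greek},
}

// dominantScript returns the name of the candidate script that covers
// the most runes in text, or "" when no rune belongs to any candidate.
func dominantScript(text string) string {
	counts := make([]int, len(candidates))
	for _, r := range text {
		for i, c := range candidates {
			if unicode.Is(c.table, r) {
				counts[i]++
				break
			}
		}
	}
	best, bestCount := "", 0
	for i, c := range candidates {
		if counts[i] > bestCount {
			best, bestCount = c.name, counts[i]
		}
	}
	return best
}

func main() {
	fmt.Println(dominantScript("Привет, world"))
}
```

Digits, punctuation, and spaces belong to none of the candidate tables, so they do not influence the result.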

func LangToString

func LangToString(lang Lang) string

LangToString converts enum into ISO 639-3 code as a string. Deprecated: LangToString is deprecated and exists for historical compatibility. Please use `Lang.Iso6393()` instead.

func LangToStringShort

func LangToStringShort(lang Lang) string

LangToStringShort converts enum into ISO 639-1 code as a string. It returns an empty string when there is no ISO 639-1 code. Deprecated: LangToStringShort is deprecated and exists for historical compatibility. Please use `Lang.Iso6391()` instead.

Types

type Info

type Info struct {
	Lang       Lang
	Script     *unicode.RangeTable
	Confidence float64
}

Info represents a full outcome of language detection.

func Detect

func Detect(text string) Info

Detect detects the language and script of the given text.

func DetectWithOptions

func DetectWithOptions(text string, options Options) Info

DetectWithOptions detects the language and script of the given text with the provided options.

func (*Info) IsReliable

func (info *Info) IsReliable() bool

IsReliable returns true if Confidence is greater than ReliableConfidenceThreshold.
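The comparison is strict, which matters at the boundary. The sketch below mirrors the documented rule with local stand-ins so it runs without the package: result and reliableThreshold are hypothetical names, and 0.8 is the documented value of ReliableConfidenceThreshold.

```go
package main

import "fmt"

// reliableThreshold mirrors the documented ReliableConfidenceThreshold.
const reliableThreshold = 0.8

// result is a local stand-in for the Confidence field of Info.
type result struct {
	Confidence float64
}

// isReliable reports whether the confidence strictly exceeds the threshold.
func (r result) isReliable() bool {
	return r.Confidence > reliableThreshold
}

func main() {
	for _, c := range []float64{0.5, 0.8, 0.93} {
		fmt.Println(c, result{Confidence: c}.isReliable())
	}
}
```

A confidence of exactly 0.8 is therefore treated as not reliable under this rule.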

type Lang

type Lang int

Lang represents a language following the ISO 639-3 standard.

const (
	Afr Lang = iota
	Aka
	Amh
	Arb
	Azj
	Bel
	Ben
	Bho
	Bul
	Ceb
	Ces
	Cmn
	Dan
	Deu
	Ell
	Eng
	Epo
	Est
	Fin
	Fra
	Guj
	Hat
	Hau
	Heb
	Hin
	Hrv
	Hun
	Ibo
	Ilo
	Ind
	Ita
	Jav
	Jpn
	Kan
	Kat
	Khm
	Kin
	Kor
	Kur
	Lav
	Lit
	Mai
	Mal
	Mar
	Mkd
	Mlg
	Mya
	Nep
	Nld
	Nno
	Nob
	Nya
	Ori
	Orm
	Pan
	Pes
	Pol
	Por
	Ron
	Run
	Rus
	Sin
	Skr
	Slv
	Sna
	Som
	Spa
	Srp
	Swe
	Tam
	Tel
	Tgl
	Tha
	Tir
	Tuk
	Tur
	Uig
	Ukr
	Urd
	Uzb
	Vie
	Ydd
	Yor
	Zul
)

The constants above are named after the ISO 639-3 codes of the languages they represent.

func CodeToLang

func CodeToLang(code string) Lang

CodeToLang gets enum by ISO 639-3 code as a string.

func DetectLang

func DetectLang(text string) Lang

DetectLang detects only the language of the given text.

func DetectLangWithOptions

func DetectLangWithOptions(text string, options Options) Lang

DetectLangWithOptions detects only the language of the given text with the provided options.

func (Lang) Iso6391

func (lang Lang) Iso6391() string

Iso6391 returns the ISO 639-1 code of Lang as a string.

func (Lang) Iso6393

func (lang Lang) Iso6393() string

Iso6393 returns the ISO 639-3 code of Lang as a string.

func (Lang) String

func (lang Lang) String() string

String returns the human-readable name of the language as a string.
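The relationship between the integer constants, their codes, and their names can be pictured with a small self-contained sketch. Everything in it is a local stand-in: lang, the three constants, the lookup tables, and codeToLang are hypothetical, and returning -1 for an unknown code is a choice made for this example rather than documented behavior of CodeToLang.

```go
package main

import "fmt"

// lang is a local stand-in for the package's Lang type.
type lang int

// Three illustrative constants; the real list is much longer.
const (
	afr lang = iota
	eng
	epo
)

// Parallel tables indexed by the constant's value.
var (
	iso6393 = [...]string{afr: "afr", eng: "eng", epo: "epo"}
	iso6391 = [...]string{afr: "af", eng: "en", epo: "eo"}
	names   = [...]string{afr: "Afrikaans", eng: "English", epo: "Esperanto"}
)

// String returns the human-readable name, or "" for an unknown value.
func (l lang) String() string {
	if l < 0 || int(l) >= len(names) {
		return ""
	}
	return names[l]
}

// codeToLang is the reverse lookup from an ISO 639-3 code.
// It returns -1 for unknown codes (an assumption of this sketch).
func codeToLang(code string) lang {
	for i, c := range iso6393 {
		if c == code {
			return lang(i)
		}
	}
	return -1
}

func main() {
	l := codeToLang("epo")
	fmt.Println(l.String(), iso6393[l], iso6391[l])
}
```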

type Options

type Options struct {
	Whitelist map[Lang]bool
	Blacklist map[Lang]bool
}

Options represents options that can be set when detecting a language and/or script, such as blacklisting languages to skip.
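One plausible way to read the two maps is sketched below with local stand-in types. The precedence rule shown here (a non-empty whitelist wins and the blacklist is consulted only otherwise) is an assumption of this example, and lang, options, and allowed are hypothetical names rather than part of the package.

```go
package main

import "fmt"

// lang is a local stand-in for the package's Lang type.
type lang string

// options mirrors the shape of Options with the stand-in type.
type options struct {
	Whitelist map[lang]bool
	Blacklist map[lang]bool
}

// allowed reports whether a candidate language should be considered.
// Assumed semantics: a non-empty whitelist restricts candidates to its
// entries; otherwise any language marked in the blacklist is skipped.
func allowed(l lang, o options) bool {
	if len(o.Whitelist) > 0 {
		return o.Whitelist[l]
	}
	return !o.Blacklist[l]
}

func main() {
	white := options{Whitelist: map[lang]bool{"epo": true, "ukr": true}}
	black := options{Blacklist: map[lang]bool{"ydd": true}}

	fmt.Println(allowed("epo", white), allowed("eng", white))
	fmt.Println(allowed("heb", black), allowed("ydd", black))
}
```

Reading from a nil map is safe in Go, so the zero value of options allows every language in this sketch.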