README

Whatlanggo

Natural language detection for Go.

Features

  • Supports 84 languages
  • 100% written in Go
  • No external dependencies
  • Fast
  • Recognizes not only a language, but also a script (Latin, Cyrillic, etc)

Getting started

Installation:

    go get -u github.com/abadojack/whatlanggo

Simple usage example:

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Detect returns the language, script and confidence for the given text.
	info := whatlanggo.Detect("Foje funkcias kaj foje ne funkcias")
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script], "Confidence:", info.Confidence)
}

Blacklisting and whitelisting

package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Blacklist: languages listed here are skipped during detection.
	options := whatlanggo.Options{
		Blacklist: map[whatlanggo.Lang]bool{
			whatlanggo.Ydd: true,
		},
	}

	info := whatlanggo.DetectWithOptions("האקדמיה ללשון העברית", options)

	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])

	// Whitelist: only the languages listed here are considered.
	options1 := whatlanggo.Options{
		Whitelist: map[whatlanggo.Lang]bool{
			whatlanggo.Epo: true,
			whatlanggo.Ukr: true,
		},
	}

	info = whatlanggo.DetectWithOptions("Mi ne scias", options1)
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])
}

For more details, please check the documentation.

Requirements

Go 1.8 or higher

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, see the original paper by Cavnar and Trenkle (1994), "N-Gram-Based Text Categorization".
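To make the trigram idea concrete, here is a minimal sketch of the rank-order ("out-of-place") comparison described by Cavnar and Trenkle. It is not whatlanggo's implementation: the helper names trigramProfile and outOfPlace are hypothetical, and the two tiny "language profiles" are built from single sentences purely for illustration, whereas a real detector uses profiles precomputed from large corpora.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramProfile (hypothetical helper) returns the distinct trigrams of text,
// ordered from most to least frequent. Ties are broken alphabetically so the
// result is deterministic.
func trigramProfile(text string) []string {
	// Lowercase and pad with spaces so word boundaries show up in trigrams.
	runes := []rune(" " + strings.ToLower(text) + " ")
	counts := map[string]int{}
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	profile := make([]string, 0, len(counts))
	for t := range counts {
		profile = append(profile, t)
	}
	sort.Slice(profile, func(a, b int) bool {
		if counts[profile[a]] != counts[profile[b]] {
			return counts[profile[a]] > counts[profile[b]]
		}
		return profile[a] < profile[b]
	})
	return profile
}

// outOfPlace (hypothetical helper) sums, for every trigram in the text
// profile, how far its rank is from its rank in the language profile.
// Trigrams missing from the language profile get a fixed maximum penalty.
func outOfPlace(text, lang []string) int {
	maxPenalty := len(lang)
	rank := make(map[string]int, len(lang))
	for i, t := range lang {
		rank[t] = i
	}
	dist := 0
	for i, t := range text {
		r, ok := rank[t]
		switch {
		case !ok:
			dist += maxPenalty
		case r > i:
			dist += r - i
		default:
			dist += i - r
		}
	}
	return dist
}

func main() {
	english := trigramProfile("the quick brown fox jumps over the lazy dog and the cat")
	german := trigramProfile("der schnelle braune fuchs springt über den faulen hund und die katze")
	sample := trigramProfile("the dog and the fox")

	// The language with the smallest distance is the best match.
	fmt.Println("distance to English:", outOfPlace(sample, english))
	fmt.Println("distance to German: ", outOfPlace(sample, german))
}
```

The sample shares most of its trigrams with the English sentence and almost none with the German one, so its distance to the English profile comes out smaller.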

How is IsReliable calculated?

It is based on the following factors:

  • How many unique trigrams are in the given text.
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, it can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola, and it looks like the following:

[Figure: Language recognition whatlang rust]

For more details, see the blog article "Introduction to Rust Whatlang Library and Natural Language Identification Algorithms".
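The two factors above can be combined in a sketch like the following. Only the name rate and the 0.8 threshold (ReliableConfidenceThreshold in the package documentation) come from this document. The formula for rate, the hyperbola constants 3.0 and 0.015, and all function names are assumptions chosen for illustration and may differ from the actual code.

```go
package main

import "fmt"

// Illustrative constants. These are assumptions, not values taken from
// this README.
const (
	hyperbolaScale    = 3.0
	hyperbolaOffset   = 0.015
	reliableThreshold = 0.8 // mirrors the documented ReliableConfidenceThreshold
)

// rate is the relative gap between the best and second-best language
// distances (a lower distance is a better match). Assumed definition.
func rate(best, second int) float64 {
	if second == 0 {
		return 0
	}
	return float64(second-best) / float64(second)
}

// requiredRate is the hyperbolic threshold: the fewer unique trigrams a text
// has, the larger the gap between the top two languages must be.
func requiredRate(uniqueTrigrams int) float64 {
	return hyperbolaScale/float64(uniqueTrigrams) + hyperbolaOffset
}

// confidence maps the observed rate onto [0, 1] relative to requiredRate.
func confidence(best, second, uniqueTrigrams int) float64 {
	if uniqueTrigrams == 0 {
		return 0
	}
	c := rate(best, second) / requiredRate(uniqueTrigrams)
	if c > 1 {
		return 1
	}
	return c
}

// isReliable applies the fixed threshold to the confidence value.
func isReliable(best, second, uniqueTrigrams int) bool {
	return confidence(best, second, uniqueTrigrams) > reliableThreshold
}

func main() {
	// Long text, clear winner: large gap and many unique trigrams.
	fmt.Println(isReliable(100, 200, 30))
	// Short text, near tie: small gap and few unique trigrams.
	fmt.Println(isReliable(195, 200, 10))
}
```

Because requiredRate falls off as 1/n in the number of unique trigrams, the boundary between the two areas is a hyperbola in the (unique trigrams, rate) plane, which matches the shape described above.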

License

MIT

Derivation

whatlanggo is a derivative of Franc (JavaScript, MIT) by Titus Wormer.

Acknowledgements

Thanks to greyblake (Potapov Sergey) for creating whatlang-rs, from which I got the idea and algorithms.

Documentation

Overview

    Package whatlanggo detects natural languages and scripts (writing systems). Languages are represented by a fixed list of constants, while scripts are represented by *unicode.RangeTable.

    Constants

    const ReliableConfidenceThreshold = 0.8

      ReliableConfidenceThreshold is the confidence rating that must be exceeded for the language detection to be considered reliable.

      Variables

      var Langs = map[Lang]string{
      	Afr: "Afrikaans",
      	Aka: "Akan",
      	Amh: "Amharic",
      	Arb: "Arabic",
      	Azj: "Azerbaijani",
      	Bel: "Belarusian",
      	Ben: "Bengali",
      	Bho: "Bhojpuri",
      	Bul: "Bulgarian",
      	Ceb: "Cebuano",
      	Ces: "Czech",
      	Cmn: "Mandarin",
      	Dan: "Danish",
      	Deu: "German",
      	Ell: "Greek",
      	Eng: "English",
      	Epo: "Esperanto",
      	Est: "Estonian",
      	Fin: "Finnish",
      	Fra: "French",
      	Guj: "Gujarati",
      	Hat: "Haitian Creole",
      	Hau: "Hausa",
      	Heb: "Hebrew",
      	Hin: "Hindi",
      	Hrv: "Croatian",
      	Hun: "Hungarian",
      	Ibo: "Igbo",
      	Ilo: "Ilocano",
      	Ind: "Indonesian",
      	Ita: "Italian",
      	Jav: "Javanese",
      	Jpn: "Japanese",
      	Kan: "Kannada",
      	Kat: "Georgian",
      	Khm: "Khmer",
      	Kin: "Kinyarwanda",
      	Kor: "Korean",
      	Kur: "Kurdish",
      	Lav: "Latvian",
      	Lit: "Lithuanian",
      	Mai: "Maithili",
      	Mal: "Malayalam",
      	Mar: "Marathi",
      	Mkd: "Macedonian",
      	Mlg: "Malagasy",
      	Mya: "Burmese",
      	Nep: "Nepali",
      	Nld: "Dutch",
      	Nno: "Nynorsk",
      	Nob: "Bokmal",
      	Nya: "Chewa",
      	Ori: "Oriya",
      	Orm: "Oromo",
      	Pan: "Punjabi",
      	Pes: "Persian",
      	Pol: "Polish",
      	Por: "Portuguese",
      	Ron: "Romanian",
      	Run: "Rundi",
      	Rus: "Russian",
      	Sin: "Sinhalese",
      	Skr: "Saraiki",
      	Slv: "Slovene",
      	Sna: "Shona",
      	Som: "Somali",
      	Spa: "Spanish",
      	Srp: "Serbian",
      	Swe: "Swedish",
      	Tam: "Tamil",
      	Tel: "Telugu",
      	Tgl: "Tagalog",
      	Tha: "Thai",
      	Tir: "Tigrinya",
      	Tuk: "Turkmen",
      	Tur: "Turkish",
      	Uig: "Uyghur",
      	Ukr: "Ukrainian",
      	Urd: "Urdu",
      	Uzb: "Uzbek",
      	Vie: "Vietnamese",
      	Ydd: "Yiddish",
      	Yor: "Yoruba",
      	Zul: "Zulu",
      }

        Langs represents a map of Lang to language name.

        var Scripts = map[*unicode.RangeTable]string{
        	unicode.Arabic:     "Arabic",
        	unicode.Bengali:    "Bengali",
        	unicode.Cyrillic:   "Cyrillic",
        	unicode.Ethiopic:   "Ethiopic",
        	unicode.Devanagari: "Devanagari",
        	unicode.Han:        "Han",
        	unicode.Georgian:   "Georgian",
        	unicode.Greek:      "Greek",
        	unicode.Gujarati:   "Gujarati",
        	unicode.Gurmukhi:   "Gurmukhi",
        	unicode.Hangul:     "Hangul",
        	unicode.Hebrew:     "Hebrew",
        	unicode.Hiragana:   "Hiragana",
        	unicode.Kannada:    "Kannada",
        	unicode.Katakana:   "Katakana",
        	unicode.Khmer:      "Khmer",
        	unicode.Latin:      "Latin",
        	unicode.Malayalam:  "Malayalam",
        	unicode.Myanmar:    "Myanmar",
        	unicode.Oriya:      "Oriya",
        	unicode.Sinhala:    "Sinhala",
        	unicode.Tamil:      "Tamil",
        	unicode.Telugu:     "Telugu",
        	unicode.Thai:       "Thai",
        }

          Scripts is the set of Unicode script tables.

          Functions

          func DetectScript

          func DetectScript(text string) *unicode.RangeTable

            DetectScript returns only the script of the given text.
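As a rough illustration of what script detection over *unicode.RangeTable values can look like, the sketch below counts how many runes of a text fall into each candidate table and picks the most frequent one. This is a simplified assumption, not the package's actual algorithm. The names candidateScripts and dominantScript are hypothetical, and only three of the tables from Scripts are included.

```go
package main

import (
	"fmt"
	"unicode"
)

// candidateScripts is a small, hypothetical subset of script tables.
var candidateScripts = []struct {
	name  string
	table *unicode.RangeTable
}{
	{"Latin", unicode.Latin},
	{"Cyrillic", unicode.Cyrillic},
	{"Hebrew", unicode.Hebrew},
}

// dominantScript returns the name of the candidate script with the most
// runes in text, or "" if no rune belongs to any candidate.
func dominantScript(text string) string {
	counts := make([]int, len(candidateScripts))
	for _, r := range text {
		for i, c := range candidateScripts {
			if unicode.Is(c.table, r) {
				counts[i]++
				break
			}
		}
	}
	best, bestCount := "", 0
	for i, c := range candidateScripts {
		if counts[i] > bestCount {
			best, bestCount = c.name, counts[i]
		}
	}
	return best
}

func main() {
	fmt.Println(dominantScript("Mi ne scias"))
	fmt.Println(dominantScript("האקדמיה ללשון העברית"))
}
```

Digits, punctuation and spaces belong to none of these tables, so they do not influence the result.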

            func LangToString

            func LangToString(lang Lang) string

              LangToString converts a Lang into its ISO 639-3 code as a string. Deprecated: LangToString is deprecated and exists for historical compatibility. Please use `Lang.Iso6393()` instead.

              func LangToStringShort

              func LangToStringShort(lang Lang) string

                LangToStringShort converts a Lang into its ISO 639-1 code as a string. It returns an empty string when there is no ISO 639-1 code. Deprecated: LangToStringShort is deprecated and exists for historical compatibility. Please use `Lang.Iso6391()` instead.

                Types

                type Info

                type Info struct {
                	Lang       Lang
                	Script     *unicode.RangeTable
                	Confidence float64
                }

                  Info represents a full outcome of language detection.

                  func Detect

                  func Detect(text string) Info

                    Detect detects the language and script of the given text.

                    func DetectWithOptions

                    func DetectWithOptions(text string, options Options) Info

                      DetectWithOptions detects the language and script of the given text with the provided options.

                      func (*Info) IsReliable

                      func (info *Info) IsReliable() bool

                        IsReliable returns true if Confidence is greater than ReliableConfidenceThreshold.

                        type Lang

                        type Lang int

                          Lang represents a language following ISO 639-3 standard.

                          const (
                          	Afr Lang = iota
                          	Aka
                          	Amh
                          	Arb
                          	Azj
                          	Bel
                          	Ben
                          	Bho
                          	Bul
                          	Ceb
                          	Ces
                          	Cmn
                          	Dan
                          	Deu
                          	Ell
                          	Eng
                          	Epo
                          	Est
                          	Fin
                          	Fra
                          	Guj
                          	Hat
                          	Hau
                          	Heb
                          	Hin
                          	Hrv
                          	Hun
                          	Ibo
                          	Ilo
                          	Ind
                          	Ita
                          	Jav
                          	Jpn
                          	Kan
                          	Kat
                          	Khm
                          	Kin
                          	Kor
                          	Kur
                          	Lav
                          	Lit
                          	Mai
                          	Mal
                          	Mar
                          	Mkd
                          	Mlg
                          	Mya
                          	Nep
                          	Nld
                          	Nno
                          	Nob
                          	Nya
                          	Ori
                          	Orm
                          	Pan
                          	Pes
                          	Pol
                          	Por
                          	Ron
                          	Run
                          	Rus
                          	Sin
                          	Skr
                          	Slv
                          	Sna
                          	Som
                          	Spa
                          	Srp
                          	Swe
                          	Tam
                          	Tel
                          	Tgl
                          	Tha
                          	Tir
                          	Tuk
                          	Tur
                          	Uig
                          	Ukr
                          	Urd
                          	Uzb
                          	Vie
                          	Ydd
                          	Yor
                          	Zul
                          )

                            func CodeToLang

                            func CodeToLang(code string) Lang

                              CodeToLang returns the Lang for the given ISO 639-3 code.

                              func DetectLang

                              func DetectLang(text string) Lang

                                DetectLang detects only the language of the given text.

                                func DetectLangWithOptions

                                func DetectLangWithOptions(text string, options Options) Lang

                                  DetectLangWithOptions detects only the language of the given text with the provided options.

                                  func (Lang) Iso6391

                                  func (lang Lang) Iso6391() string

                                    Iso6391 returns ISO 639-1 code of Lang as a string.

                                    func (Lang) Iso6393

                                    func (lang Lang) Iso6393() string

                                      Iso6393 returns ISO 639-3 code of Lang as a string.

                                      func (Lang) String

                                      func (lang Lang) String() string

                                        String returns the human-readable name of the language as a string.

                                        type Options

                                        type Options struct {
                                        	Whitelist map[Lang]bool
                                        	Blacklist map[Lang]bool
                                        }

                                          Options represents options that can be set when detecting a language and/or script, such as blacklisting languages to skip.