langid

Version: v0.1.0
Published: Jun 18, 2026 License: BSD-2-Clause-Views Imports: 7 Imported by: 0

README

langid-go

langid-go is a high-performance Go port of the popular language identification tool langid.py and its C counterpart langid.c.

Like the originals, it comes pre-trained on 97 languages and is virtually insensitive to domain-specific features (e.g. HTML/XML markup). By leveraging Go's concurrency primitives (sync.Pool) and a flat-array "sparse set" architecture borrowed from langid.c, this port achieves zero-allocation inference on the hot loop, making it extremely fast and suitable for high-throughput stream processing.
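
The combination of a flat-array sparse set and sync.Pool described above can be sketched as follows. This is a minimal illustration, not langid-go's actual internals: the sparseSet type, the countFeatures helper, and the universe size of 1024 are all hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// sparseSet tracks which feature IDs were seen and how often, with O(1)
// add and O(1) reset. Hypothetical type, not langid-go's real data structure.
type sparseSet struct {
	sparse []int // feature ID -> position in dense
	dense  []int // feature IDs seen so far, in insertion order
	counts []int // counts[i] is the count for dense[i]
}

func newSparseSet(universe int) *sparseSet {
	return &sparseSet{
		sparse: make([]int, universe),
		dense:  make([]int, 0, universe),
		counts: make([]int, 0, universe),
	}
}

// add increments the count for id, registering it on first sight.
func (s *sparseSet) add(id int) {
	i := s.sparse[id]
	if i < len(s.dense) && s.dense[i] == id {
		s.counts[i]++
		return
	}
	s.sparse[id] = len(s.dense)
	s.dense = append(s.dense, id) // never reallocates: capacity == universe
	s.counts = append(s.counts, 1)
}

// count returns how many times id was added since the last reset.
func (s *sparseSet) count(id int) int {
	i := s.sparse[id]
	if i < len(s.dense) && s.dense[i] == id {
		return s.counts[i]
	}
	return 0
}

// reset empties the set without clearing or freeing the backing arrays.
func (s *sparseSet) reset() {
	s.dense = s.dense[:0]
	s.counts = s.counts[:0]
}

const universe = 1024 // assumed number of distinct features

var pool = sync.Pool{New: func() any { return newSparseSet(universe) }}

// countFeatures borrows a set from the pool, fills it, and returns the
// number of distinct feature IDs seen.
func countFeatures(ids []int) int {
	s := pool.Get().(*sparseSet)
	defer func() { s.reset(); pool.Put(s) }()
	for _, id := range ids {
		s.add(id)
	}
	return len(s.dense)
}

func main() {
	fmt.Println(countFeatures([]int{3, 7, 3, 900})) // prints 3
}
```

Because reset only truncates the dense slices, a pooled set can be reused across calls without allocating, which is the property the zero-allocation claim relies on.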

Background & Motivation

In building production-grade language classification pipelines in Go, developers face significant gaps in the native NLP ecosystem:

  • Limitations of lingua-go: While feature-rich, the popular lingua-go library has been largely abandoned for years. Operationally, it suffers from severe bugs when processing short texts, where it frequently misclassifies inputs and returns random or incorrect languages. It also introduces high computational and memory overhead.
  • Limitations of whatlanggo: While extremely fast, whatlanggo supports a limited set of languages and formats its output exclusively in ISO 639-3 (three-letter codes), which is incompatible with downstream pipelines that require standard ISO 639-1 two-letter codes.
  • Fragility of CGO Wrappers: Previous attempts to bring the proven, robust Naive Bayes algorithm of langid to Go relied entirely on fragile CGO bindings (such as dbalan/langid_go wrapping langid.c). CGO introduces severe runtime thread overhead, interferes with Go's garbage collector and memory tracking, and complicates cross-compilation.

langid-go solves these issues by offering a pure, 100% Go implementation that achieves exact mathematical parity with the original Python unpickler and Naive Bayes vector engine, while running with zero allocations in standard concurrency-safe hot paths.

Features

  • Pre-trained on 97 languages (ISO 639-1 codes).
  • Embedded Model: Zero dependencies. The default model is compiled directly into the binary via go:embed.
  • Zero-Allocation Inference: Highly optimized engine minimizes garbage collection overhead.
  • Language Subsetting: Restrict predictions to a known subset of languages for improved accuracy and speed.
  • Probability Normalization: Output standardized probabilities (0.0 - 1.0) rather than raw log-scores.
  • Versatile CLI: Includes interactive, batch, and stream modes.
  • CGO-Free: 100% pure Go, guaranteeing simple cross-compilation and native GC performance.

Compatibility Matrix

langid-go has been designed as a fully-featured, production-ready superset of the original ecosystem libraries:

| Feature | langid.py | langid.c | langid.js | langid-go |
|---|---|---|---|---|
| Default 97-language model | yes | yes | yes | yes |
| Custom model loading | yes | yes | via generated JS | yes (via .lidg binary) |
| Classify text | yes | yes | yes | yes |
| Return raw log-score | yes | no | yes (as ranks) | yes |
| Rank all languages | yes | no | yes | yes |
| Normalize probabilities | yes | no | no | yes |
| Language subsetting | yes | no | no | yes |
| Reset language subset | yes | no | no | yes |
| File helper API | yes | CLI only | no | yes |
| CLI document mode | yes | yes | no | yes |
| CLI line mode | yes | yes | no | yes |
| CLI batch mode | yes | yes | no | yes |
| Python-compatible flags | yes | partial | no | yes |
| URL classification | yes | no | no | yes |
| HTTP service mode | yes | no | no | yes |
| Web browser demo | yes | no | yes | yes |
| Training tools | yes | no | no | Planned Future Feature (TODO) |

Library Usage

go get github.com/ilpy20/langid-go@latest

Basic Usage

The simplest way to classify text is by using the package-level Classify function, which automatically loads the embedded model on its first invocation.

package main

import (
	"fmt"
	"log"

	"github.com/ilpy20/langid-go"
)

func main() {
	res, err := langid.Classify("This is a short English sentence")
	if err != nil {
		log.Fatal(err)
	}
	// res.Language == "en"; res.Score is the raw log score (a negative number).
	fmt.Printf("Language: %s\n", res.Language)
}

Advanced Usage

To utilize advanced features like probability normalization, language ranking, or subsetting, you must instantiate an Identifier:

package main

import (
	"fmt"
	"log"

	"github.com/ilpy20/langid-go"
)

func main() {
	id, err := langid.NewDefaultIdentifier()
	if err != nil {
		log.Fatal(err)
	}

	// 1. Restrict the language set.
	// This can improve accuracy and speed when your domain is known.
	// Use SetLanguages(...) instead of the deprecated KeepOnly(...).
	if err := id.SetLanguages("en", "fr", "es", "de", "it"); err != nil {
		log.Fatal(err)
	}

	// 2. Rank all active languages instead of returning only the best match.
	results, err := id.RankString("Bonjour tout le monde")
	if err != nil {
		log.Fatal(err)
	}

	// 3. Normalize raw log-scores in place into a 0.0 - 1.0 probability distribution.
	langid.Normalize(results)

	// Print the top three results (fewer if the list is shorter).
	for i := 0; i < 3 && i < len(results); i++ {
		fmt.Printf("%s: %.2f%%\n", results[i].Language, results[i].Score*100)
	}
	// Output:
	// fr: 99.98%
	// en: 0.02%
	// es: 0.00%
}

Note: You can restore the full language list at any time by calling id.ResetLanguages() or by calling id.SetLanguages() with no arguments.

File Helper APIs

langid-go provides optimized native file-reading classification helpers at both the package and instance levels, ensuring clean error propagation:

// Package-level helpers (use the default embedded model)
res, err := langid.IdentifyFile("document.txt")
results, err := langid.RankFile("document.txt")

// Instance-level helpers (id is an *Identifier, e.g. from NewDefaultIdentifier)
res, err = id.IdentifyFile("document.txt")
results, err = id.RankFile("document.txt")

URL Classification API

The urlclass package provides a programmatic client to fetch and classify the text contents of standard web pages with automatic timeout management:

package main

import (
	"fmt"
	"time"

	"github.com/ilpy20/langid-go"
	"github.com/ilpy20/langid-go/urlclass"
)

func main() {
	id, err := langid.NewDefaultIdentifier()
	if err != nil {
		panic(err)
	}
	client := urlclass.NewClient(id)

	// Fetch a URL and classify its language with a 5-second timeout
	res, bytesFetched, err := client.ClassifyURL("https://example.com", 5*time.Second)
	if err != nil {
		panic(err)
	}

	fmt.Printf("Fetched %d bytes. Language: %s (Log Score: %.2f)\n", bytesFetched, res.Language, res.Score)
}

HTTP Service API

The service package provides a highly-concurrent HTTP router and server wrapping langid for exposing classification over REST endpoints or hosting an interactive local sandbox:

package main

import (
	"github.com/ilpy20/langid-go"
	"github.com/ilpy20/langid-go/service"
)

func main() {
	id, err := langid.NewDefaultIdentifier()
	if err != nil {
		panic(err)
	}
	srv := service.NewServer(id)

	// Starts an HTTP server on http://127.0.0.1:9008
	if err := srv.Start("127.0.0.1", 9008); err != nil {
		panic(err)
	}
}

CLI Usage

The langid command is a pure-Go CLI that is backwards-compatible with the original Python and C versions and adds batch output formats, URL classification, and an HTTP service mode.

Build

go build ./cmd/langid

Options

Usage of langid:
  -m, --model string
    	path to .lidg model (optional, uses default if omitted)
  -l, --langs string
    	comma-separated set of target ISO 639 language codes (e.g. en,de)
      --line
    	line mode: classify each input line (legacy alias: -l)
  -b, --batch
    	batch mode: treat stdin lines as file paths to classify
  -d, --dist
    	show full distribution over languages (rank mode)
  -n, --normalize
    	normalize confidence scores to probability values (0.0 to 1.0)
  -f, --format string
    	output format for batch mode: classic, csv, or jsonl (default "classic")
      --ignore-missing
    	silently skip missing or unreadable files in batch mode
      --serve
    	start HTTP service mode
      --demo
    	start HTTP service mode and open demo page in web browser
      --host string
    	host to bind HTTP service to (default "127.0.0.1")
      --port int
    	port to bind HTTP service to (default 9008)
  -u, --url string
    	classify the content of a URL

Modes of Operation

1. Standard / Pipe Mode

Process a single document from standard input.

$ ./langid -n <<< "Hello World"
('en', 1.0000)

2. Interactive REPL Mode

Running ./langid with standard input attached to a terminal automatically boots an interactive shell.

$ ./langid -n
>>> Hello World
('en', 1.0000)
>>> Bonjour tout le monde
('fr', 1.0000)

3. Line Mode (--line)

Process standard input line-by-line, treating each as a distinct classification job.

$ printf "hello world\nbonjour tout le monde\n" | ./langid --line
('en', -102.5)
('fr', -105.1)

4. Batch Mode (-b / --batch / -f / --format)

Treat inputs as file paths, processing them in bulk. Files can be passed directly as command-line arguments, falling back to reading paths from stdin if none are specified.

Output Formats (--format classic | csv | jsonl)
  • classic (default): Prints path,('lang', score)
  • csv:
    • In standard mode, outputs: path,lang,score (no header)
    • In distribution mode (-d), outputs a header row followed by one row of scores per file, with one column per supported language:
      path,en,fr,es,de,...
      file1.txt,-105.1,-240.2,...
      
  • jsonl:
    • In standard mode:
      {"path":"file1.txt","language":"en","confidence":-105.1}
      
    • In distribution mode (-d):
      {"path":"file1.txt","ranking":[{"language":"en","score":-105.1},{"language":"fr","score":-240.2},...]}
      
Command Examples:
# Pass files directly as arguments
./langid --batch --format csv file1.txt file2.txt

# Pipe file lists from Unix utilities
find . -name "*.md" | ./langid -b -n --format jsonl

# Ignore missing/unreadable files (instead of returning "NOSUCHFILE")
./langid --batch --ignore-missing file1.txt missing_file.txt

5. URL Classification (-u / --url)

Directly retrieve and classify webpage contents from the command line:

$ ./langid --url "https://example.com" -n
https://example.com 1256 ('en', 1.0000)

The output lists the target URL, the response body length in bytes, and the predicted language with its score.

6. HTTP Server & Web Demo Mode (--serve / --demo)

Expose language identification as an HTTP microservice:

# Starts service locally
./langid --serve --port 9008

# Starts service and opens the interactive jQuery sandbox demo in your default browser
./langid --demo

API Specifications:
  • POST /detect or GET /detect?q=<text>: Predict the language.
    • Request Body: Raw text or standard application/x-www-form-urlencoded string containing parameter q.
    • Response Envelope:
      {
        "responseData": {
          "language": "en",
          "confidence": 1.0
        },
        "responseDetails": null,
        "responseStatus": 200
      }
      
  • POST /rank or GET /rank?q=<text>: Retrieve full confidence list.
    • Response Envelope:
      {
        "responseData": [
          ["en", 1.0],
          ["fr", 0.0],
          ["es", 0.0]
        ],
        "responseDetails": null,
        "responseStatus": 200
      }
      
  • GET /demo: Returns the interactive web UI sandbox.
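
A Go client for the /detect endpoint might look like the sketch below. It assumes only the response envelope shown above; the detectEnvelope type and the parseDetect and detect helpers are hypothetical names, and main talks to a local stub server that returns the documented example response rather than to a real langid service.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// detectEnvelope mirrors the documented /detect response envelope.
type detectEnvelope struct {
	ResponseData struct {
		Language   string  `json:"language"`
		Confidence float64 `json:"confidence"`
	} `json:"responseData"`
	ResponseDetails json.RawMessage `json:"responseDetails"` // null in the documented example
	ResponseStatus  int             `json:"responseStatus"`
}

// parseDetect decodes a /detect response body.
func parseDetect(body []byte) (detectEnvelope, error) {
	var env detectEnvelope
	if err := json.Unmarshal(body, &env); err != nil {
		return env, fmt.Errorf("decoding /detect response: %w", err)
	}
	return env, nil
}

// detect sends GET /detect?q=<text> to baseURL and parses the reply.
func detect(baseURL, text string) (detectEnvelope, error) {
	resp, err := http.Get(baseURL + "/detect?q=" + url.QueryEscape(text))
	if err != nil {
		return detectEnvelope{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return detectEnvelope{}, fmt.Errorf("unexpected HTTP status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return detectEnvelope{}, err
	}
	return parseDetect(body)
}

func main() {
	// Stub standing in for `langid --serve`; it always returns the
	// example envelope from the specification above.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"responseData":{"language":"en","confidence":1.0},"responseDetails":null,"responseStatus":200}`)
	}))
	defer stub.Close()

	env, err := detect(stub.URL, "Hello World")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s (%.1f)\n", env.ResponseData.Language, env.ResponseData.Confidence)
}
```

Against a running service, the stub URL would be replaced with the address passed to --host and --port, for example http://127.0.0.1:9008.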

Model Training & Customization

Scope and Decisions

langid-go is designed as a high-speed, highly concurrent, zero-allocation inference engine. To keep the Go package optimized, secure, and free from external runtime dependencies or floating-point precision drift, the following architectural choices have been made:

  • Go-Native Training is a Planned Future Feature (TODO): Model training requires a multi-stage statistical pipeline (corpus indexing, byte-level sliding-window tokenization, Shannon information-gain calculations, n-gram optimization, and Aho-Corasick DFA state-machine construction). The reference langid.py implementation uses the Python scientific stack (numpy and scipy) for these calculations. A native Go training pipeline will be added once suitable Go libraries for NLP, matrix computation, and scanner compilation are identified.
  • Direct Legacy Model Loading (.model files) is Out of Scope: Original models produced by Python are base64-encoded, bz2-compressed Python 2 pickle files. Reading Python pickles directly in Go is fragile, insecure, and computationally expensive.

Instead, custom models are trained using the reference Python pipeline and converted to the highly-optimized, type-safe Go .lidg binary format. The provided conversion utility (scripts/convert_model.py) is modeled directly on and adapted from the original ldpy2ldc.py script in the langid.c package.

Custom Model Workflow

1. Train Your Model in Python

Use the official Python training toolkit located in the reference repository under langid.py/langid/train. To train a model from a corpus directory (where each subdirectory corresponds to a language code containing text documents), run:

python3 path/to/langid.py/langid/train/train.py -m /path/to/output_dir /path/to/corpus

This produces a legacy Python model file (e.g., my_custom.model).

2. Convert Your Model to Go .lidg Format

Convert the legacy pickle model into the highly-optimized .lidg binary format using the provided conversion utility:

python3 scripts/convert_model.py my_custom.model model/my_custom.lidg

3. Load Your Custom Model in Go

You can load and run your custom .lidg model programmatically or via the command line.

Programmatically:

package main

import (
	"fmt"
	"github.com/ilpy20/langid-go"
)

func main() {
	id, err := langid.LoadModel("model/my_custom.lidg")
	if err != nil {
		panic(err)
	}

	res, err := id.IdentifyString("This text will be classified by your custom model")
	if err != nil {
		panic(err)
	}
	fmt.Printf("Language: %s (Log Score: %.2f)\n", res.Language, res.Score)
}

Via the CLI: Specify your custom model using the -m or --model flag:

./langid -m model/my_custom.lidg <<< "This text will be classified by your custom model"

Acknowledgements and References

langid-go is a port of the Naive Bayes / DFA language identification algorithm originally created by Marco Lui and Timothy Baldwin.

  • [1] Lui, Marco and Timothy Baldwin (2011). Cross-domain Feature Selection for Language Identification. In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561.
  • [2] Lui, Marco and Timothy Baldwin (2012). langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea.

Documentation

Overview

Package langid provides a high-performance, zero-allocation, zero-dependency natural language identifier supporting 97 languages. It is a pure Go port of the popular langid.py and langid.c tools, achieving exact mathematical parity with their Naive Bayes classifiers and DFA state transition engines.

The package is completely CGO-free, making it simple to cross-compile and safe for highly concurrent production pipelines.
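
The Naive Bayes decision rule behind the classifier can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: the real tool extracts byte n-gram features with a DFA, whereas this example takes pre-extracted string features, and the toyModel type and all of its weights are invented for illustration.

```go
package main

import "fmt"

// toyModel is a hypothetical, hand-written model. Each feature maps to
// one log-likelihood per language, in the same order as langs.
type toyModel struct {
	langs   []string
	prior   []float64            // log P(lang)
	weights map[string][]float64 // feature -> log P(feature | lang)
}

// classify sums the log prior and the log-likelihood of every observed
// feature, then returns the language with the highest total.
func (m toyModel) classify(features []string) (string, float64) {
	scores := make([]float64, len(m.langs))
	copy(scores, m.prior)
	for _, f := range features {
		w, ok := m.weights[f]
		if !ok {
			continue // unknown features contribute nothing
		}
		for i := range scores {
			scores[i] += w[i]
		}
	}
	best := 0
	for i := range scores {
		if scores[i] > scores[best] {
			best = i
		}
	}
	return m.langs[best], scores[best]
}

func main() {
	m := toyModel{
		langs: []string{"en", "fr"},
		prior: []float64{-0.7, -0.7},
		weights: map[string][]float64{
			"the":   {-1, -5},
			"le":    {-6, -1},
			"monde": {-7, -2},
		},
	}
	lang, _ := m.classify([]string{"le", "monde"})
	fmt.Println(lang) // prints fr
}
```

The returned total is a raw log score, which is why unnormalized scores are large negative numbers that grow in magnitude with input length.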


Functions

func Normalize

func Normalize(results []Result)

Normalize converts a list of raw log-probabilities into a proper probability distribution (0.0 to 1.0).
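
One standard way to turn log scores into probabilities is a softmax computed relative to the largest score, which avoids underflow for values such as -105.1. The sketch below shows that technique; it is an assumption that Normalize computes exactly this, and the lowercase result type and normalize function are stand-ins defined for the example.

```go
package main

import (
	"fmt"
	"math"
)

// result is a stand-in for langid.Result, defined here so the example
// compiles on its own.
type result struct {
	Language string
	Score    float64
}

// normalize rewrites raw log scores in place as probabilities that sum
// to 1. Subtracting the largest score first keeps math.Exp from
// underflowing to zero for every entry.
func normalize(results []result) {
	if len(results) == 0 {
		return
	}
	top := results[0].Score
	for _, r := range results[1:] {
		if r.Score > top {
			top = r.Score
		}
	}
	var sum float64
	for i := range results {
		results[i].Score = math.Exp(results[i].Score - top)
		sum += results[i].Score
	}
	for i := range results {
		results[i].Score /= sum
	}
}

func main() {
	results := []result{{"en", -105.1}, {"fr", -240.2}}
	normalize(results)
	for _, r := range results {
		fmt.Printf("%s %.4f\n", r.Language, r.Score)
	}
}
```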

Types

type Identifier

type Identifier struct {
	// contains filtered or unexported fields
}

Identifier classifies text by language using a pre-trained model.

func LoadModel

func LoadModel(path string) (*Identifier, error)

LoadModel reads a .lidg model file.

func NewDefaultIdentifier

func NewDefaultIdentifier() (*Identifier, error)

NewDefaultIdentifier loads the embedded default model (ldpy).

func (*Identifier) Classes

func (id *Identifier) Classes() []string

Classes returns the active language classes supported by the identifier.

func (*Identifier) IdentifyBytes

func (id *Identifier) IdentifyBytes(text []byte) (Result, error)

IdentifyBytes predicts a language label for bytes.

func (*Identifier) IdentifyFile

func (id *Identifier) IdentifyFile(path string) (Result, error)

IdentifyFile reads the file at the specified path and predicts its language. If reading the file fails, it returns the wrapped filesystem error without swallowing context.

func (*Identifier) IdentifyString

func (id *Identifier) IdentifyString(text string) (Result, error)

IdentifyString predicts a language label for text.

func (*Identifier) KeepOnly deprecated

func (id *Identifier) KeepOnly(langs ...string) error

KeepOnly restricts the identifier to a specific subset of languages.

Deprecated: Use SetLanguages instead, which has identical behavior with stricter language validation and support for resetting subsets.

func (*Identifier) RankBytes

func (id *Identifier) RankBytes(text []byte) ([]Result, error)

RankBytes returns a sorted list of all languages and their raw log scores.

func (*Identifier) RankFile

func (id *Identifier) RankFile(path string) ([]Result, error)

RankFile reads the file at the specified path and ranks all supported languages by likelihood. If reading the file fails, it returns the wrapped filesystem error without swallowing context.

func (*Identifier) RankString

func (id *Identifier) RankString(text string) ([]Result, error)

RankString returns a sorted list of all languages and their raw log scores.

func (*Identifier) ResetLanguages

func (id *Identifier) ResetLanguages()

ResetLanguages restores the active language set of the identifier to include all languages present in the original loaded model.

func (*Identifier) SetLanguages

func (id *Identifier) SetLanguages(langs ...string) error

SetLanguages restricts the active language set of the identifier to the specified subset. If langs is empty or nil, it resets the active languages to the original model languages. If any requested language is not supported by the model, it returns an error and leaves the active language set unmodified (atomic operation).
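
The all-or-nothing behaviour described above amounts to validating every requested code before changing any state. The sketch below illustrates that pattern with a hypothetical subsetter type; it is not the library's implementation.

```go
package main

import "fmt"

// subsetter is a hypothetical stand-in for the identifier's language
// bookkeeping: all holds the model's languages, active the current subset.
type subsetter struct {
	all    []string
	active []string
}

// setLanguages validates every requested code first and only then swaps
// the active set, so a failed call leaves the previous subset intact.
// Calling it with no arguments restores the full model language list.
func (s *subsetter) setLanguages(langs ...string) error {
	if len(langs) == 0 {
		s.active = append([]string(nil), s.all...)
		return nil
	}
	known := make(map[string]bool, len(s.all))
	for _, l := range s.all {
		known[l] = true
	}
	for _, l := range langs {
		if !known[l] {
			return fmt.Errorf("unsupported language %q", l)
		}
	}
	s.active = append([]string(nil), langs...)
	return nil
}

func main() {
	s := &subsetter{all: []string{"en", "fr", "de"}}
	_ = s.setLanguages("en", "fr")
	err := s.setLanguages("en", "xx") // "xx" is not in the model
	fmt.Println(err != nil, s.active) // prints true [en fr]
}
```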

type Result

type Result struct {
	Language string
	Score    float64
}

Result pairs a predicted language with its score. The score is a raw log score unless the result has been passed through Normalize.

func Classify

func Classify(text string) (Result, error)

Classify uses a lazily-initialized embedded default model.
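
Lazy, concurrency-safe initialization of a package-level default is commonly written with sync.Once. The sketch below shows that pattern; whether Classify uses sync.Once specifically is an assumption, and the identifier type, loadEmbedded function, and loads counter are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// identifier is a placeholder for a loaded model.
type identifier struct{ name string }

var (
	once       sync.Once
	defaultID  *identifier
	defaultErr error
	loads      int // counts how often the model was actually loaded
)

// loadEmbedded stands in for decoding the embedded model bytes.
func loadEmbedded() (*identifier, error) {
	loads++
	return &identifier{name: "default"}, nil
}

// defaultIdentifier loads the model on first use and returns the cached
// value (and cached error, if any) on every later call.
func defaultIdentifier() (*identifier, error) {
	once.Do(func() {
		defaultID, defaultErr = loadEmbedded()
	})
	return defaultID, defaultErr
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := defaultIdentifier(); err != nil {
				fmt.Println("error:", err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("model loads:", loads) // prints model loads: 1
}
```

One consequence of this pattern is that a load failure is also cached, so every later call returns the same error.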

func IdentifyFile

func IdentifyFile(path string) (Result, error)

IdentifyFile reads the file at the specified path and predicts its language using the default identifier.

func RankFile

func RankFile(path string) ([]Result, error)

RankFile reads the file at the specified path and ranks all supported languages by likelihood using the default identifier.

Directories

  • cmd/langid (command)
  • internal
