langid

Version: v0.1.1
Published: Jun 18, 2026 License: BSD-2-Clause Imports: 8 Imported by: 0

README

langid-go


langid-go is a high-performance natural language identifier for Go, pre-trained on 97 languages. It pools scratch buffers with sync.Pool and tracks feature hits in a flat-array "sparse set", so the inference hot loop performs no heap allocations. This makes the port fast and well suited to high-throughput stream processing.
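
The pooling and sparse-set ideas mentioned above can be illustrated with a minimal sketch. This is not the library's internal code: the names sparseSet, scratchPool, distinctFeatures, and the numFeatures size are hypothetical, chosen only to show how a pooled flat-array counter can be reused without allocating on each call once the pool is warm.

```go
package main

import (
	"fmt"
	"sync"
)

// numFeatures is a hypothetical feature-space size chosen for this sketch.
const numFeatures = 16

// sparseSet counts feature hits in flat arrays. Resetting costs time
// proportional to the number of touched IDs, not to numFeatures.
type sparseSet struct {
	counts []uint32 // counts[id] is the hit count for feature id
	dense  []int    // IDs touched since the last reset
}

func newSparseSet(n int) *sparseSet {
	return &sparseSet{
		counts: make([]uint32, n),
		dense:  make([]int, 0, n), // full capacity up front, so add never grows it
	}
}

// add records one hit for id, which must be in [0, len(s.counts)).
func (s *sparseSet) add(id int) {
	if s.counts[id] == 0 {
		s.dense = append(s.dense, id)
	}
	s.counts[id]++
}

// reset clears only the entries that were touched.
func (s *sparseSet) reset() {
	for _, id := range s.dense {
		s.counts[id] = 0
	}
	s.dense = s.dense[:0]
}

// scratchPool hands out reusable sparse sets to concurrent callers.
var scratchPool = sync.Pool{
	New: func() any { return newSparseSet(numFeatures) },
}

// distinctFeatures reports how many distinct feature IDs occur in ids,
// borrowing its scratch space from the pool instead of allocating it.
func distinctFeatures(ids []int) int {
	s := scratchPool.Get().(*sparseSet)
	for _, id := range ids {
		s.add(id)
	}
	n := len(s.dense)
	s.reset() // return the set clean so the next caller starts from zero
	scratchPool.Put(s)
	return n
}

func main() {
	fmt.Println(distinctFeatures([]int{3, 7, 3, 1, 7}))
	fmt.Println(distinctFeatures([]int{2, 2}))
}
```

Resetting through the dense list rather than zeroing the whole counts array is the point of the sparse set: short inputs touch few features, so cleanup stays cheap even when the feature space is large.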


Key Features

  • Pre-trained on 97 languages (using standard ISO 639-1 two-letter codes).
  • Embedded Model: Completely CGO-free, requiring no external file or runtime dependencies. The default model is compiled directly into the binary via go:embed.
  • Zero-Allocation Hot Loop: Hot execution paths allocate nothing, so they add no garbage collection (GC) pressure. Designed for real-time chatbot gateways and NLP pipelines.
  • Language Subsetting: Restrict predictions to a known subset of languages for higher accuracy and faster classification.
  • Probability Normalization: Converts raw log scores into a probability distribution with values between 0.0 and 1.0.
  • Feature-Rich CLI: Supports interactive REPL, document, line-by-line, batch, URL, and local HTTP service modes.

Quick Start

Installation
go get github.com/ilpy20/langid-go
Basic Library Usage
package main

import (
	"fmt"
	"log"

	langid "github.com/ilpy20/langid-go"
)

func main() {
	// Identify text using the default embedded model.
	res, err := langid.Classify("This is a short English sentence")
	if err != nil {
		log.Fatalf("classify: %v", err)
	}
	// res.Language is expected to be "en" here; res.Score is the raw (unnormalized) log score.
	fmt.Printf("Predicted Language: %s (Log Score: %.2f)\n", res.Language, res.Score)
}

Documentation Index

Explore our comprehensive guides for specialized usage patterns and architectural deep dives:

| Document | Description |
|---|---|
| 📖 Library API Guide | Detailed programmatic guide to subsetting, score normalization, file helpers, urlclass, and service APIs. |
| 💻 CLI Usage Guide | How to build and run the CLI for REPL, batch file streaming, URL parsing, and running the HTTP microservice. |
| 🧠 Model Training & Conversion | Technical decisions and a step-by-step workflow for training custom models in Python and compiling them to .lidg binary files. |
| Architecture & Motivation | Why langid-go exists, a comparison with alternative libraries (lingua-go, whatlanggo), and a deep dive into zero-allocation pooling. |

Compatibility Matrix

langid-go is designed as a production-ready superset of the inference features offered by the original ecosystem libraries. Training tools are the one gap and are planned as a future feature:

| Feature | langid.py | langid.c | langid.js | langid-go |
|---|---|---|---|---|
| Default 97-language model | yes | yes | yes | yes |
| Custom model loading | yes | yes | via generated JS | yes (via .lidg binary) |
| Classify text | yes | yes | yes | yes |
| Return raw log-score | yes | no | yes (as ranks) | yes |
| Rank all languages | yes | no | yes | yes |
| Normalize probabilities | yes | no | no | yes |
| Language subsetting | yes | no | no | yes |
| File helper API | yes | CLI only | no | yes |
| CLI document mode | yes | yes | no | yes |
| CLI line mode | yes | yes | no | yes |
| CLI batch mode | yes | yes | no | yes |
| Python-compatible flags | yes | partial | no | yes |
| URL classification | yes | no | no | yes |
| HTTP service mode | yes | no | no | yes |
| Web browser demo | yes | no | yes | yes |
| Training tools | yes | no | no | Planned Future Feature (TODO) |

Contributing

We welcome community feedback, issue reports, and pull requests! Please feel free to open a ticket on our GitHub issues page. See our planned roadmap features in TODO.md.


Acknowledgements and References

langid-go is a port of the Naive Bayes / DFA language identification algorithm originally created by Marco Lui and Timothy Baldwin.

  • [1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553–561.
  • [2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea.

License

This project is licensed under the BSD-2-Clause License. See the LICENSE file for details.

Documentation

Overview

Package langid provides a high-performance natural language identifier library supporting 97 languages. It is a pure Go runtime port of the langid inference stack, initially derived from langid.c and later expanded for parity with langid.js and langid.py.

The package ports the Naive Bayes/DFA inference path, not the original training pipeline. It is CGO-free, making it simple to cross-compile and safe for highly concurrent production pipelines.
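
To make the Naive Bayes half of that inference path concrete, here is a minimal sketch of log-space scoring over feature counts. It assumes the counts have already been produced by the DFA scan, and it uses a toy two-class model; the model type, its field layout, and the numbers are hypothetical and do not describe the .lidg format or the library's internals.

```go
package main

import (
	"fmt"
	"math"
)

// model is a toy Naive Bayes model with weights stored in one flat slice.
type model struct {
	classes []string
	prior   []float64 // log prior per class
	weights []float64 // row-major [feature][class] log-likelihoods
}

// classify returns the class with the highest log score for the given
// per-feature hit counts: score(c) = prior(c) + sum over f of counts[f]*weights[f][c].
func (m *model) classify(counts []int) (string, float64) {
	k := len(m.classes)
	best, bestScore := 0, math.Inf(-1)
	for c := 0; c < k; c++ {
		score := m.prior[c]
		for f, n := range counts {
			if n != 0 {
				score += float64(n) * m.weights[f*k+c]
			}
		}
		if score > bestScore {
			best, bestScore = c, score
		}
	}
	return m.classes[best], bestScore
}

// toyModel builds a two-class, two-feature model for illustration only.
func toyModel() *model {
	return &model{
		classes: []string{"en", "de"},
		prior:   []float64{math.Log(0.5), math.Log(0.5)},
		weights: []float64{
			math.Log(0.8), math.Log(0.2), // feature 0 favors "en"
			math.Log(0.2), math.Log(0.8), // feature 1 favors "de"
		},
	}
}

func main() {
	lang, score := toyModel().classify([]int{3, 1})
	fmt.Printf("%s %.3f\n", lang, score)
}
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids floating-point underflow on long inputs and is why the results carry raw log scores rather than probabilities.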


Functions

func Normalize

func Normalize(results []Result)

Normalize converts a list of raw log-probabilities into a proper probability distribution (0.0 to 1.0).
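
One common way to turn log scores into a distribution is a softmax computed with the log-sum-exp trick. The sketch below assumes that approach; the documentation does not state which formula Normalize uses, so softmaxInPlace is a hypothetical stand-in, and the local Result type only mirrors the documented struct.

```go
package main

import (
	"fmt"
	"math"
)

// Result mirrors the documented struct so the sketch is self-contained.
type Result struct {
	Language string
	Score    float64
}

// softmaxInPlace rewrites each Score as exp(score) / sum(exp(scores)).
// Subtracting the largest score first keeps exp from underflowing to zero
// when raw log scores are large negative numbers.
func softmaxInPlace(results []Result) {
	if len(results) == 0 {
		return
	}
	maxScore := results[0].Score
	for _, r := range results[1:] {
		if r.Score > maxScore {
			maxScore = r.Score
		}
	}
	var sum float64
	for i := range results {
		results[i].Score = math.Exp(results[i].Score - maxScore)
		sum += results[i].Score
	}
	for i := range results {
		results[i].Score /= sum
	}
}

func main() {
	rs := []Result{
		{Language: "en", Score: -1000},
		{Language: "de", Score: -1001},
		{Language: "fr", Score: -1005},
	}
	softmaxInPlace(rs)
	for _, r := range rs {
		fmt.Printf("%s %.4f\n", r.Language, r.Score)
	}
}
```

Without the subtraction, exp(-1000) evaluates to 0 in float64 and every probability would become 0/0.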

Types

type Identifier

type Identifier struct {
	// contains filtered or unexported fields
}

Identifier classifies text by language using a pre-trained model.

func LoadModel

func LoadModel(path string) (*Identifier, error)

LoadModel reads a .lidg model file.

func NewDefaultIdentifier

func NewDefaultIdentifier() (*Identifier, error)

NewDefaultIdentifier loads the embedded default model (ldpy).

func (*Identifier) Classes

func (id *Identifier) Classes() []string

Classes returns the active language classes supported by the identifier.

func (*Identifier) IdentifyBytes

func (id *Identifier) IdentifyBytes(text []byte) (Result, error)

IdentifyBytes predicts a language label for bytes.

func (*Identifier) IdentifyFile

func (id *Identifier) IdentifyFile(path string) (Result, error)

IdentifyFile reads the file at the specified path and predicts its language. If reading the file fails, it returns the wrapped filesystem error without swallowing context.
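
Because the filesystem error is wrapped rather than replaced, callers can still test for specific causes with errors.Is. The sketch below shows the general wrapping pattern using only the standard library; readForIdentify and isNotExist are hypothetical helpers, and the exact message format the library uses is not documented.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// readForIdentify reads a file and wraps any failure with %w so the
// original filesystem error stays reachable through errors.Is / errors.As.
func readForIdentify(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", path, err)
	}
	return data, nil
}

// isNotExist reports whether err was caused by a missing file.
func isNotExist(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	_, err := readForIdentify("testdata/does-not-exist.txt")
	if isNotExist(err) {
		fmt.Println("file is missing:", err)
	}
}
```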

func (*Identifier) IdentifyString

func (id *Identifier) IdentifyString(text string) (Result, error)

IdentifyString predicts a language label for text.

func (*Identifier) KeepOnly deprecated

func (id *Identifier) KeepOnly(langs ...string) error

KeepOnly restricts the identifier to a specific subset of languages.

Deprecated: Use SetLanguages instead, which has identical behavior with stricter language validation and support for resetting subsets.

func (*Identifier) RankBytes

func (id *Identifier) RankBytes(text []byte) ([]Result, error)

RankBytes returns a sorted list of all languages and their raw log scores.
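
The description does not say which direction the list is sorted in. The sketch below assumes the most likely language comes first, meaning the highest raw log score leads; rankDescending and the local Result type are hypothetical and only illustrate that ordering.

```go
package main

import (
	"fmt"
	"sort"
)

// Result mirrors the documented struct so the sketch is self-contained.
type Result struct {
	Language string
	Score    float64
}

// rankDescending returns a copy of results ordered from highest to lowest
// score, leaving the input slice untouched.
func rankDescending(results []Result) []Result {
	ranked := append([]Result(nil), results...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	return ranked
}

func main() {
	ranked := rankDescending([]Result{
		{Language: "de", Score: -1210.4},
		{Language: "en", Score: -1180.9},
		{Language: "fr", Score: -1305.0},
	})
	for _, r := range ranked {
		fmt.Printf("%s %.1f\n", r.Language, r.Score)
	}
}
```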

func (*Identifier) RankFile

func (id *Identifier) RankFile(path string) ([]Result, error)

RankFile reads the file at the specified path and ranks all supported languages by likelihood. If reading the file fails, it returns the wrapped filesystem error without swallowing context.

func (*Identifier) RankString

func (id *Identifier) RankString(text string) ([]Result, error)

RankString returns a sorted list of all languages and their raw log scores.

func (*Identifier) ResetLanguages

func (id *Identifier) ResetLanguages()

ResetLanguages restores the active language set of the identifier to include all languages present in the original loaded model.

func (*Identifier) SetLanguages

func (id *Identifier) SetLanguages(langs ...string) error

SetLanguages restricts the active language set of the identifier to the specified subset. If langs is empty or nil, it resets the active languages to the original model languages. If any requested language is not supported by the model, it returns an error and leaves the active language set unmodified (atomic operation).
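
The atomic behavior described above follows a validate-then-commit pattern: check every requested language before changing any state. The sketch below shows that pattern on a hypothetical subsetter type; it is not the library's implementation, and the error text is an assumption.

```go
package main

import "fmt"

// subsetter is a hypothetical stand-in for the identifier's language state.
type subsetter struct {
	all    []string // every language in the loaded model
	active []string // the currently active subset
}

// setLanguages validates the whole request first and only then commits,
// so a failed call leaves active exactly as it was.
func (s *subsetter) setLanguages(langs ...string) error {
	if len(langs) == 0 {
		// An empty request resets to the full model.
		s.active = append([]string(nil), s.all...)
		return nil
	}
	known := make(map[string]bool, len(s.all))
	for _, l := range s.all {
		known[l] = true
	}
	for _, l := range langs {
		if !known[l] {
			return fmt.Errorf("unsupported language %q", l)
		}
	}
	s.active = append([]string(nil), langs...)
	return nil
}

func main() {
	s := &subsetter{all: []string{"en", "de", "fr"}}
	fmt.Println(s.setLanguages("en", "de"), s.active)
	fmt.Println(s.setLanguages("en", "xx"), s.active)
	fmt.Println(s.setLanguages(), s.active)
}
```

Committing only after validation means a typo in one language code cannot leave the identifier half-restricted in the middle of a running pipeline.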

type Result

type Result struct {
	Language string
	Score    float64
}

Result holds a predicted language class and its raw log score.

func Classify

func Classify(text string) (Result, error)

Classify predicts the language of text using a lazily initialized embedded default model.
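
Lazy initialization of a shared default is typically done in Go with sync.Once, so the model is loaded on first use and at most once even under concurrent calls. The sketch below assumes that approach; identifier, loadEmbedded, defaultIdentifier, and loadCount are hypothetical names, and the package's actual mechanism is not documented.

```go
package main

import (
	"fmt"
	"sync"
)

// identifier is a hypothetical stand-in for a loaded model.
type identifier struct {
	name string
}

var (
	defaultOnce sync.Once
	defaultID   *identifier
	defaultErr  error
	loadCount   int // counts how many times the loader actually ran
)

// loadEmbedded simulates decoding an embedded model, which is the
// expensive step that lazy initialization defers.
func loadEmbedded() (*identifier, error) {
	loadCount++
	return &identifier{name: "default"}, nil
}

// defaultIdentifier loads the model on first use and returns the same
// instance, and the same error if loading failed, on every later call.
func defaultIdentifier() (*identifier, error) {
	defaultOnce.Do(func() {
		defaultID, defaultErr = loadEmbedded()
	})
	return defaultID, defaultErr
}

func main() {
	a, _ := defaultIdentifier()
	b, _ := defaultIdentifier()
	fmt.Println(a == b, loadCount)
}
```

Programs that never call the package-level helpers pay nothing for the default model beyond the bytes embedded in the binary.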

func IdentifyFile

func IdentifyFile(path string) (Result, error)

IdentifyFile reads the file at the specified path and predicts its language using the default identifier.

func RankFile

func RankFile(path string) ([]Result, error)

RankFile reads the file at the specified path and ranks all supported languages by likelihood using the default identifier.

Directories

Path Synopsis
cmd
langid command
internal
