zemberek-go

Version: v0.1.4 (latest). Published: Nov 9, 2025. License: Apache-2.0

Repository

github.com/kalaomer/zemberek-go

Zemberek-Go

Go implementation of the original zemberek-nlp Java library for Turkish language processing.

Features

Currently, the following modules have been ported:

Core

Turkish alphabet and phonetic attributes
Multi-level perfect hash functions and compression primitives
Text utilities for casing, diacritics and token helpers
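Casing is where Turkish text handling most often goes wrong, because dotted and dotless i are distinct letters (i/İ and ı/I). The sketch below does not use this module's API; it only illustrates the rule with the Go standard library (unicode.TurkishCase), and the helper names upperTR and lowerTR are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperTR upper-cases s with Turkish rules: i -> İ and ı -> I.
// Hypothetical helper, not part of zemberek-go.
func upperTR(s string) string {
	return strings.ToUpperSpecial(unicode.TurkishCase, s)
}

// lowerTR lower-cases s with Turkish rules: I -> ı and İ -> i.
// Hypothetical helper, not part of zemberek-go.
func lowerTR(s string) string {
	return strings.ToLowerSpecial(unicode.TurkishCase, s)
}

func main() {
	fmt.Println(upperTR("istanbul"))         // İSTANBUL
	fmt.Println(lowerTR("IŞIK"))             // ışık
	fmt.Println(strings.ToUpper("istanbul")) // ISTANBUL (locale-unaware)
}
```

The last line shows why locale-unaware casing is unsuitable for Turkish: the dot on the capital İ is lost.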

Tokenization

Token/span types and sentence boundary detection

Language Model (LM)

Compressed vocabulary and n‑gram accessors
SmoothLM reader with MPHFs

Morphology

Binary lexicon loader and dictionary items
Morphotactics graph, analysis and generation helpers

Normalization

Full sentence normalizer with spell checker and LM ranking
Deasciifier and ASCII-tolerant utilities
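To make "ASCII-tolerant" concrete: such utilities typically treat a word typed without Turkish diacritics as matching its properly spelled form. The following is a minimal standalone sketch of that idea using only the standard library. The names asciiFold, toASCII and asciiEqual are hypothetical and are not taken from this module.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiFold maps each Turkish-specific letter to its closest ASCII letter.
// Hypothetical table for illustration only.
var asciiFold = strings.NewReplacer(
	"ç", "c", "Ç", "C",
	"ğ", "g", "Ğ", "G",
	"ı", "i", "İ", "I",
	"ö", "o", "Ö", "O",
	"ş", "s", "Ş", "S",
	"ü", "u", "Ü", "U",
)

// toASCII strips Turkish diacritics from s.
func toASCII(s string) string {
	return asciiFold.Replace(s)
}

// asciiEqual reports whether a and b are equal once diacritics are ignored.
func asciiEqual(a, b string) bool {
	return toASCII(a) == toASCII(b)
}

func main() {
	fmt.Println(toASCII("gideceğim"))        // gidecegim
	fmt.Println(asciiEqual("şeker", "seker")) // true
	fmt.Println(asciiEqual("kuş", "kış"))     // false
}
```

Folding to ASCII is the easy direction. A deasciifier goes the other way (restoring "gidecegim" to "gideceğim"), which needs context or a lexicon because the mapping is ambiguous.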

Installation

go get github.com/kalaomer/zemberek-go

Usage

package main

import (
    "fmt"
    "log"

    "github.com/kalaomer/zemberek-go/core/turkish"
    "github.com/kalaomer/zemberek-go/tokenization"
)

func main() {
    // Query the shared Turkish alphabet instance.
    alphabet := turkish.Instance
    fmt.Println("Is 'ı' a vowel?", alphabet.IsVowel('ı'))

    // Split a paragraph into sentences.
    // Assumes the second return value is an error, as with the normalizer constructor.
    extractor, err := tokenization.NewTurkishSentenceExtractor(false, "")
    if err != nil {
        log.Fatalf("sentence extractor init: %v", err)
    }
    sentences := extractor.FromParagraph("Merhaba dünya! Bu bir test cümlesidir.")
    for _, sentence := range sentences {
        fmt.Println(sentence)
    }
}

Sentence normalization

package main

import (
    "fmt"
    "log"

    "github.com/kalaomer/zemberek-go/morphology"
    "github.com/kalaomer/zemberek-go/normalization"
)

func main() {
    morph := morphology.CreateWithDefaults()
    normalizer, err := normalization.NewTurkishSentenceNormalizerAdvanced(morph, "data")
    if err != nil {
        log.Fatalf("normalizer init: %v", err)
    }

    input := "Yrn okua gidicem"
    fmt.Println(normalizer.Normalize(input))
}

Dependencies

Go 1.18 or higher
Standard library only (no external dependencies for core functionality)

Resource data

Language resources (lexicon binaries, normalization tables, language models) are expected under data/ by default. If you keep them elsewhere, set the ZEMBEREK_DATA_ROOT environment variable to their absolute path (for example, export ZEMBEREK_DATA_ROOT=/absolute/path/to/your/data) so that both the examples and the advanced normalizer can locate them.

Example data bundles (LM and normalization folders) are available here: https://drive.google.com/drive/folders/1tztjRiUs9BOTH-tb1v7FWyixl-iUpydW. Download the archive, extract it to a directory of your choice, and point ZEMBEREK_DATA_ROOT to that directory before running the examples.

Development Status

The port follows zemberek-nlp’s architecture module by module. Core components, tokenization, lexicon handling, language model loading and advanced normalization are functional; remaining work focuses on fine-tuning morphology generation/ambiguity resolution and extending test coverage as the Java baseline evolves.

Notes

This port mirrors the Java implementation’s architecture while adapting to Go idioms:

Java classes → Go structs/interfaces
Java enums → Go iota constants
Immutable data → Go value types and generated readers
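As an illustration of the enum mapping above, a Java enum is commonly expressed in Go as a named integer type with iota constants and a String method. The type and constant names below are hypothetical and chosen only for the example; they are not copied from this repository.

```go
package main

import "fmt"

// PrimaryPos is a hypothetical part-of-speech enum, standing in for a Java enum.
type PrimaryPos int

const (
	Noun PrimaryPos = iota // 0
	Verb                   // 1
	Adjective              // 2
)

// primaryPosNames holds the display name for each constant, indexed by value.
var primaryPosNames = [...]string{"Noun", "Verb", "Adjective"}

// String implements fmt.Stringer so values print by name, like Java's Enum.name().
func (p PrimaryPos) String() string {
	if p < 0 || int(p) >= len(primaryPosNames) {
		return fmt.Sprintf("PrimaryPos(%d)", int(p))
	}
	return primaryPosNames[p]
}

func main() {
	fmt.Println(Verb)        // Verb
	fmt.Printf("%d\n", Verb) // 1
}
```

Unlike a Java enum, this type is not closed: any int can be converted to PrimaryPos, which is why String handles out-of-range values.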

Credits

Original Java implementation: zemberek-nlp by Ahmet A. Akın
Go port: This repository and its contributors

License

Apache License 2.0

Contributing

Contributions are welcome! This is a large codebase, and help with porting the remaining modules would be appreciated.

Directories

core
core/compression
core/data
core/hash
core/quantization
core/text
core/turkish
core/utils
lm
lm/compression
morphology
morphology/analysis
morphology/generator
morphology/lexicon
morphology/lexicon/proto
morphology/morphotactics
normalization
normalization/deasciifier
sqlite_extension (module)
tokenization
