xunicode

module
v0.0.0-...-e34c6fc Latest
Published: Mar 10, 2026 License: Apache-2.0

README

XUnicode

XUnicode is a library for Unicode text segmentation. It implements the Unicode Text Segmentation algorithms defined in UAX #29 and the Unicode Line Breaking Algorithm defined in UAX #14. The library provides functions for segmenting text into grapheme clusters, words, and sentences, and for finding line break opportunities.

Features

  • Zero-allocation text segmentation
  • Locale tailoring support
  • CSS-style line break rules

Usage

package main

import (
	"fmt"

	"charm.land/xunicode/grapheme"
	"charm.land/xunicode/line"
	"charm.land/xunicode/sentence"
	"charm.land/xunicode/word"
)

func main() {
	// Each segmenter gets its own variable, since the four packages are
	// not guaranteed to share a single return type.
	input := []byte("Hello, 世界! 👨‍👩‍👧‍👦")

	// Grapheme clusters: positions are byte offsets into input.
	graphemes := grapheme.NewSegmenter(input)
	for graphemes.Next() {
		start, end := graphemes.Position()
		fmt.Printf("[%d:%d] %q\n", start, end, graphemes.Text())
	}
	// Output:
	// [0:1] "H"
	// [1:2] "e"
	// [2:3] "l"
	// [3:4] "l"
	// [4:5] "o"
	// [5:6] ","
	// [6:7] " "
	// [7:10] "世"
	// [10:13] "界"
	// [13:14] "!"
	// [14:15] " "
	// [15:40] "👨\u200d👩\u200d👧\u200d👦"

	// Words: punctuation and spaces are reported as their own segments.
	input = []byte("Hello, World!")
	words := word.NewSegmenter(input)
	for words.Next() {
		fmt.Printf("%q\n", words.Text())
	}
	// Output:
	// "Hello"
	// ","
	// " "
	// "World"
	// "!"

	// Sentences: trailing whitespace stays with the preceding sentence.
	input = []byte("First. Second! Third?")
	sentences := sentence.NewSegmenter(input)
	for sentences.Next() {
		start, end := sentences.Position()
		fmt.Printf("[%d:%d] %q\n", start, end, sentences.Text())
	}
	// Output:
	// [0:7] "First. "
	// [7:15] "Second! "
	// [15:21] "Third?"

	// Line breaks: each segment ends at a break opportunity.
	input = []byte("Hello, world! How are you?")
	lines := line.NewSegmenter(input)
	for lines.Next() {
		start, end := lines.Position()
		fmt.Printf("[%d:%d] %q\n", start, end, lines.Text())
	}
	// Output:
	// [0:7] "Hello, "
	// [7:14] "world! "
	// [14:18] "How "
	// [18:22] "are "
	// [22:26] "you?"
}

Contributing

See contributing.

Feedback

We’d love to hear your thoughts on this project. Feel free to drop us a note!

Acknowledgments

XUnicode is inspired by and based on the work of the Go team, the Unicode Consortium, and many Go libraries that have implemented Unicode text segmentation. These include, but are not limited to:

  • github.com/blevesearch/segment: A Go library for text segmentation that includes support for Unicode text segmentation algorithms.
  • github.com/clipperhouse/uax14: A Go library that implements the Unicode Line Breaking Algorithm as defined in UAX#14.
  • github.com/clipperhouse/uax29: A Go library that implements the Unicode Text Segmentation algorithms as defined in UAX#29.
  • github.com/rivo/uniseg: A Go library for Unicode text segmentation that provides a simple API for segmenting text into grapheme clusters, words, sentences, and line breaks.
  • github.com/unicode-org/icu4x: A project that provides a set of libraries for Unicode support in Rust, including text segmentation algorithms.
  • github.com/unicode-org/icu: The International Components for Unicode (ICU) project, which provides a comprehensive set of libraries for Unicode support, including text segmentation algorithms.
  • golang.org/x/text: The Go team's official text processing library.

License

BSD and Apache 2.0


Part of Charm.

Charm热爱开源 • Charm loves open source • نحنُ نحب المصادر المفتوحة

Directories

Path                   Synopsis
grapheme               Package grapheme implements Unicode grapheme cluster segmentation as defined by UAX #29.
internal/gen           Package gen contains common code for the various code generation tools in the text repository.
internal/gen/bitfield  Package bitfield converts annotated structs into integer values.
internal/segmenter     Package segmenter implements a state machine engine for Unicode text segmentation (UAX #29 and UAX #14).
internal/testtext      Package testtext contains test data that is of common use to the text repository.
internal/triegen       Package triegen implements a code generator for a trie for associating unsigned integer values with UTF-8 encoded runes.
internal/ucd           Package ucd provides a parser for Unicode Character Database files, the format of which is defined in https://www.unicode.org/reports/tr44/.
line                   Package line implements Unicode line break segmentation as defined by UAX #14.
sentence               Package sentence implements Unicode sentence segmentation as defined by UAX #29.
word                   Package word implements Unicode word segmentation as defined by UAX #29.
