xunicode

Version: v0.0.0-...-e34c6fc (latest) | Published: Mar 10, 2026 | License: Apache-2.0

Repository

github.com/charmbracelet/xunicode

README

XUnicode

XUnicode is a Go library for Unicode text segmentation. It implements the Unicode Text Segmentation algorithm (UAX #29) and the Unicode Line Breaking Algorithm (UAX #14). The library provides functions for splitting text into grapheme clusters, words, sentences, and line break opportunities.
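
To see why grapheme segmentation matters, the sketch below compares byte and code point counts for the family emoji used in the usage example. It relies only on the Go standard library, not on XUnicode, and the helper name `counts` is a hypothetical chosen for illustration. Neither count matches what a reader perceives as one character, which is the gap a UAX #29 grapheme segmenter is meant to close.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of bytes and Unicode code points in s.
// Hypothetical helper for illustration; it is not part of XUnicode.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Four emoji (4 bytes each in UTF-8) joined by three
	// zero-width joiners U+200D (3 bytes each).
	family := "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"

	b, r := counts(family)
	fmt.Printf("bytes=%d code points=%d\n", b, r)
	// prints: bytes=25 code points=7
}
```

The 25 bytes match the `[15:40]` span that the grapheme segmenter reports for this emoji in the usage example.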

Features

Zero-allocation text segmentation
Locale tailoring support
CSS-style line break rules

Usage

package main

import (
	"fmt"

	"charm.land/xunicode/grapheme"
	"charm.land/xunicode/line"
	"charm.land/xunicode/sentence"
	"charm.land/xunicode/word"
)

func main() {
	input := []byte("Hello, 世界! 👨‍👩‍👧‍👦")

	seg := grapheme.NewSegmenter(input)
	for seg.Next() {
		start, end := seg.Position()
		fmt.Printf("[%d:%d] %q\n", start, end, seg.Text())
	}
	// Output:
	// [0:1] "H"
	// [1:2] "e"
	// [2:3] "l"
	// [3:4] "l"
	// [4:5] "o"
	// [5:6] ","
	// [6:7] " "
	// [7:10] "世"
	// [10:13] "界"
	// [13:14] "!"
	// [14:15] " "
	// [15:40] "👨\u200d👩\u200d👧\u200d👦"

	input = []byte("Hello, World!")

	// Use a separate variable: each package may return its own segmenter type.
	words := word.NewSegmenter(input)
	for words.Next() {
		fmt.Printf("%q\n", words.Text())
	}
	// Output:
	// "Hello"
	// ","
	// " "
	// "World"
	// "!"

	input = []byte("First. Second! Third?")

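	// Use a separate variable: each package may return its own segmenter type.
	sentences := sentence.NewSegmenter(input)
	for sentences.Next() {
		start, end := sentences.Position()
		fmt.Printf("[%d:%d] %q\n", start, end, sentences.Text())
	}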
	// Output:
	// [0:7] "First. "
	// [7:15] "Second! "
	// [15:21] "Third?"

	input = []byte("Hello, world! How are you?")
	// Use a separate variable: each package may return its own segmenter type.
	lines := line.NewSegmenter(input)
	for lines.Next() {
		start, end := lines.Position()
		fmt.Printf("[%d:%d] %q\n", start, end, lines.Text())
	}
	// Output:
	// [0:7] "Hello, "
	// [7:14] "world! "
	// [14:18] "How "
	// [18:22] "are "
	// [22:26] "you?"
}
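
The line segments shown in the usage example mark break opportunities, not final line breaks. As a minimal sketch of how a caller might use them, the following greedy wrapper packs segments into lines up to a width limit. It uses only the standard library and takes the segments as a plain slice of strings instead of calling XUnicode. The name `wrap` is hypothetical, and width is measured in bytes for simplicity, where a real layout would likely measure display width.

```go
package main

import (
	"fmt"
	"strings"
)

// wrap greedily joins segments into lines of at most width bytes.
// Trailing spaces are trimmed and do not count toward the width.
// A single segment longer than width is placed on its own line.
// Hypothetical helper for illustration; it is not part of XUnicode.
func wrap(segments []string, width int) []string {
	var lines []string
	var cur strings.Builder
	for _, seg := range segments {
		visible := len(strings.TrimRight(seg, " "))
		// Start a new line if this segment would overflow the current one.
		if cur.Len() > 0 && cur.Len()+visible > width {
			lines = append(lines, strings.TrimRight(cur.String(), " "))
			cur.Reset()
		}
		cur.WriteString(seg)
	}
	if cur.Len() > 0 {
		lines = append(lines, strings.TrimRight(cur.String(), " "))
	}
	return lines
}

func main() {
	// Segments as printed by the line segmenter in the usage example.
	segments := []string{"Hello, ", "world! ", "How ", "are ", "you?"}
	for _, l := range wrap(segments, 14) {
		fmt.Printf("%q\n", l)
	}
	// prints:
	// "Hello, world!"
	// "How are you?"
}
```

A greedy strategy is the simplest choice here; it never revisits an earlier line, so it can produce ragged right edges that an optimizing line breaker would avoid.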

Contributing

See contributing.

Feedback

We’d love to hear your thoughts on this project. Feel free to drop us a note!

Acknowledgments

XUnicode is inspired by and based on the work of the Go team, the Unicode Consortium, and many libraries that implement Unicode text segmentation. These include, but are not limited to:

github.com/blevesearch/segment: A Go library for text segmentation that includes support for Unicode text segmentation algorithms.
github.com/clipperhouse/uax14: A Go library that implements the Unicode Line Breaking Algorithm as defined in UAX#14.
github.com/clipperhouse/uax29: A Go library that implements the Unicode Text Segmentation algorithms as defined in UAX#29.
github.com/rivo/uniseg: A Go library for Unicode text segmentation that provides a simple API for segmenting text into grapheme clusters, words, sentences, and line breaks.
github.com/unicode-org/icu4x: A project that provides a set of libraries for Unicode support in Rust, including text segmentation algorithms.
github.com/unicode-org/icu: The International Components for Unicode (ICU) project, which provides a comprehensive set of libraries for Unicode support, including text segmentation algorithms.
golang.org/x/text: The Go team's official text processing library.

License

BSD and Apache 2.0

Part of Charm.

Charm热爱开源 • Charm loves open source • نحنُ نحب المصادر المفتوحة

Directories

Path	Synopsis
grapheme	Package grapheme implements Unicode grapheme cluster segmentation as defined by UAX #29.
internal/gen	Package gen contains common code for the various code generation tools in the text repository.
internal/gen/bitfield	Package bitfield converts annotated structs into integer values.
internal/segmenter	Package segmenter implements a state machine engine for Unicode text segmentation (UAX #29 and UAX #14).
internal/testtext	Package testtext contains test data that is of common use to the text repository.
internal/triegen	Package triegen implements a code generator for a trie for associating unsigned integer values with UTF-8 encoded runes.
internal/ucd	Package ucd provides a parser for Unicode Character Database files, the format of which is defined in https://www.unicode.org/reports/tr44/.
line	Package line implements Unicode line break segmentation as defined by UAX #14.
sentence	Package sentence implements Unicode sentence segmentation as defined by UAX #29.
word	Package word implements Unicode word segmentation as defined by UAX #29.
