graphemes

package
v1.16.0
Published: Sep 6, 2025 License: MIT Imports: 2 Imported by: 13

README

An implementation of grapheme cluster boundaries from Unicode text segmentation (UAX 29), for Unicode version 15.0.0.

Quick start

go get "github.com/clipperhouse/uax29/graphemes"
import "github.com/clipperhouse/uax29/graphemes"

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := graphemes.NewSegmenter(text)        // A segmenter is an iterator over the graphemes

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current grapheme
}

if err := segments.Err(); err != nil {          // Check the error
	log.Fatal(err)
}

Documentation

A grapheme is a “single visible character”, which might be as simple as a single letter, or as complex as an emoji that consists of several Unicode code points. For our purposes, “segment”, “grapheme”, and “token” are used synonymously.
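
To make "several Unicode code points" concrete, here is a small sketch that uses only the standard library to count bytes and code points for strings that each render as one visible character. The sample strings and the helper name describe are illustrative assumptions, not part of this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length and code point count of s.
// Hypothetical helper for illustration; not part of the graphemes package.
func describe(s string) (nBytes, nCodePoints int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"e",                    // plain ASCII letter
		"e\u0301",              // "e" followed by U+0301 COMBINING ACUTE ACCENT
		"\U0001F1EF\U0001F1F5", // two regional indicator symbols (J, P)
		"\U0001F468\u200D\U0001F469\u200D\U0001F467", // man, ZWJ, woman, ZWJ, girl
	}
	for _, s := range samples {
		b, cp := describe(s)
		fmt.Printf("%q: %d bytes, %d code points\n", s, b, cp)
	}
}
```

Under the UAX 29 rules, each of these samples is a single grapheme cluster, even though the last three contain more than one code point. That gap between code points and visible characters is what the segmenter is meant to close.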

Conformance

We use the Unicode test suite to check conformance.

APIs

If you have a []byte

Use Segmenter for bounded memory and best performance:

text := []byte("Hello, 世界. Nice dog! 👍🐶")

segments := graphemes.NewSegmenter(text)        // A segmenter is an iterator over the graphemes

for segments.Next() {                           // Next() returns true until end of data or error
	fmt.Printf("%q\n", segments.Bytes())        // Do something with the current grapheme
}

if err := segments.Err(); err != nil {          // Check the error
	log.Fatal(err)
}

Use SegmentAll() if you prefer brevity, are not too concerned about allocations, or would be populating a [][]byte anyway.

text := []byte("Hello, 世界. Nice dog! 👍🐶")
segments := graphemes.SegmentAll(text)          // Returns a slice of byte slices; each slice is a grapheme

fmt.Printf("Graphemes: %q\n", segments)         // Printf, not Println, so the %q verb is applied

If you have an io.Reader

Use Scanner, which is a bufio.Scanner; the bufio.Scanner docs describe how to use it.

r := getYourReader()                            // from a file or network maybe
scanner := graphemes.NewScanner(r)

for scanner.Scan() {                            // Scan() returns true until error or EOF
	fmt.Println(scanner.Text())                 // Do something with the current grapheme
}

if err := scanner.Err(); err != nil {           // Check the error
	log.Fatal(err)
}

Performance

On a Mac laptop, we see around 70MB/s, which works out to around 70 million graphemes per second.

You should see approximately constant memory when using Segmenter or Scanner, independent of data size. When using SegmentAll(), expect memory to be O(n) on the number of graphemes.

Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to utf8.Valid().
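
One way to do that, sketched with the standard library only: check the input with utf8.Valid() and, if it fails, either reject it or repair it before segmenting. The helper name sanitize and the choice to substitute U+FFFD are assumptions for illustration; rejecting invalid input outright is an equally reasonable policy.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sanitize returns input unchanged when it is valid UTF-8. Otherwise it
// returns a copy in which each run of invalid bytes is replaced by U+FFFD.
// Hypothetical helper; whether to repair or reject is an application choice.
func sanitize(input []byte) (clean []byte, wasValid bool) {
	if utf8.Valid(input) {
		return input, true
	}
	return bytes.ToValidUTF8(input, []byte("\uFFFD")), false
}

func main() {
	good := []byte("Hello, 世界")
	bad := []byte("Hello, \xff\xfe world")

	for _, in := range [][]byte{good, bad} {
		clean, ok := sanitize(in)
		// clean is valid UTF-8 and can be handed to NewSegmenter or SegmentAll.
		fmt.Printf("valid=%v clean=%q\n", ok, clean)
	}
}
```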

Documentation

Overview

Package graphemes implements Unicode grapheme cluster boundaries: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewScanner added in v1.0.4

func NewScanner(r io.Reader) *iterators.Scanner

NewScanner returns a Scanner, to tokenize graphemes per https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries. Iterate through graphemes by calling Scan() until false, then check Err(). See also the bufio.Scanner docs.

Example
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/clipperhouse/uax29/graphemes"
)

func main() {
	text := "Hello, 世界. Nice dog! 👍🐶"
	reader := strings.NewReader(text)

	scanner := graphemes.NewScanner(reader)

	// Scan returns true until error or EOF
	for scanner.Scan() {
		fmt.Printf("%q\n", scanner.Text())
	}

	// Gotta check the error!
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:
"H"
"e"
"l"
"l"
"o"
","
" "
"世"
"界"
"."
" "
"N"
"i"
"c"
"e"
" "
"d"
"o"
"g"
"!"
" "
"👍"
"🐶"

func NewSegmenter added in v1.7.0

func NewSegmenter(data []byte) *iterators.Segmenter

NewSegmenter returns a Segmenter, which is an iterator over the source text. Iterate while Next() is true, and access the grapheme via Bytes().

Example
package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/uax29/graphemes"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🐶")

	segments := graphemes.NewSegmenter(text)

	// Next() returns true until end of data or error
	for segments.Next() {
		fmt.Printf("%q\n", segments.Bytes())
	}

	// Should check the error
	if err := segments.Err(); err != nil {
		log.Fatal(err)
	}
}
Output:
"H"
"e"
"l"
"l"
"o"
","
" "
"世"
"界"
"."
" "
"N"
"i"
"c"
"e"
" "
"d"
"o"
"g"
"!"
" "
"👍"
"🐶"

func NewStringSegmenter added in v1.16.0

func NewStringSegmenter(data string) *iterators.StringSegmenter

NewStringSegmenter returns a StringSegmenter, which is an iterator over the source text. Iterate while Next() is true, and access the grapheme via Text().

func SegmentAll added in v1.7.0

func SegmentAll(data []byte) [][]byte

SegmentAll will iterate through all graphemes and collect them into a [][]byte. This is a convenience method -- if you will be allocating such a slice anyway, this will save you some code.

The downside is that this allocation is unbounded -- O(n) on the number of graphemes. Use Segmenter for more bounded memory usage.

Example
package main

import (
	"fmt"

	"github.com/clipperhouse/uax29/graphemes"
)

func main() {
	text := []byte("Hello, 世界. Nice dog! 👍🐶")

	segments := graphemes.SegmentAll(text)
	fmt.Printf("%q\n", segments)

}
Output:
["H" "e" "l" "l" "o" "," " " "世" "界" "." " " "N" "i" "c" "e" " " "d" "o" "g" "!" " " "👍" "🐶"]

func SegmentAllString added in v1.16.0

func SegmentAllString(data string) []string

SegmentAllString will iterate through all graphemes and collect them into a []string. This is a convenience method -- if you will be allocating such a slice anyway, this will save you some code.

The downside is that this allocation is unbounded -- O(n) on the number of graphemes. Use StringSegmenter for more bounded memory usage.

func SplitFunc added in v1.2.0

func SplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error)

SplitFunc is a bufio.SplitFunc implementation of Unicode grapheme cluster segmentation, for use with bufio.Scanner.

Types

This section is empty.
