uniseg

v0.2.0
Published: Oct 19, 2020 License: MIT Imports: 1 Imported by: 245

README

Unicode Text Segmentation for Go

This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29 (Unicode version 12.0.0).

At this point, only the determination of grapheme cluster boundaries is implemented.

Background

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using a for ... range loop or by conversion: []rune(str). However, multiple code points may be combined into one user-perceived character, which the Unicode specification calls a "grapheme cluster". Here are some examples:

String | Bytes (UTF-8) | Code points (runes) | Grapheme clusters
Käse | 6 bytes: 4b 61 cc 88 73 65 | 5 code points: 4b 61 308 73 65 | 4 clusters: [4b],[61 308],[73],[65]
🏳️‍🌈 | 14 bytes: f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88 | 4 code points: 1f3f3 fe0f 200d 1f308 | 1 cluster: [1f3f3 fe0f 200d 1f308]
🇩🇪 | 8 bytes: f0 9f 87 a9 f0 9f 87 aa | 2 code points: 1f1e9 1f1ea | 1 cluster: [1f1e9 1f1ea]

This package provides a tool to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.

Installation

go get github.com/rivo/uniseg

Basic Example

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	gr := uniseg.NewGraphemes("👍🏼!")
	for gr.Next() {
		fmt.Printf("%x ", gr.Runes())
	}
	// Output: [1f44d 1f3fc] [21]
}

Documentation

Refer to https://godoc.org/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Your Feedback

Add your issue here on GitHub. Feel free to get in touch if you have any questions.

Version

Version tags will be introduced once Golang modules are official. Consider this version 0.1.

Documentation

Overview

Package uniseg implements Unicode Text Segmentation according to Unicode Standard Annex #29 (http://unicode.org/reports/tr29/).

At this point, only the determination of grapheme cluster boundaries is implemented.

Functions

func GraphemeClusterCount

func GraphemeClusterCount(s string) (n int)

GraphemeClusterCount returns the number of user-perceived characters (grapheme clusters) for the given string. To calculate this number, it iterates through the string using the Graphemes iterator.

Types

type Graphemes

type Graphemes struct {
	// contains filtered or unexported fields
}

Graphemes implements an iterator over Unicode extended grapheme clusters, as specified in Unicode Standard Annex #29. Grapheme clusters correspond to "user-perceived characters". These characters often consist of multiple code points. For example, the "woman kissing woman" emoji consists of 8 code points: woman + ZWJ + heavy black heart (2 code points) + ZWJ + kiss mark + ZWJ + woman. The rules described in Annex #29 must be applied to group those code points into clusters perceived by the user as one character.

Example

Type example.

gr := NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
Output:

[1f44d 1f3fc] [21]

func NewGraphemes

func NewGraphemes(s string) *Graphemes

NewGraphemes returns a new grapheme cluster iterator.

func (*Graphemes) Bytes

func (g *Graphemes) Bytes() []byte

Bytes returns a byte slice which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, nil is returned.

func (*Graphemes) Next

func (g *Graphemes) Next() bool

Next advances the iterator by one grapheme cluster and returns false if no clusters are left. This function must be called before the first cluster is accessed.

func (*Graphemes) Positions

func (g *Graphemes) Positions() (int, int)

Positions returns the interval of the current grapheme cluster as byte positions into the original string. The first returned value "from" indexes the first byte and the second returned value "to" indexes the first byte that is not included anymore, i.e. str[from:to] is the current grapheme cluster of the original string "str". If Next() has not yet been called, both values are 0. If the iterator is already past the end, both values are 1.

func (*Graphemes) Reset

func (g *Graphemes) Reset()

Reset puts the iterator into its initial state such that the next call to Next() sets it to the first grapheme cluster again.

func (*Graphemes) Runes

func (g *Graphemes) Runes() []rune

Runes returns a slice of runes (code points) which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, nil is returned.

func (*Graphemes) Str

func (g *Graphemes) Str() string

Str returns a substring of the original string which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, an empty string is returned.
