utf8n

Version: v0.0.0-...-e1fbafc
Published: Sep 10, 2019 License: MIT Imports: 4 Imported by: 0

README

go-utf8n

Package utf8n implements functions and constants to support normalizing text encoded in UTF-8.

This package is similar to the Go built-in "unicode/utf8" package, except it normalizes ‘line separator’ and ‘paragraph separator’ characters.

Specifically, it transforms:

	CR LF ⇒ LS

	LF    ⇒ LS

	CR    ⇒ LS

	NEL   ⇒ LS

And then after (conceptually) doing that, transforms:

	LS LS ⇒ PS

The meanings of LF, CR, NEL, LS, and PS are:

	LF  = “line feed”            = U+000A = '\u000A' = '\n'

	CR  = “carriage return”      = U+000D = '\u000D' = '\r'

	NEL = “next line”            = U+0085 = '\u0085'

	LS  = “line separator”       = U+2028 = '\u2028'

	PS  = “paragraph separator”  = U+2029 = '\u2029'

The result of these transformations is that:

№1: every ‘line separator’ and every ‘paragraph separator’ is represented by a single rune,

№2: every ‘line separator’ is represented by the same rune (LS), and every ‘paragraph separator’ by the same rune (PS).
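
The two passes described above can be sketched with the standard library alone. This is only an illustration of the rules, not the package's implementation: `normalize` and `lineBreaks` are hypothetical names, and the source does not say how runs of three or more line breaks are handled, so the sketch simply pairs them from left to right.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	ls = "\u2028" // line separator
	ps = "\u2029" // paragraph separator
)

// lineBreaks maps every line-break spelling to LS. strings.NewReplacer
// compares old strings in argument order, so "\r\n" is listed before
// "\r" and "\n" to keep a CR LF pair from becoming two separators.
var lineBreaks = strings.NewReplacer(
	"\r\n", ls,
	"\n", ls,
	"\r", ls,
	"\u0085", ls,
)

// normalize is a hypothetical helper (not part of utf8n) that applies
// the two conceptual passes to a whole string at once.
func normalize(s string) string {
	s = lineBreaks.Replace(s)               // pass 1: CR LF, LF, CR, NEL => LS
	return strings.ReplaceAll(s, ls+ls, ps) // pass 2: LS LS => PS
}

func main() {
	fmt.Printf("%q\n", normalize("one\r\ntwo\n\nthree"))
	// prints "one\u2028two\u2029three"
}
```

Unlike this whole-string sketch, the package itself works rune by rune, which is why its functions mirror "unicode/utf8".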

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/github.com/reiver/go-utf8n

Documentation

Constants

const (
	LF  = '\n'     // line feed
	CR  = '\r'     // carriage return
	NEL = '\u0085' // next line
	LS  = '\u2028' // line separator
	PS  = '\u2029' // paragraph separator
)
const (
	RuneError rune = utf8.RuneError
)
const (
	UTF8Max = 4 // the maximum number of bytes in a UTF-8 encoded Unicode character.
)
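
As a quick check on these constants, the sketch below prints how many bytes each separator occupies in UTF-8, using only the standard "unicode/utf8" package. `encodedLen` is a hypothetical wrapper added for illustration. The point is that normalization can change the byte length of the text: a 1-byte LF becomes a 3-byte LS.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical helper that reports how many bytes
// the rune r occupies when encoded as UTF-8.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	names := []string{"LF", "CR", "NEL", "LS", "PS"}
	runes := []rune{'\n', '\r', '\u0085', '\u2028', '\u2029'}

	for i, r := range runes {
		fmt.Printf("%-3s %U: %d byte(s)\n", names[i], r, encodedLen(r))
	}

	// The standard library's own upper bound, which UTF8Max matches.
	fmt.Println("utf8.UTFMax =", utf8.UTFMax)
}
```

LF and CR take 1 byte each, NEL takes 2, and LS and PS take 3, all within the 4-byte maximum.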

Variables

This section is empty.

Functions

func DecodeRune

func DecodeRune(p []byte) (r rune, size int)

DecodeRune is similar to utf8.DecodeRune() in the Go built-in "unicode/utf8" package, except it normalizes ‘line separator’ and ‘paragraph separator’ characters.

Specifically, it transforms:

CR LF ⇒ LS

LF    ⇒ LS

CR    ⇒ LS

NEL   ⇒ LS

And then after (conceptually) doing that, transforms:

LS LS ⇒ PS

The meanings of LF, CR, NEL, LS, and PS are:

LF  = “line feed”            = U+000A = '\u000A' = '\n'

CR  = “carriage return”      = U+000D = '\u000D' = '\r'

NEL = “next line”            = U+0085 = '\u0085'

LS  = “line separator”       = U+2028 = '\u2028'

PS  = “paragraph separator”  = U+2029 = '\u2029'

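The returned ‘size’ is the number of bytes read from ‘p’ before the transformation is applied. For example, a CR LF pair that decodes to LS reports a size of 2, even though LS itself is a single rune.

That way you can advance past the decoded input with: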

p = p[size:]

And the returned ‘r’ is the transformed rune.

For example:

var utf8Bytes []byte = []byte("This is the 1st line\r\nThis is the second line\r\n\r\nThis is the 2nd paragraph.")

var p []byte = utf8Bytes

var builder strings.Builder // <--- We will put our result here.

for {
	r, size := utf8n.DecodeRune(p)

	if utf8n.RuneError == r && 0 == size { // Nothing more to decode.

		break // <-------------- We get out of the loop with this!
	}
	if utf8n.RuneError == r { // An actual error.
		fmt.Println("ERROR: invalid UTF-8")
		return
	}

	p = p[size:] // <---- We skip past what we just decoded.

	builder.WriteRune(r)
}

fmt.Printf("RESULT: %s\n", builder.String())
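
To make the ‘size’ contract concrete, here is a minimal sketch of a decoder with the same shape, built only on the standard "unicode/utf8" package. It is not the package's source: `decodeLine` and `decodeNormalized` are hypothetical names, and the look-ahead strategy is an assumption about one way the documented rules could be realized.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	LF  = '\n'
	CR  = '\r'
	NEL = '\u0085'
	LS  = '\u2028'
	PS  = '\u2029'
)

// decodeLine (hypothetical) decodes one rune from p and maps
// CR LF, LF, CR and NEL to LS. The size is the number of input
// bytes consumed, so CR LF yields (LS, 2).
func decodeLine(p []byte) (rune, int) {
	r, size := utf8.DecodeRune(p)
	switch r {
	case CR:
		if size < len(p) && p[size] == LF {
			return LS, size + 1 // CR LF is one line break
		}
		return LS, size
	case LF, NEL:
		return LS, size
	}
	return r, size
}

// decodeNormalized (hypothetical) also folds two consecutive
// line breaks into PS, again reporting the bytes consumed.
func decodeNormalized(p []byte) (rune, int) {
	r, size := decodeLine(p)
	if r != LS {
		return r, size
	}
	if r2, size2 := decodeLine(p[size:]); r2 == LS {
		return PS, size + size2
	}
	return LS, size
}

func main() {
	p := []byte("a\r\n\r\nb")
	for {
		r, size := decodeNormalized(p)
		if size == 0 { // nothing more to decode
			break
		}
		fmt.Printf("%U consumed %d byte(s)\n", r, size)
		p = p[size:]
	}
	// prints:
	// U+0061 consumed 1 byte(s)
	// U+2029 consumed 4 byte(s)
	// U+0062 consumed 1 byte(s)
}
```

The returned PS would encode to 3 bytes, yet the reported size is 4, because the size counts the original CR LF CR LF input rather than the transformed rune.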

func RuneScanner

func RuneScanner(readSeeker io.ReadSeeker) io.RuneScanner

Example

var text = `Hello world!

Khodafez.

apple
BANANA
Cherry
dATE
`

var readSeeker io.ReadSeeker = strings.NewReader(text)

var runeScanner io.RuneScanner = utf8n.RuneScanner(readSeeker)

var buffer strings.Builder

for {
	r, _, err := runeScanner.ReadRune()
	if io.EOF == err { // No more runes to read.
		break
	}
	if nil != err {
		fmt.Printf("ERROR: problem getting next rune: %s\n", err)
		return
	}

	switch r {
	case '\t':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (tab)\n", string(r))

	case ' ':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (space)\n", string(r))

	case '\u2028':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (line separator)\n", string(r))

	case '\u2029':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (paragraph separator)\n", string(r))

	default:
		buffer.WriteRune(r)
	}
}
if 0 < buffer.Len() {
	fmt.Printf("%q\n", buffer.String())
	buffer.Reset()
}

Output:

"Hello"
" " (space)
"world!"
"\u2029" (paragraph separator)
"Khodafez."
"\u2029" (paragraph separator)
"apple"
"\u2028" (line separator)
"BANANA"
"\u2028" (line separator)
"Cherry"
"\u2028" (line separator)
"dATE"
"\u2028" (line separator)
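
Because normalized text uses exactly one rune per line break and one per paragraph break, downstream code can split it without special-casing CR, LF, or NEL. The sketch below assumes its input is already normalized (the literal is written with LS and PS directly, matching the output above), and `splitParagraphs` is a hypothetical helper, not part of utf8n.

```go
package main

import (
	"fmt"
	"strings"
)

// splitParagraphs (hypothetical) splits already-normalized text
// into paragraphs on PS, and each paragraph into lines on LS.
// Empty lines and empty paragraphs are dropped.
func splitParagraphs(text string) [][]string {
	isLS := func(r rune) bool { return r == '\u2028' }

	var paragraphs [][]string
	for _, para := range strings.Split(text, "\u2029") {
		lines := strings.FieldsFunc(para, isLS)
		if len(lines) > 0 {
			paragraphs = append(paragraphs, lines)
		}
	}
	return paragraphs
}

func main() {
	text := "Hello world!\u2029Khodafez.\u2029apple\u2028BANANA\u2028Cherry\u2028dATE\u2028"

	for i, lines := range splitParagraphs(text) {
		fmt.Printf("paragraph %d: %q\n", i+1, lines)
	}
	// prints:
	// paragraph 1: ["Hello world!"]
	// paragraph 2: ["Khodafez."]
	// paragraph 3: ["apple" "BANANA" "Cherry" "dATE"]
}
```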

Types

This section is empty.
