utf8n

package module

v0.0.0-...-e1fbafc Published: Sep 10, 2019 License: MIT Imports: 4 Imported by: 0

Repository

github.com/reiver/go-utf8n

README ¶

go-utf8n

Package utf8n implements functions and constants to support normalizing text encoded in UTF-8.

This package is similar to the "unicode/utf8" package in the Go standard library, except that it normalizes ‘line separator’ and ‘paragraph separator’ characters.

It first transforms:

	CR LF ⇒ LS

	LF    ⇒ LS

	CR    ⇒ LS

	NEL   ⇒ LS

Then, after (conceptually) applying those rules, it transforms:

	LS LS ⇒ PS

The meanings of LF, CR, NEL, LS, and PS are:

	LF  = “line feed”            = U+000A = '\u000A' = '\n'

	CR  = “carriage return”      = U+000D = '\u000D' = '\r'

	NEL = “next line”            = U+0085 = '\u0085'

	LS  = “line separator”       = U+2028 = '\u2028'

	PS  = “paragraph separator”  = U+2029 = '\u2029'

The result of these transformations is that:

№1: ‘line separator’ and ‘paragraph separator’ characters are always represented by a single rune,

№2: ‘line separator’ and ‘paragraph separator’ characters are always represented by the same runes (LS and PS, respectively).
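
The two conceptual passes above can be sketched with the standard library alone. This is a minimal illustration of the documented rules, not the package's implementation: `normalize` and `firstPass` are hypothetical names, and the sketch assumes that runs of line breaks are paired left to right.

```go
package main

import (
	"fmt"
	"strings"
)

// firstPass maps every line-break sequence to LS (U+2028).
// "\r\n" is listed first so that it takes priority over a lone "\r".
var firstPass = strings.NewReplacer(
	"\r\n", "\u2028",
	"\n", "\u2028",
	"\r", "\u2028",
	"\u0085", "\u2028",
)

// normalize is a hypothetical helper (not part of utf8n) that applies
// the first pass and then collapses each LS LS pair into a single PS.
func normalize(s string) string {
	s = firstPass.Replace(s)
	return strings.ReplaceAll(s, "\u2028\u2028", "\u2029")
}

func main() {
	fmt.Printf("%q\n", normalize("one\r\ntwo\n\nthree"))
	// prints "one\u2028two\u2029three"
}
```

Working on whole strings keeps the sketch short; the package itself exposes the same idea one rune at a time through DecodeRune and RuneScanner.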

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/github.com/reiver/go-utf8n

Documentation ¶

Index ¶

Constants
func DecodeRune(p []byte) (r rune, size int)
func RuneScanner(readSeeker io.ReadSeeker) io.RuneScanner

Examples ¶

RuneScanner

Constants ¶

const (
	LF  = '\n'     // line feed
	CR  = '\r'     // carriage return
	NEL = '\u0085' // next line
	LS  = '\u2028' // line separator
	PS  = '\u2029' // paragraph separator
)

const (
	RuneError rune = utf8.RuneError
)

const (
	UTF8Max = 4 // the maximum number of bytes a UTF-8 encoded Unicode character can be.
)

Variables ¶

This section is empty.

Functions ¶

func DecodeRune ¶

func DecodeRune(p []byte) (r rune, size int)

DecodeRune is similar to utf8.DecodeRune() in the "unicode/utf8" package of the Go standard library, except that it normalizes ‘line separator’ and ‘paragraph separator’ characters.

It first transforms:

CR LF ⇒ LS

LF    ⇒ LS

CR    ⇒ LS

NEL   ⇒ LS

Then, after (conceptually) applying those rules, it transforms:

LS LS ⇒ PS

The meanings of LF, CR, NEL, LS, and PS are:

LF  = “line feed”            = U+000A = '\u000A' = '\n'

CR  = “carriage return”      = U+000D = '\u000D' = '\r'

NEL = “next line”            = U+0085 = '\u0085'

LS  = “line separator”       = U+2028 = '\u2028'

PS  = “paragraph separator”  = U+2029 = '\u2029'

The returned ‘size’ is the number of bytes read from ‘p’ before the transformation, not the encoded length of the returned rune.

That lets you advance past the input that was just decoded:

p = p[size:]

The returned ‘r’ is the transformed rune.

For example:

var utf8Bytes []byte = []byte("This is the 1st line\r\nThis is the second line\r\n\r\nThis is the 2nd paragraph.")

var p []byte = utf8Bytes

var builder strings.Builder // <--- We will put our result here.

for {
	r, size := utf8n.DecodeRune(p)

	if utf8n.RuneError == r && 0 == size { // Nothing more to decode.
		break // <-------------- We get out of the loop with this!
	}
	if utf8n.RuneError == r { // An actual error.
		fmt.Println("ERROR: invalid UTF-8")
		return
	}

	p = p[size:] // <---- We skip past what we just decoded.

	builder.WriteRune(r)
}

fmt.Printf("RESULT: %s\n", builder.String())

func RuneScanner ¶

func RuneScanner(readSeeker io.ReadSeeker) io.RuneScanner

Example ¶

var text = `Hello world!

Khodafez.

apple
BANANA
Cherry
dATE
`

var readSeeker io.ReadSeeker = strings.NewReader(text)

var runeScanner io.RuneScanner = utf8n.RuneScanner(readSeeker)

var buffer strings.Builder

for {
	r, _, err := runeScanner.ReadRune()
	if io.EOF == err {
		break
	}
	if nil != err {
		fmt.Printf("ERROR: problem getting next rune: %s\n", err)
		return
	}

	switch r {
	case '\t':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (tab)\n", string(r))

	case ' ':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (space)\n", string(r))

	case '\u2028':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (line separator)\n", string(r))

	case '\u2029':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()

		fmt.Printf("%q (paragraph separator)\n", string(r))

	default:
		buffer.WriteRune(r)
	}
}
if 0 < buffer.Len() {
	fmt.Printf("%q\n", buffer.String())
	buffer.Reset()
}

Output:

"Hello"
" " (space)
"world!"
"\u2029" (paragraph separator)
"Khodafez."
"\u2029" (paragraph separator)
"apple"
"\u2028" (line separator)
"BANANA"
"\u2028" (line separator)
"Cherry"
"\u2028" (line separator)
"dATE"
"\u2028" (line separator)

Types ¶

This section is empty.
