Documentation
Overview
Package utf8n implements functions and constants to support normalizing text encoded in UTF-8.
This package is similar to the "unicode/utf8" package in the Go standard library, except that it normalizes ‘line separator’ and ‘paragraph separator’ characters.
It first transforms:
	CR LF ⇒ LS
	LF    ⇒ LS
	CR    ⇒ LS
	NEL   ⇒ LS
And then, after (conceptually) doing that, it transforms:
	LS LS ⇒ PS
The meanings of LF, CR, NEL, LS, and PS are:
	LF  = “line feed”           = U+000A = '\u000A' = '\n'
	CR  = “carriage return”     = U+000D = '\u000D' = '\r'
	NEL = “next line”           = U+0085 = '\u0085'
	LS  = “line separator”      = U+2028 = '\u2028'
	PS  = “paragraph separator” = U+2029 = '\u2029'
The result of these transformations is that:
№1: each ‘line separator’ and each ‘paragraph separator’ is always represented by a single rune, and
№2: ‘line separator’ and ‘paragraph separator’ characters are always represented by the same runes (LS and PS, respectively).
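The two-step transformation described above can be sketched over a whole string using only the standard library. This is an illustration of the rules, not the package's implementation: the helper name normalize is hypothetical, and the way it handles a run of three or more line breaks (pairing them left to right) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	ls = "\u2028" // line separator
	ps = "\u2029" // paragraph separator
)

// normalize is a hypothetical helper (not part of utf8n) that applies the
// two conceptual steps to a whole string.
func normalize(s string) string {
	// Step 1: map every line-break form to LS. "\r\n" is listed first so
	// that a CR LF pair becomes a single LS rather than two.
	step1 := strings.NewReplacer("\r\n", ls, "\n", ls, "\r", ls, "\u0085", ls)

	// Step 2: collapse each pair of adjacent LS into one PS.
	return strings.ReplaceAll(step1.Replace(s), ls+ls, ps)
}

func main() {
	fmt.Printf("%q\n", normalize("one\r\ntwo\n\nthree\u0085four"))
	// prints "one\u2028two\u2029three\u2028four"
}
```

Doing the work in two passes keeps each rule visible. A streaming decoder would have to fold both steps into a single read.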
Constants
const (
	LF  = '\n'     // line feed
	CR  = '\r'     // carriage return
	NEL = '\u0085' // next line
	LS  = '\u2028' // line separator
	PS  = '\u2029' // paragraph separator
)
const (
RuneError rune = utf8.RuneError
)
const (
UTF8Max = 4 // the maximum number of bytes a UTF-8 encoded Unicode character can be.
)
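As a small aside on these constants, the sketch below uses only the standard "unicode/utf8" package to print how many bytes each separator occupies when encoded. The helper name encodedLen is hypothetical. It shows why the byte length of an input sequence such as CR LF (2 bytes) can differ from the byte length of the rune it is normalized to (LS, 3 bytes).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical helper that returns the number of bytes
// r occupies when encoded as UTF-8.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	separators := []struct {
		name string
		r    rune
	}{
		{"LF", '\n'},
		{"CR", '\r'},
		{"NEL", '\u0085'},
		{"LS", '\u2028'},
		{"PS", '\u2029'},
	}
	for _, s := range separators {
		fmt.Printf("%-3s %U %d byte(s)\n", s.name, s.r, encodedLen(s.r))
	}

	// The standard library's own upper bound on an encoded rune.
	fmt.Println("utf8.UTFMax:", utf8.UTFMax)
}
```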
Variables
This section is empty.
Functions
func DecodeRune
DecodeRune is similar to utf8.DecodeRune() in the "unicode/utf8" package of the Go standard library, except that it normalizes ‘line separator’ and ‘paragraph separator’ characters.
It first transforms:
	CR LF ⇒ LS
	LF    ⇒ LS
	CR    ⇒ LS
	NEL   ⇒ LS
And then, after (conceptually) doing that, it transforms:
	LS LS ⇒ PS
The meanings of LF, CR, NEL, LS, and PS are:
	LF  = “line feed”           = U+000A = '\u000A' = '\n'
	CR  = “carriage return”     = U+000D = '\u000D' = '\r'
	NEL = “next line”           = U+0085 = '\u0085'
	LS  = “line separator”      = U+2028 = '\u2028'
	PS  = “paragraph separator” = U+2029 = '\u2029'
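The returned ‘size’ is the number of bytes read from ‘p’ before the transformation, i.e. the length of the original input sequence rather than of the transformed rune.
That way you can advance past what was just decoded with:
	p = p[size:]
The returned ‘r’ is the transformed rune.
For example: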
	var utf8Bytes []byte = []byte("This is the 1st line\r\nThis is the second line\r\n\r\nThis is the 2nd paragraph.")

	var p []byte = utf8Bytes

	var builder strings.Builder // <--- We will put our result here.

	for {
		r, size := utf8n.DecodeRune(p)
		if utf8n.RuneError == r && 0 == size {
			// Nothing more to decode.
			break // <-------------- We get out of the loop with this!
		}
		if utf8n.RuneError == r {
			// An actual error.
			fmt.Println("ERROR: invalid UTF-8")
			return
		}

		p = p[size:] // <---- We skip past what we just decoded.

		builder.WriteRune(r)
	}

	fmt.Printf("RESULT: %s\n", builder.String())
func RuneScanner
func RuneScanner(readSeeker io.ReadSeeker) io.RuneScanner
Example
var text = `Hello world!

Khodafez.

apple
BANANA
Cherry
dATE
`

var readSeeker io.ReadSeeker = strings.NewReader(text)

var runeScanner io.RuneScanner = utf8n.RuneScanner(readSeeker)

var buffer strings.Builder

for {
	r, _, err := runeScanner.ReadRune()
	if nil != err && io.EOF == err {
		break
	}
	if nil != err {
		fmt.Printf("ERROR: problem getting next rune: %s\n", err)
		return
	}

	switch r {
	case '\t':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()
		fmt.Printf("%q (tab)\n", string(r))
	case ' ':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()
		fmt.Printf("%q (space)\n", string(r))
	case '\u2028':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()
		fmt.Printf("%q (line separator)\n", string(r))
	case '\u2029':
		fmt.Printf("%q\n", buffer.String())
		buffer.Reset()
		fmt.Printf("%q (paragraph separator)\n", string(r))
	default:
		buffer.WriteRune(r)
	}
}

if 0 < buffer.Len() {
	fmt.Printf("%q\n", buffer.String())
	buffer.Reset()
}
Output:
"Hello"
" " (space)
"world!"
"\u2029" (paragraph separator)
"Khodafez."
"\u2029" (paragraph separator)
"apple"
"\u2028" (line separator)
"BANANA"
"\u2028" (line separator)
"Cherry"
"\u2028" (line separator)
"dATE"
"\u2028" (line separator)
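Because every break ends up as either LS or PS, downstream code can group text with plain string splitting. The sketch below assumes its input has already been normalized (here it is written out by hand to match the example's output). The helper name splitParagraphs is hypothetical and not part of utf8n.

```go
package main

import (
	"fmt"
	"strings"
)

// splitParagraphs is a hypothetical helper. It assumes its input is already
// normalized, so PS (U+2029) is the only paragraph break and LS (U+2028) is
// the only line break.
func splitParagraphs(normalized string) [][]string {
	var paragraphs [][]string
	for _, paragraph := range strings.Split(normalized, "\u2029") {
		paragraphs = append(paragraphs, strings.Split(paragraph, "\u2028"))
	}
	return paragraphs
}

func main() {
	// Hand-written normalized form of the example text above.
	normalized := "Hello world!\u2029Khodafez.\u2029apple\u2028BANANA\u2028Cherry\u2028dATE\u2028"

	for i, lines := range splitParagraphs(normalized) {
		fmt.Printf("paragraph %d: %q\n", i+1, lines)
	}
}
```

Note that the trailing LS leaves an empty final line in the last paragraph. A caller that does not want it could trim the separator before splitting.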
Types
This section is empty.