Documentation ¶
Overview ¶
Package utf8 mirrors FVUTF8.pas: encoding sniffing, conversion to UTF-8, and grapheme-cluster-aware display-width / slicing helpers.
The Pascal version juggles UTF-16 surrogates because Delphi's string type is UTF-16; Go strings are already UTF-8, so the codec layer is thin (mostly the standard `unicode/utf8` package). Cluster-aware width and slicing delegate to github.com/rivo/uniseg, which implements UAX #29 (text segmentation) and the wcwidth-style monospace cell-width tables — handling ZWJ clusters (family 👨👩👧👦, rainbow flag 🏳️🌈), VS16 emoji-presentation promotion, regional indicators (🇩🇪), skin-tone modifiers (👋🏼), and combining marks without per-codepoint special cases.
Index ¶
- func ANSIToUTF8(data []byte) []byte
- func BOMLength(enc FileEncoding) int
- func CStrDisplayWidth(s string) int
- func CharLen(b byte) int
- func ConvertToUTF8(data []byte, enc FileEncoding) []byte
- func CopyDisplayCells(s string, startCol, maxWidth int) string
- func DecodeRune(buf []byte) (r rune, n int)
- func IsTrailByte(b byte) bool
- func StringDisplayWidth(s string) int
- func UTF16BEToUTF8(data []byte, skipBOM bool) []byte
- func UTF16LEToUTF8(data []byte, skipBOM bool) []byte
- type FileEncoding
  - func DetectEncoding(data []byte) FileEncoding
Examples ¶
- DetectEncoding
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ANSIToUTF8 ¶
func ANSIToUTF8(data []byte) []byte
ANSIToUTF8 converts a CP1252 byte slice to UTF-8.
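As an illustration of what a CP1252-to-UTF-8 conversion involves, here is a minimal standard-library sketch. It is not the package's implementation: `cp1252ToUTF8` and `cp1252High` are hypothetical names, and mapping the five unassigned CP1252 bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the C1 control of the same value is an assumption made for this example.

```go
package main

import "fmt"

// cp1252High maps bytes 0x80..0x9F to Unicode code points following the
// Windows-1252 table. Five bytes are unassigned in that table; this sketch
// maps them to the C1 control with the same value (an assumption).
var cp1252High = [32]rune{
	0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
	0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
	0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
	0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
}

// cp1252ToUTF8 (hypothetical name) converts CP1252 bytes to UTF-8.
// Outside 0x80..0x9F the byte value equals the Unicode code point.
func cp1252ToUTF8(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for _, b := range data {
		r := rune(b)
		if b >= 0x80 && b <= 0x9F {
			r = cp1252High[b-0x80]
		}
		out = append(out, string(r)...) // string(rune) yields its UTF-8 encoding
	}
	return out
}

func main() {
	in := []byte{0x93, 'c', 'a', 'f', 0xE9, 0x94, ' ', 0x80, '5'}
	fmt.Printf("%s\n", cp1252ToUTF8(in))
}
```

Only the 0x80..0x9F block needs a table; every other byte can be widened directly, which keeps the conversion a single pass.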
func BOMLength ¶
func BOMLength(enc FileEncoding) int
BOMLength returns the number of leading bytes the encoding's BOM occupies, or 0 if none.
func CStrDisplayWidth ¶
func CStrDisplayWidth(s string) int
CStrDisplayWidth is like StringDisplayWidth but skips '~' hotkey markers that don't render as glyphs. '~' is single-byte ASCII and never appears inside an emoji cluster, so a simple byte strip is safe.
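The byte-strip idea can be sketched as follows. This is a simplified stand-in, not the package's code: `stripHotkeys` and `labelWidth` are hypothetical names, and the width computed here is a plain rune count, which is only correct for printable ASCII labels. The real function measures grapheme clusters instead.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stripHotkeys (hypothetical) removes every '~' marker. Because '~' is a
// single ASCII byte, it can never be part of a multi-byte UTF-8 sequence,
// so removing it cannot split an emoji or any other cluster.
func stripHotkeys(s string) string {
	return strings.ReplaceAll(s, "~", "")
}

// labelWidth (hypothetical) is a stand-in that counts runes after the
// strip. It matches the cell width only for ASCII-only labels.
func labelWidth(s string) int {
	return utf8.RuneCountInString(stripHotkeys(s))
}

func main() {
	fmt.Println(stripHotkeys("~F~ile"), labelWidth("~F~ile"))
}
```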
func CharLen ¶
func CharLen(b byte) int
CharLen returns the expected total length in bytes of a UTF-8 sequence whose lead byte is b. It returns 1 for ASCII and 0 for an invalid lead byte.
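A lead-byte classifier of this kind can be sketched from the UTF-8 bit patterns alone. The function name `charLen` is hypothetical, and the sketch assumes that only the high-bit prefix is checked; whether the package also rejects bytes such as 0xC0, 0xC1, or 0xF5 and above is not stated in this documentation.

```go
package main

import "fmt"

// charLen (hypothetical) classifies a lead byte by its high bits:
//   0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
// Continuation bytes (10xxxxxx) and 11111xxx are not valid leads.
func charLen(b byte) int {
	switch {
	case b < 0x80:
		return 1
	case b&0xE0 == 0xC0:
		return 2
	case b&0xF0 == 0xE0:
		return 3
	case b&0xF8 == 0xF0:
		return 4
	default:
		return 0
	}
}

func main() {
	for _, b := range []byte{'A', 0xC3, 0xE2, 0xF0, 0x80, 0xFF} {
		fmt.Printf("%#02x -> %d\n", b, charLen(b))
	}
}
```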
func ConvertToUTF8 ¶
func ConvertToUTF8(data []byte, enc FileEncoding) []byte
ConvertToUTF8 strips any BOM and returns UTF-8 bytes for the given encoding. Pure UTF-8 inputs are returned as-is (BOM removed).
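For the pure UTF-8 case, the "BOM removed" behavior amounts to trimming a three-byte prefix. A minimal sketch, assuming the standard library's `bytes.TrimPrefix`; the name `stripUTF8BOM` is hypothetical and the package may do this differently:

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripUTF8BOM (hypothetical) returns data without a leading UTF-8 BOM.
// Input that has no BOM is returned unchanged.
func stripUTF8BOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	fmt.Printf("%q\n", stripUTF8BOM([]byte{0xEF, 0xBB, 0xBF, 'h', 'i'}))
	fmt.Printf("%q\n", stripUTF8BOM([]byte("hi")))
}
```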
func CopyDisplayCells ¶
func CopyDisplayCells(s string, startCol, maxWidth int) string
CopyDisplayCells returns the substring of s that occupies cells [startCol, startCol+maxWidth) in the rendered terminal. Whole grapheme clusters are kept together; if a wide cluster would straddle the end it is omitted entirely.
func DecodeRune ¶
func DecodeRune(buf []byte) (r rune, n int)
DecodeRune decodes one rune from buf and returns it together with the byte count it consumed. Mirrors DecodeUTF8CodePoint.
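The (rune, byte count) return shape is the same one the standard library's `utf8.DecodeRune` uses, so a decode loop can be sketched with it. This example calls the standard `unicode/utf8` package, not this package; `decodeAll` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (hypothetical) walks buf one rune at a time, advancing by the
// byte count each decode reports. The standard library returns
// (utf8.RuneError, 1) for an invalid byte, so the loop always progresses.
func decodeAll(buf []byte) []rune {
	var out []rune
	for len(buf) > 0 {
		r, n := utf8.DecodeRune(buf)
		out = append(out, r)
		buf = buf[n:]
	}
	return out
}

func main() {
	for _, r := range decodeAll([]byte("héllo")) {
		fmt.Printf("%c U+%04X\n", r, r)
	}
}
```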
func IsTrailByte ¶
func IsTrailByte(b byte) bool
IsTrailByte reports whether b is a UTF-8 continuation byte (10xxxxxx).
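The 10xxxxxx test is a single mask-and-compare. One typical use, sketched below, is backing up from an arbitrary byte offset to the start of the sequence that contains it. The names `isTrail` and `runeStart` are hypothetical, and `runeStart` assumes 0 <= i < len(s).

```go
package main

import "fmt"

// isTrail (hypothetical) reports whether b has the bit pattern 10xxxxxx.
func isTrail(b byte) bool {
	return b&0xC0 == 0x80
}

// runeStart (hypothetical) moves i left until it no longer points at a
// continuation byte. Assumes 0 <= i < len(s).
func runeStart(s string, i int) int {
	for i > 0 && isTrail(s[i]) {
		i--
	}
	return i
}

func main() {
	s := "aé€" // byte layout: a=0, é=1..2, €=3..5
	fmt.Println(runeStart(s, 5), runeStart(s, 2), runeStart(s, 0))
}
```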
func StringDisplayWidth ¶
func StringDisplayWidth(s string) int
StringDisplayWidth returns the number of terminal cells s occupies. Grapheme clusters (ZWJ sequences, regional indicators, skin-tone modifiers, combining marks) count by the cluster's monospace width via UAX #11 / wcwidth.
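func UTF16BEToUTF8 ¶
func UTF16BEToUTF8(data []byte, skipBOM bool) []byte
UTF16BEToUTF8 converts a UTF-16 big-endian byte slice to UTF-8. If skipBOM is true, a leading FE FF is removed.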
func UTF16LEToUTF8 ¶
func UTF16LEToUTF8(data []byte, skipBOM bool) []byte
UTF16LEToUTF8 converts a UTF-16 little-endian byte slice to UTF-8. If skipBOM is true, a leading FF FE is removed.
Types ¶
type FileEncoding ¶
type FileEncoding int
FileEncoding identifies the encoding DetectEncoding inferred for a byte slice. Mirrors TFileEncoding in FVUTF8.pas. Saved on Editor / FileEditor so SaveFile can round-trip the original BOM.
const (
	EncUnknown FileEncoding = iota // sniff failed; treat as binary or ANSI
	EncUTF8                        // plain UTF-8, no BOM
	EncUTF8BOM                     // UTF-8 with leading EF BB BF
	EncUTF16LE                     // UTF-16 little-endian (FF FE BOM optional)
	EncUTF16BE                     // UTF-16 big-endian (FE FF BOM optional)
	EncANSI                        // CP1252 fallback for legacy 8-bit files
)
func DetectEncoding ¶
func DetectEncoding(data []byte) FileEncoding
DetectEncoding examines up to len(data) bytes and returns the most likely encoding. Mirrors DetectEncoding in FVUTF8.pas.
Example ¶
Sniff the encoding of a few byte slices. DetectEncoding inspects the leading bytes for a BOM, then falls back to UTF-8 validation, then ANSI/CP1252.
package main

import (
	"fmt"

	"github.com/oldwired/fv-go/pkg/fv/utf8"
)

func main() {
	plain := []byte("hello, world")
	bom := []byte{0xEF, 0xBB, 0xBF, 'h', 'i'}
	utf16le := []byte{0xFF, 0xFE, 'h', 0, 'i', 0}

	fmt.Println(utf8.DetectEncoding(plain) == utf8.EncUTF8)
	fmt.Println(utf8.DetectEncoding(bom) == utf8.EncUTF8BOM)
	fmt.Println(utf8.DetectEncoding(utf16le) == utf8.EncUTF16LE)
}
Output:
true
true
true
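The sniffing order described for DetectEncoding (BOM first, then UTF-8 validation, then the ANSI fallback) can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: `sniff` is a hypothetical function that returns plain strings instead of FileEncoding values, and it makes no attempt to recognize BOM-less UTF-16 or to report an unknown encoding.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// sniff (hypothetical) applies the documented order: BOM checks first,
// then UTF-8 validation, then the CP1252/ANSI fallback.
func sniff(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8 BOM"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "UTF-16LE"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "UTF-16BE"
	case utf8.Valid(data):
		return "UTF-8"
	default:
		return "ANSI"
	}
}

func main() {
	fmt.Println(sniff([]byte("hello, world")))
	fmt.Println(sniff([]byte{0xFF, 0xFE, 'h', 0, 'i', 0}))
	fmt.Println(sniff([]byte{'c', 'a', 'f', 0xE9})) // lone 0xE9 is not valid UTF-8
}
```

The BOM checks have to come before the validity check, because a UTF-8 BOM is itself valid UTF-8 and would otherwise be reported as plain UTF-8.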