utf8

package
v0.5.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 6, 2026 License: LGPL-2.1 Imports: 3 Imported by: 0

Documentation

Overview

Package utf8 mirrors FVUTF8.pas: encoding sniffing, conversion to UTF-8, and grapheme-cluster-aware display-width / slicing helpers.

The Pascal version juggles UTF-16 surrogates because Delphi's string type is UTF-16; Go strings are already UTF-8, so the codec layer is thin (mostly the standard `unicode/utf8` package). Cluster-aware width and slicing delegate to github.com/rivo/uniseg, which implements UAX #29 (text segmentation) and the wcwidth-style monospace cell-width tables — handling ZWJ clusters (family 👨‍👩‍👧‍👦, rainbow flag 🏳️‍🌈), VS16 emoji-presentation promotion, regional indicators (🇩🇪), skin-tone modifiers (👋🏼), and combining marks without per-codepoint special cases.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func ANSIToUTF8

func ANSIToUTF8(data []byte) []byte

ANSIToUTF8 converts a CP1252 byte slice to UTF-8.

func BOMLength

func BOMLength(enc FileEncoding) int

BOMLength returns the number of leading bytes the encoding's BOM occupies, or 0 if none.

func CStrDisplayWidth

func CStrDisplayWidth(s string) int

CStrDisplayWidth is like StringDisplayWidth but skips '~' hotkey markers that don't render as glyphs. '~' is single-byte ASCII and never appears inside an emoji cluster, so a simple byte strip is safe.

func CharLen

func CharLen(b byte) int

CharLen returns the expected total length in bytes of a UTF-8 sequence whose lead byte is b. Returns 1 for ASCII and 0 for invalid lead.

func ConvertToUTF8

func ConvertToUTF8(data []byte, enc FileEncoding) []byte

ConvertToUTF8 strips any BOM and returns UTF-8 bytes for the given encoding. Pure UTF-8 inputs are returned as-is (BOM removed).

func CopyDisplayCells

func CopyDisplayCells(s string, startCol, maxWidth int) string

CopyDisplayCells returns the substring of s that occupies cells [startCol, startCol+maxWidth) in the rendered terminal. Whole grapheme clusters are kept together; if a wide cluster would straddle the end it is omitted entirely.

func DecodeRune

func DecodeRune(buf []byte) (r rune, n int)

DecodeRune decodes one rune from buf and returns it together with the byte count it consumed. Mirrors DecodeUTF8CodePoint.

func IsTrailByte

func IsTrailByte(b byte) bool

IsTrailByte reports whether b is a UTF-8 continuation byte (10xxxxxx).

func StringDisplayWidth

func StringDisplayWidth(s string) int

StringDisplayWidth returns the number of terminal cells s occupies. Grapheme clusters (ZWJ sequences, regional indicators, skin-tone modifiers, combining marks) count by the cluster's monospace width via UAX #11 / wcwidth.

func UTF16BEToUTF8

func UTF16BEToUTF8(data []byte, skipBOM bool) []byte

UTF16BEToUTF8 converts a UTF-16 big-endian byte slice to UTF-8.

func UTF16LEToUTF8

func UTF16LEToUTF8(data []byte, skipBOM bool) []byte

UTF16LEToUTF8 converts a UTF-16 little-endian byte slice to UTF-8. If skipBOM is true, a leading FF FE is removed.

Types

type FileEncoding

type FileEncoding int

FileEncoding identifies the encoding DetectEncoding inferred for a byte slice. Mirrors TFileEncoding in FVUTF8.pas. Saved on Editor / FileEditor so SaveFile can round-trip the original BOM.

const (
	EncUnknown FileEncoding = iota // sniff failed; treat as binary or ANSI
	EncUTF8                        // plain UTF-8, no BOM
	EncUTF8BOM                     // UTF-8 with leading EF BB BF
	EncUTF16LE                     // UTF-16 little-endian (FF FE BOM optional)
	EncUTF16BE                     // UTF-16 big-endian (FE FF BOM optional)
	EncANSI                        // CP1252 fallback for legacy 8-bit files
)

func DetectEncoding

func DetectEncoding(data []byte) FileEncoding

DetectEncoding examines up to len(data) bytes and returns the most likely encoding. Mirrors DetectEncoding in FVUTF8.pas.

Example

Sniff the encoding of a few byte slices. DetectEncoding inspects the leading bytes for a BOM, then falls back to UTF-8 validation, then ANSI/CP1252.

package main

import (
	"fmt"

	"github.com/oldwired/fv-go/pkg/fv/utf8"
)

func main() {
	plain := []byte("hello, world")
	bom := []byte{0xEF, 0xBB, 0xBF, 'h', 'i'}
	utf16le := []byte{0xFF, 0xFE, 'h', 0, 'i', 0}

	fmt.Println(utf8.DetectEncoding(plain) == utf8.EncUTF8)
	fmt.Println(utf8.DetectEncoding(bom) == utf8.EncUTF8BOM)
	fmt.Println(utf8.DetectEncoding(utf16le) == utf8.EncUTF16LE)
}
Output:
true
true
true

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL