scsu

Version: v0.0.0-...-84ac880
Published: Jan 6, 2022 License: MIT Imports: 7 Imported by: 4

README

SCSU

A Standard Compression Scheme for Unicode implementation in Go.

This is an implementation of SCSU as described in https://www.unicode.org/reports/tr6/tr6-4.html

Although UTF-8 is now the most commonly used and recommended encoding, in some cases the use of SCSU can be beneficial. For example, when storing or transmitting short texts in alphabetic scripts (Arabic, Hebrew, Russian, etc.), general-purpose compression algorithms are inefficient, while SCSU produces output nearly 50% smaller than UTF-8.
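To make the size claim concrete, here is a minimal sketch of the windowing idea. It is not this package's encoder: encodeOneWindow and defaultWindows are hypothetical names, and the sketch handles only printable ASCII plus characters inside the default dynamic windows listed in the SCSU specification.

```go
package main

import (
	"errors"
	"fmt"
)

// Default start offsets of the eight dynamic windows, per the SCSU specification.
var defaultWindows = [8]rune{0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00}

// encodeOneWindow is a hypothetical, deliberately limited encoder. It never
// redefines a window and never leaves single-byte mode.
func encodeOneWindow(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	active := 0 // dynamic window 0 is active in the initial state
	for _, r := range s {
		if r >= 0x20 && r < 0x80 {
			out = append(out, byte(r)) // printable ASCII passes through unchanged
			continue
		}
		w := -1
		for i, base := range defaultWindows {
			if r >= base && r < base+0x80 {
				w = i
				break
			}
		}
		if w < 0 {
			return nil, errors.New("character outside the default windows")
		}
		if w != active {
			out = append(out, 0x10+byte(w)) // SCn tag: select window n
			active = w
		}
		out = append(out, 0x80+byte(r-defaultWindows[w]))
	}
	return out, nil
}

func main() {
	s := "Москва"
	b, err := encodeOneWindow(s)
	if err != nil {
		panic(err)
	}
	fmt.Printf("UTF-8: %d bytes, SCSU: %d bytes (% X)\n", len(s), len(b), b)
}
```

For the six Cyrillic letters above, UTF-8 needs 12 bytes, while the sketch emits one SC2 tag followed by six single bytes, 7 bytes in total.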

The code is based on the sample Java implementation found at ftp://ftp.unicode.org/Public/PROGRAMS/SCSU/; however, the encoding algorithm has been slightly modified because that implementation contains a few bugs.

The code has been fuzz-tested using https://github.com/dvyukov/go-fuzz to ensure that random input crashes neither the Encoder nor the Decoder and that, if the input happens to be valid UTF-8, an Encode/Decode cycle reproduces it exactly.

Usage Scenarios

Encode a string into a []byte:

b, err := scsu.Encode(s, nil) // the second argument can be an existing slice to which the output is appended

Decode a []byte into a string:

s, err := scsu.Decode(b)

Use a Writer:

writer := scsu.NewWriter(outWriter)
n, err := writer.WriteString(s)
n, err = writer.WriteRune(r)
n, err = writer.WriteRunes(runeSource)

Use an Encoder:

var encoder scsu.Encoder // the zero value is ready to use
buf, err := encoder.Encode(runeSource, buf) // assuming buf has enough capacity, this does zero allocations
// the encoder can then be re-used

Use a Reader:

reader := scsu.NewReader(byteReader)
s, err := reader.ReadString() // read the entire string
r, size, err := reader.ReadRune() // or a single rune

Documentation

Constants

const (
	SQ0 = 0x01 // Quote from window pair 0
	SQ1 = 0x02 // Quote from window pair 1
	SQ2 = 0x03 // Quote from window pair 2
	SQ3 = 0x04 // Quote from window pair 3
	SQ4 = 0x05 // Quote from window pair 4
	SQ5 = 0x06 // Quote from window pair 5
	SQ6 = 0x07 // Quote from window pair 6
	SQ7 = 0x08 // Quote from window pair 7

	SDX = 0x0B // Define a window as extended
	Srs = 0x0C // reserved

	SQU = 0x0E // Quote a single Unicode character
	SCU = 0x0F // Change to Unicode mode

	// SCn: change to window n.
	// If the following bytes are less than 0x80, interpret them
	// as command bytes or pass them through; otherwise add the offset
	// for dynamic window n.
	SC0 = 0x10 // Select window 0
	SC1 = 0x11 // Select window 1
	SC2 = 0x12 // Select window 2
	SC3 = 0x13 // Select window 3
	SC4 = 0x14 // Select window 4
	SC5 = 0x15 // Select window 5
	SC6 = 0x16 // Select window 6
	SC7 = 0x17 // Select window 7
	SD0 = 0x18 // Define and select window 0
	SD1 = 0x19 // Define and select window 1
	SD2 = 0x1A // Define and select window 2
	SD3 = 0x1B // Define and select window 3
	SD4 = 0x1C // Define and select window 4
	SD5 = 0x1D // Define and select window 5
	SD6 = 0x1E // Define and select window 6
	SD7 = 0x1F // Define and select window 7

	UC0 = 0xE0 // Select window 0
	UC1 = 0xE1 // Select window 1
	UC2 = 0xE2 // Select window 2
	UC3 = 0xE3 // Select window 3
	UC4 = 0xE4 // Select window 4
	UC5 = 0xE5 // Select window 5
	UC6 = 0xE6 // Select window 6
	UC7 = 0xE7 // Select window 7
	UD0 = 0xE8 // Define and select window 0
	UD1 = 0xE9 // Define and select window 1
	UD2 = 0xEA // Define and select window 2
	UD3 = 0xEB // Define and select window 3
	UD4 = 0xEC // Define and select window 4
	UD5 = 0xED // Define and select window 5
	UD6 = 0xEE // Define and select window 6
	UD7 = 0xEF // Define and select window 7

	UQU = 0xF0 // Quote a single Unicode character
	UDX = 0xF1 // Define a Window as extended
	Urs = 0xF2 // reserved

)
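As an illustration of how the single-byte mode tags above are interpreted, the following sketch decodes a small subset of SCSU: pass-through bytes, SCn (select window) and SQn (quote from window pair). It is not this package's decoder. decodeSingleByteSubset is a hypothetical name, the tag values are copied from the table above, and the window offsets are the defaults from the SCSU specification. Window definitions (SDn, SDX) and Unicode mode are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Tag values copied from the constant table above.
const (
	SQ0 = 0x01
	SQ7 = 0x08
	SC0 = 0x10
	SC7 = 0x17
)

// Default window offsets from the SCSU specification.
var (
	staticOffsets  = [8]rune{0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000}
	dynamicOffsets = [8]rune{0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00}
)

// decodeSingleByteSubset is a hypothetical decoder for a subset of single-byte mode.
func decodeSingleByteSubset(in []byte) (string, error) {
	var out []rune
	active := 0 // dynamic window 0 is active initially
	for i := 0; i < len(in); i++ {
		b := in[i]
		switch {
		case b >= 0x80:
			// High bytes are offsets into the active dynamic window.
			out = append(out, dynamicOffsets[active]+rune(b-0x80))
		case b >= SC0 && b <= SC7:
			active = int(b - SC0)
		case b >= SQ0 && b <= SQ7:
			// Quote one character without changing the active window.
			i++
			if i >= len(in) {
				return "", errors.New("truncated SQn sequence")
			}
			q := in[i]
			if q < 0x80 {
				out = append(out, staticOffsets[b-SQ0]+rune(q))
			} else {
				out = append(out, dynamicOffsets[b-SQ0]+rune(q-0x80))
			}
		case b >= 0x20 || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D:
			out = append(out, rune(b)) // pass-through byte
		default:
			return "", fmt.Errorf("tag 0x%02X is not handled by this sketch", b)
		}
	}
	return string(out), nil
}

func main() {
	s, err := decodeSingleByteSubset([]byte{0x12, 0x9C, 0xBE, 0xC1, 0xBA, 0xB2, 0xB0})
	fmt.Println(s, err)
}
```

The input in main starts with SC2, which selects the default Cyrillic window at U+0400, so each following byte maps to one Cyrillic letter.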

Variables

var (
	ErrIllegalInput = errors.New("illegal input")
)

var (
	ErrInvalidUTF8 = errors.New("invalid UTF-8")
)

Functions

func Decode

func Decode(b []byte) (string, error)

Decode decodes an SCSU-encoded byte slice into a string.

func Encode

func Encode(src string, dst []byte) ([]byte, error)

Encode encodes src and appends the result to dst. If dst does not have enough capacity, it is re-allocated. dst can be nil.

func EncodeStrict

func EncodeStrict(src string, dst []byte) ([]byte, error)

EncodeStrict is the same as Encode, however it stops and returns ErrInvalidUTF8 if an invalid UTF-8 sequence is encountered rather than replacing it with utf8.RuneError.
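The difference between the two modes can be sketched with the standard library alone. The helpers below (runesLenient, runesStrict, and the local errInvalidUTF8) are hypothetical stand-ins, not this package's code. How the package detects invalid input internally is an assumption here; the sketch only shows the usual Go idiom, in which a decoded utf8.RuneError with size 1 marks an invalid byte.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a local stand-in for the package's ErrInvalidUTF8.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// runesLenient replaces every invalid byte with utf8.RuneError (U+FFFD).
func runesLenient(s string) []rune {
	return []rune(s) // the conversion itself performs the replacement
}

// runesStrict stops at the first invalid sequence instead of replacing it.
func runesStrict(s string) ([]rune, error) {
	out := make([]rune, 0, len(s))
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// A genuine U+FFFD in the input decodes with size 3, so size 1
		// distinguishes an invalid byte from a real replacement character.
		if r == utf8.RuneError && size == 1 {
			return nil, errInvalidUTF8
		}
		out = append(out, r)
		i += size
	}
	return out, nil
}

func main() {
	bad := "a\xffb"
	fmt.Printf("%U\n", runesLenient(bad))
	_, err := runesStrict(bad)
	fmt.Println(err)
}
```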

func FindFirstEncodable

func FindFirstEncodable(src string) int

FindFirstEncodable returns the position of the first byte that is not pass-through. Returns -1 if the entire string is pass-through (i.e. encoding it would return the string unchanged).

Example
encodeOrPassthrough := func(s string) ([]byte, error) {
	pos := FindFirstEncodable(s)
	if pos >= 0 {
		buf := make([]byte, pos, len(s))
		// First, copy the pass-through part
		copy(buf, s)
		// ... then append the encoded tail.
		buf, err := Encode(s[pos:], buf)
		if err != nil {
			return nil, err
		}
		return buf, nil
	}
	// The string only contains pass-through characters, save buffer allocation.
	// The caller can check if the returned []byte is nil and use the original string instead.
	return nil, nil
}
fmt.Println(encodeOrPassthrough("Sample ASCII"))
fmt.Println(encodeOrPassthrough("Sample Unicode 😀"))
Output:

[] <nil>
[83 97 109 112 108 101 32 85 110 105 99 111 100 101 32 11 97 236 128] <nil>
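The last four bytes of the second output line (11 97 236 128) can be read by hand. Byte 11 (0x0B) is the SDX tag, which defines an extended window: per the SCSU specification, the top three bits of the next two bytes give the window number and the remaining 13 bits give the offset in units of 0x80 above U+10000. The sketch below works through that arithmetic; decodeSDX is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// decodeSDX interprets the two argument bytes that follow an SDX tag.
func decodeSDX(hi, lo byte) (window int, offset rune) {
	window = int(hi >> 5)                              // top 3 bits
	offset = 0x10000 + 0x80*(rune(hi&0x1F)<<8|rune(lo)) // low 13 bits, scaled
	return window, offset
}

func main() {
	tail := []byte{11, 97, 236, 128} // SDX, two argument bytes, one data byte
	w, off := decodeSDX(tail[1], tail[2])
	r := off + rune(tail[3]-0x80) // data bytes >= 0x80 index into the new window
	fmt.Printf("window %d at U+%X, decoded %q (U+%X)\n", w, off, r, r)
}
```

Here the argument bytes 0x61 0xEC select window 3 and place it at U+1F600, so the final byte 0x80 decodes to U+1F600, the emoji in the input string.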

Types

type Encoder

type Encoder struct {
	// contains filtered or unexported fields
}

Encoder can be used to encode a string into []byte. Zero value is ready to use.

func (*Encoder) Encode

func (e *Encoder) Encode(src RuneSource, dst []byte) ([]byte, error)

Encode encodes the given RuneSource and appends the result to dst. If dst does not have enough capacity, it is re-allocated. dst can be nil. Encode is not goroutine-safe. The instance can be re-used afterwards.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

func NewReader

func NewReader(r io.ByteReader) *Reader

func (*Reader) ReadRune

func (r *Reader) ReadRune() (rune, int, error)

ReadRune reads a single SCSU-encoded Unicode character and returns the rune and the number of bytes consumed. If no character is available, err will be set.

func (*Reader) ReadString

func (r *Reader) ReadString() (string, error)

ReadString reads all available input as a string. It keeps reading the source reader until it returns io.EOF or an error occurs. In case of io.EOF the error returned by ReadString will be nil.

func (*Reader) ReadStringSizeHint

func (r *Reader) ReadStringSizeHint(sizeHint int) (string, error)

ReadStringSizeHint is like ReadString, but takes a hint about the expected string size. Note this is the size of the UTF-8 encoded string in bytes.

func (*Reader) Reset

func (r *Reader) Reset(rd io.ByteReader)

type RuneSlice

type RuneSlice []rune

RuneSlice is a RuneSource backed by []rune.

func (RuneSlice) RuneAt

func (s RuneSlice) RuneAt(pos int) (rune, int, error)

type RuneSource

type RuneSource interface {
	RuneAt(pos int) (r rune, nextPos int, err error)
}

A RuneSource represents a sequence of runes with look-behind support.

RuneAt returns a rune at a given position. The position starts at zero and is not guaranteed to be sequential; therefore, the only valid arguments are 0 or a value previously returned as nextPos. Supplying anything else results in unspecified behaviour. Returns io.EOF when there are no more runes left. If a rune was read, err must be nil (i.e. a (rune, EOF) combination is not possible).

type SingleRuneSource

type SingleRuneSource rune

SingleRuneSource is a RuneSource that contains a single rune.

func (SingleRuneSource) RuneAt

func (r SingleRuneSource) RuneAt(pos int) (rune, int, error)

type StrictStringRuneSource

type StrictStringRuneSource string

StrictStringRuneSource does not tolerate invalid UTF-8 sequences.

func (StrictStringRuneSource) RuneAt

func (s StrictStringRuneSource) RuneAt(pos int) (rune, int, error)

type StringRuneSource

type StringRuneSource string

StringRuneSource represents a UTF-8 string. Invalid sequences are replaced with utf8.RuneError.

func (StringRuneSource) RuneAt

func (s StringRuneSource) RuneAt(pos int) (rune, int, error)

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

func NewWriter

func NewWriter(wr io.Writer) *Writer

func (*Writer) Reset

func (w *Writer) Reset(out io.Writer)

Reset discards the writer's state and makes it equivalent to the result of NewWriter called with out, allowing the instance to be re-used.

func (*Writer) WriteRune

func (w *Writer) WriteRune(r rune) (int, error)

WriteRune encodes the given rune and writes the binary representation into the writer. Returns the number of bytes written and an error (if any).

func (*Writer) WriteRunes

func (w *Writer) WriteRunes(src RuneSource) (int, error)

func (*Writer) WriteString

func (w *Writer) WriteString(in string) (int, error)

WriteString encodes the given string and writes the binary representation into the writer. Invalid UTF-8 sequences are replaced with utf8.RuneError. Returns the number of bytes written and an error (if any).
