scsu

Version: v0.0.0-...-84ac880 | Published: Jan 6, 2022 | License: MIT | Imports: 7 | Imported by: 4

Repository

github.com/dop251/scsu

README ¶

SCSU

A Standard Compression Scheme for Unicode implementation in Go.

This is an implementation of SCSU as described in https://www.unicode.org/reports/tr6/tr6-4.html

Although UTF-8 is now the most commonly used and recommended encoding, SCSU can still be beneficial in some cases, for example when storing or transmitting short texts in alphabetic scripts (Arabic, Hebrew, Russian, etc.). General-purpose compression algorithms are inefficient on such short inputs, whereas SCSU produces output nearly 50% smaller than UTF-8.
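The size difference can be illustrated with a back-of-the-envelope estimate. The sketch below does not call this package: estimateSingleWindowSize is a hypothetical helper that assumes all letters fall into a single 128-character SCSU window, so each rune costs one byte plus one byte for the window-selection tag. Real encoder output may differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateSingleWindowSize returns a rough SCSU size for text whose letters
// all fall into one 128-character window: one byte per rune plus one byte
// for the window-selection tag. This is an idealised estimate, not the
// output of a real encoder.
func estimateSingleWindowSize(s string) int {
	return utf8.RuneCountInString(s) + 1
}

func main() {
	s := "Привет" // six Cyrillic letters, two bytes each in UTF-8
	fmt.Println("UTF-8 bytes:", len(s))
	fmt.Println("estimated SCSU bytes:", estimateSingleWindowSize(s))
}
```

For very short strings the one-byte tag overhead is noticeable; as the text grows, the estimate approaches half the UTF-8 size.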

The code is based on the sample Java implementation found at ftp://ftp.unicode.org/Public/PROGRAMS/SCSU/. The encoding algorithm has been slightly modified, however, because that implementation contains a few bugs.

The code has been fuzz-tested using https://github.com/dvyukov/go-fuzz to ensure that random input crashes neither the Encoder nor the Decoder, and that if the input happens to be valid UTF-8, an Encode/Decode cycle produces identical output.

Usage Scenarios.

Encode a string into a []byte:

b, err := scsu.Encode(s, nil) // the second argument can be an existing slice to append to

Decode a []byte into a string:

s, err := scsu.Decode(b)

Use a Writer:

writer := scsu.NewWriter(outWriter)
n, err := writer.WriteString(s)
n, err = writer.WriteRune(r)
n, err = writer.WriteRunes(runeSource)

Use an Encoder:

var encoder scsu.Encoder // the zero value is ready to use
buf, err := encoder.Encode(runeSource, buf) // assuming buf has enough capacity this does zero allocs
// the encoder can then be re-used

Use a Reader:

reader := scsu.NewReader(byteReader)
s, err := reader.ReadString() // read the entire string
r, size, err := reader.ReadRune() // or a single rune

Documentation ¶

Index ¶

Constants
Variables
func Decode(b []byte) (string, error)
func Encode(src string, dst []byte) ([]byte, error)
func EncodeStrict(src string, dst []byte) ([]byte, error)
func FindFirstEncodable(src string) int
type Encoder
- func (e *Encoder) Encode(src RuneSource, dst []byte) ([]byte, error)
type Reader
- func NewReader(r io.ByteReader) *Reader
- func (r *Reader) ReadRune() (rune, int, error)
- func (r *Reader) ReadString() (string, error)
- func (r *Reader) ReadStringSizeHint(sizeHint int) (string, error)
- func (r *Reader) Reset(rd io.ByteReader)
type RuneSlice
- func (s RuneSlice) RuneAt(pos int) (rune, int, error)
type RuneSource
type SingleRuneSource
- func (r SingleRuneSource) RuneAt(pos int) (rune, int, error)
type StrictStringRuneSource
- func (s StrictStringRuneSource) RuneAt(pos int) (rune, int, error)
type StringRuneSource
- func (s StringRuneSource) RuneAt(pos int) (rune, int, error)
type Writer
- func NewWriter(wr io.Writer) *Writer
- func (w *Writer) Reset(out io.Writer)
- func (w *Writer) WriteRune(r rune) (int, error)
- func (w *Writer) WriteRunes(src RuneSource) (int, error)
- func (w *Writer) WriteString(in string) (int, error)

Examples ¶

FindFirstEncodable

Constants ¶

const (
	SQ0 = 0x01 // Quote from window pair 0
	SQ1 = 0x02 // Quote from window pair 1
	SQ2 = 0x03 // Quote from window pair 2
	SQ3 = 0x04 // Quote from window pair 3
	SQ4 = 0x05 // Quote from window pair 4
	SQ5 = 0x06 // Quote from window pair 5
	SQ6 = 0x07 // Quote from window pair 6
	SQ7 = 0x08 // Quote from window pair 7

	SDX = 0x0B // Define a window as extended
	Srs = 0x0C // reserved

	SQU = 0x0E // Quote a single Unicode character
	SCU = 0x0F // Change to Unicode mode

	// SCn: change to window n.
	// If the following bytes are less than 0x80, interpret them
	// as command bytes or pass them through, else add the offset
	// for dynamic window n.
	SC0 = 0x10 // Select window 0
	SC1 = 0x11 // Select window 1
	SC2 = 0x12 // Select window 2
	SC3 = 0x13 // Select window 3
	SC4 = 0x14 // Select window 4
	SC5 = 0x15 // Select window 5
	SC6 = 0x16 // Select window 6
	SC7 = 0x17 // Select window 7
	SD0 = 0x18 // Define and select window 0
	SD1 = 0x19 // Define and select window 1
	SD2 = 0x1A // Define and select window 2
	SD3 = 0x1B // Define and select window 3
	SD4 = 0x1C // Define and select window 4
	SD5 = 0x1D // Define and select window 5
	SD6 = 0x1E // Define and select window 6
	SD7 = 0x1F // Define and select window 7

	UC0 = 0xE0 // Select window 0
	UC1 = 0xE1 // Select window 1
	UC2 = 0xE2 // Select window 2
	UC3 = 0xE3 // Select window 3
	UC4 = 0xE4 // Select window 4
	UC5 = 0xE5 // Select window 5
	UC6 = 0xE6 // Select window 6
	UC7 = 0xE7 // Select window 7
	UD0 = 0xE8 // Define and select window 0
	UD1 = 0xE9 // Define and select window 1
	UD2 = 0xEA // Define and select window 2
	UD3 = 0xEB // Define and select window 3
	UD4 = 0xEC // Define and select window 4
	UD5 = 0xED // Define and select window 5
	UD6 = 0xEE // Define and select window 6
	UD7 = 0xEF // Define and select window 7

	UQU = 0xF0 // Quote a single Unicode character
	UDX = 0xF1 // Define a Window as extended
	Urs = 0xF2 // reserved

)
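To show how the SCn tags and dynamic windows interact, here is a minimal sketch of a decoder for a tiny subset of single-byte mode. It is not this package's decoder: decodeSimple, defaultWindows, sc0 and sc7 are names invented for the illustration, the window offsets are the initial dynamic window positions from the SCSU specification, and every tag other than SC0..SC7 is rejected.

```go
package main

import "fmt"

// Initial dynamic window offsets as given in the SCSU specification.
var defaultWindows = [8]rune{0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00}

const (
	sc0 = 0x10 // same value as SC0 above
	sc7 = 0x17 // same value as SC7 above
)

// decodeSimple handles only a small subset of single-byte mode: SCn window
// selection, pass-through bytes below 0x80, and bytes >= 0x80 mapped through
// the active dynamic window. Any other tag byte is reported as an error.
func decodeSimple(b []byte) (string, error) {
	active := 0 // window 0 is active initially
	out := make([]rune, 0, len(b))
	for _, c := range b {
		switch {
		case c >= sc0 && c <= sc7:
			active = int(c - sc0)
		case c >= 0x80:
			out = append(out, defaultWindows[active]+rune(c-0x80))
		case c >= 0x20 || c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0D:
			out = append(out, rune(c)) // pass-through
		default:
			return "", fmt.Errorf("unsupported tag byte 0x%02X", c)
		}
	}
	return string(out), nil
}

func main() {
	// SC2 selects the Cyrillic window (offset 0x0400); each following byte
	// >= 0x80 maps to 0x0400 + (byte - 0x80).
	s, err := decodeSimple([]byte{0x12, 0x9F, 0xC0, 0xB8, 0xB2, 0xB5, 0xC2})
	fmt.Println(s, err) // prints "Привет <nil>"
}
```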

Variables ¶

var (
	ErrIllegalInput = errors.New("illegal input")
)

var (
	ErrInvalidUTF8 = errors.New("invalid UTF-8")
)

Functions ¶

func Decode ¶

func Decode(b []byte) (string, error)

Decode decodes an SCSU-encoded byte slice into a string.

func Encode ¶

func Encode(src string, dst []byte) ([]byte, error)

Encode src and append to dst. If dst does not have enough capacity it will be re-allocated. It can be nil.

func EncodeStrict ¶

func EncodeStrict(src string, dst []byte) ([]byte, error)

EncodeStrict is the same as Encode, however it stops and returns ErrInvalidUTF8 if an invalid UTF-8 sequence is encountered rather than replacing it with utf8.RuneError.
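The difference between the lenient and strict behaviour can be sketched with the standard library alone. The code below does not use this package: runesLenient, runesStrict and errInvalidUTF8 are hypothetical stand-ins that only mirror the documented contract (replace invalid sequences with utf8.RuneError versus stop with an error).

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a local stand-in for the package's ErrInvalidUTF8.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// runesLenient replaces every invalid byte with utf8.RuneError.
func runesLenient(s string) []rune {
	out := make([]rune, 0, len(s))
	for _, r := range s { // range yields utf8.RuneError for invalid bytes
		out = append(out, r)
	}
	return out
}

// runesStrict stops at the first invalid sequence. A genuine U+FFFD in the
// input is three bytes wide, so size == 1 identifies a decoding failure.
func runesStrict(s string) ([]rune, error) {
	out := make([]rune, 0, len(s))
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return nil, errInvalidUTF8
		}
		out = append(out, r)
		i += size
	}
	return out, nil
}

func main() {
	bad := "a\xffb"
	fmt.Printf("%q\n", runesLenient(bad))
	_, err := runesStrict(bad)
	fmt.Println(err)
}
```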

func FindFirstEncodable ¶

func FindFirstEncodable(src string) int

FindFirstEncodable returns the position of the first byte that is not pass-through. Returns -1 if the entire string is pass-through (i.e. encoding it would return the string unchanged).

Example ¶

encodeOrPassthrough := func(s string) ([]byte, error) {
	pos := FindFirstEncodable(s)
	if pos >= 0 {
		buf := make([]byte, pos, len(s))
		// First, copy the pass-through part
		copy(buf, s)
		// ... then append the encoded tail.
		buf, err := Encode(s[pos:], buf)
		if err != nil {
			return nil, err
		}
		return buf, nil
	}
	// The string only contains pass-through characters, save buffer allocation.
	// The caller can check if the returned []byte is nil and use the original string instead.
	return nil, nil
}
fmt.Println(encodeOrPassthrough("Sample ASCII"))
fmt.Println(encodeOrPassthrough("Sample Unicode 😀"))

Output:

[] <nil>
[83 97 109 112 108 101 32 85 110 105 99 111 100 101 32 11 97 236 128] <nil>
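To make "pass-through" concrete, the following sketch scans for the first byte that an SCSU encoder in its initial single-byte mode could not copy verbatim. It assumes, based on the SCSU specification rather than on this package's source, that the pass-through bytes are 0x20..0x7F plus NUL, TAB, LF and CR. firstNonPassThrough is a hypothetical name and may not match FindFirstEncodable in every case.

```go
package main

import "fmt"

// firstNonPassThrough returns the index of the first byte that is not
// pass-through under the assumption stated above, or -1 if there is none.
func firstNonPassThrough(s string) int {
	for i := 0; i < len(s); i++ {
		c := s[i]
		isControl := c < 0x20 && c != 0x00 && c != 0x09 && c != 0x0A && c != 0x0D
		if c >= 0x80 || isControl {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(firstNonPassThrough("Sample ASCII"))      // -1
	fmt.Println(firstNonPassThrough("Sample Unicode 😀")) // 15
}
```

The second result agrees with the example output above, where the first 15 bytes are copied unchanged before the encoded tail begins.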

Types ¶

type Encoder ¶

type Encoder struct {
	// contains filtered or unexported fields
}

Encoder can be used to encode a string into []byte. Zero value is ready to use.

func (*Encoder) Encode ¶

func (e *Encoder) Encode(src RuneSource, dst []byte) ([]byte, error)

Encode encodes the given RuneSource and appends the result to dst. If dst does not have enough capacity it will be re-allocated. It can be nil. Not goroutine-safe. The instance can be re-used afterwards.
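The append-to-dst convention allows one buffer to be re-used across calls. The sketch below shows the general pattern with a hypothetical stand-in function, appendDoubled, instead of the real encoder; truncating with buf[:0] keeps the capacity, so no new allocation is needed as long as the output fits.

```go
package main

import "fmt"

// appendDoubled is a hypothetical stand-in for an append-style encoder: it
// appends its output to dst and returns the (possibly re-allocated) slice.
func appendDoubled(dst []byte, s string) []byte {
	for i := 0; i < len(s); i++ {
		dst = append(dst, s[i], s[i])
	}
	return dst
}

func main() {
	buf := make([]byte, 0, 64) // allocate once
	for _, s := range []string{"ab", "cd"} {
		buf = appendDoubled(buf[:0], s) // truncate the length, keep the capacity
		fmt.Printf("%s cap=%d\n", buf, cap(buf))
	}
}
```

Passing buf without truncation would instead append the new output after the previous one, which is what the dst parameter is documented to do.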

type Reader ¶

type Reader struct {
	// contains filtered or unexported fields
}

func NewReader ¶

func NewReader(r io.ByteReader) *Reader

func (*Reader) ReadRune ¶

func (r *Reader) ReadRune() (rune, int, error)

ReadRune reads a single SCSU-encoded Unicode character and returns the rune and the number of bytes consumed. If no character is available, err will be set.

func (*Reader) ReadString ¶

func (r *Reader) ReadString() (string, error)

ReadString reads all available input as a string. It keeps reading the source reader until it returns io.EOF or an error occurs. In case of io.EOF the error returned by ReadString will be nil.

func (*Reader) ReadStringSizeHint ¶

func (r *Reader) ReadStringSizeHint(sizeHint int) (string, error)

ReadStringSizeHint is like ReadString, but takes a hint about the expected string size. Note this is the size of the UTF-8 encoded string in bytes.

func (*Reader) Reset ¶

func (r *Reader) Reset(rd io.ByteReader)

type RuneSlice ¶

type RuneSlice []rune

RuneSlice is a RuneSource backed by []rune.

func (RuneSlice) RuneAt ¶

func (s RuneSlice) RuneAt(pos int) (rune, int, error)

type RuneSource ¶

type RuneSource interface {
	RuneAt(pos int) (r rune, nextPos int, err error)
}

A RuneSource represents a sequence of runes with look-behind support.

RuneAt returns the rune at the given position. Positions start at zero and are not guaranteed to be sequential, so the only valid arguments are 0 or a value previously returned as nextPos; supplying anything else results in unspecified behaviour. It returns io.EOF when there are no more runes left. If a rune was read, err must be nil (i.e. a (rune, io.EOF) combination is not possible).

type SingleRuneSource ¶

type SingleRuneSource rune

SingleRuneSource is a RuneSource that contains a single rune.

func (SingleRuneSource) RuneAt ¶

func (r SingleRuneSource) RuneAt(pos int) (rune, int, error)

type StrictStringRuneSource ¶

type StrictStringRuneSource string

StrictStringRuneSource does not tolerate invalid UTF-8 sequences.

func (StrictStringRuneSource) RuneAt ¶

func (s StrictStringRuneSource) RuneAt(pos int) (rune, int, error)

type StringRuneSource ¶

type StringRuneSource string

StringRuneSource represents a UTF-8 string. Invalid sequences are replaced with utf8.RuneError.

func (StringRuneSource) RuneAt ¶

func (s StringRuneSource) RuneAt(pos int) (rune, int, error)

type Writer ¶

type Writer struct {
	// contains filtered or unexported fields
}

func NewWriter ¶

func NewWriter(wr io.Writer) *Writer

func (*Writer) Reset ¶

func (w *Writer) Reset(out io.Writer)

Reset discards the writer's state and makes it equivalent to the result of NewWriter called with out, allowing the instance to be re-used.

func (*Writer) WriteRune ¶

func (w *Writer) WriteRune(r rune) (int, error)

WriteRune encodes the given rune and writes the binary representation into the writer. Returns the number of bytes written and an error (if any).

func (*Writer) WriteRunes ¶

func (w *Writer) WriteRunes(src RuneSource) (int, error)

func (*Writer) WriteString ¶

func (w *Writer) WriteString(in string) (int, error)

WriteString encodes the given string and writes the binary representation into the writer. Invalid UTF-8 sequences are replaced with utf8.RuneError. Returns the number of bytes written and an error (if any).
