Documentation

Overview

    Package unicode provides Unicode encodings such as UTF-16.

    Constants

    This section is empty.

    Variables

      All lists a configuration for each IANA-defined UTF-16 variant.

      var ErrMissingBOM = errors.New("encoding: missing byte order mark")

        ErrMissingBOM means that decoding UTF-16 input with ExpectBOM did not find a starting byte order mark.

        var UTF8 encoding.Encoding = utf8enc

          UTF8 is the UTF-8 encoding. It neither removes nor adds byte order marks.

          var UTF8BOM encoding.Encoding = utf8bomEncoding{}

            UTF8BOM is a UTF-8 encoding where the decoder strips a leading byte order mark while the encoder adds one.

            Some editors add a byte order mark as a signature to UTF-8 files. Although the byte order mark is not useful for detecting byte order in UTF-8, it is sometimes used as a convention to mark UTF-8-encoded files. This relies on the observation that the UTF-8 byte order mark is either an illegal or at least very unlikely sequence in any other character encoding.

            Functions

            func BOMOverride

            func BOMOverride(fallback transform.Transformer) transform.Transformer

              BOMOverride returns a new decoder transformer that is identical to fallback, except that the presence of a Byte Order Mark at the start of the input causes it to switch to the corresponding Unicode decoding. It will only consider BOMs for UTF-8, UTF-16BE, and UTF-16LE.

              This differs from using ExpectBOM by allowing a BOM to switch to UTF-8, not just UTF-16 variants, and allowing falling back to any encoding scheme.

              This technique is recommended by the W3C for use in HTML 5: "For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else." http://www.w3.org/TR/encoding/#specification-hooks

              BOMOverride is mostly intended for use cases where the first characters of the fallback encoding are known not to be a BOM, which holds for valid HTML and for most encodings.
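The BOM check that BOMOverride performs can be pictured with a small sketch that uses only the standard library. It is not the package's implementation: sniffBOM is a hypothetical helper that reports which of the three considered encodings a leading BOM selects, and a caller would use its fallback decoder when no BOM is found.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffBOM (hypothetical helper) reports which Unicode encoding a leading
// BOM selects and how many bytes the BOM occupies. It returns "" and 0
// when no BOM is present, in which case a caller would use its fallback.
func sniffBOM(b []byte) (name string, bomLen int) {
	switch {
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8", 3
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		return "UTF-16BE", 2
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}):
		return "UTF-16LE", 2
	}
	return "", 0
}

func main() {
	inputs := [][]byte{
		{0xEF, 0xBB, 0xBF, 'a'},
		{0xFE, 0xFF, 0x00, 'a'},
		{0xFF, 0xFE, 'a', 0x00},
		[]byte("<html>"), // no BOM: the fallback encoding applies
	}
	for _, in := range inputs {
		name, n := sniffBOM(in)
		if name == "" {
			name = "fallback"
		}
		fmt.Printf("%-8s bomLen=%d\n", name, n)
	}
}
```

Because valid HTML starts with ASCII markup, its first bytes never match any of the three prefixes, which is why the override is safe for that use case.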

              func UTF16

              func UTF16(e Endianness, b BOMPolicy) encoding.Encoding

                UTF16 returns a UTF-16 Encoding for the given default endianness and byte order mark (BOM) policy.

                When decoding from UTF-16 to UTF-8, if the BOMPolicy is IgnoreBOM then neither BOMs U+FEFF nor noncharacters U+FFFE in the input stream will affect the endianness used for decoding, and will instead be output as their standard UTF-8 encodings: "\xef\xbb\xbf" and "\xef\xbf\xbe". If the BOMPolicy is UseBOM or ExpectBOM, a starting BOM is not written to the UTF-8 output. Instead, it overrides the default endianness e for the remainder of the transformation. Any subsequent BOMs U+FEFF or noncharacters U+FFFE will not affect the endianness used, and will instead be output as their standard UTF-8 encodings. For UseBOM, if there is no starting BOM, it will proceed with the default endianness. For ExpectBOM, in that case, the transformation will return early with an ErrMissingBOM error.

                When encoding from UTF-8 to UTF-16, a BOM will be inserted at the start of the output if the BOMPolicy is UseBOM or ExpectBOM. Otherwise, a BOM will not be inserted. The UTF-8 input does not need to contain a BOM.

                There is no concept of a 'native' endianness. If the UTF-16 data is produced and consumed in a greater context that implies a certain endianness, use IgnoreBOM. Otherwise, use ExpectBOM and always produce and consume a BOM.

                In the language of https://www.unicode.org/faq/utf_bom.html#bom10, IgnoreBOM corresponds to "Where the precise type of the data stream is known... the BOM should not be used" and ExpectBOM corresponds to "A particular protocol... may require use of the BOM".

                Types

                type BOMPolicy

                type BOMPolicy uint8

                  BOMPolicy is a UTF-16 encoding's byte order mark policy.

                  const (
                  
                  	// IgnoreBOM means to ignore any byte order marks.
                  	IgnoreBOM BOMPolicy = 0
                  
                  	// UseBOM means that the UTF-16 form may start with a byte order mark, which
                  	// will be used to override the default encoding.
                  	UseBOM BOMPolicy = writeBOM | acceptBOM
                  
                  	// ExpectBOM means that the UTF-16 form must start with a byte order mark,
                  	// which will be used to override the default encoding.
                  	ExpectBOM BOMPolicy = writeBOM | acceptBOM | requireBOM
                  )

                  type Endianness

                  type Endianness bool

                    Endianness is a UTF-16 encoding's default endianness.

                    const (
                    	// BigEndian is UTF-16BE.
                    	BigEndian Endianness = false
                    	// LittleEndian is UTF-16LE.
                    	LittleEndian Endianness = true
                    )

                    Directories

                    Path Synopsis
                    Package utf32 provides the UTF-32 Unicode encoding.