charset

Version: v0.0.7
Published: Jan 12, 2023 License: MIT Imports: 15 Imported by: 0

README

Charset

Based on saintfish/chardet, Charset makes it convenient to detect the following charsets from a []byte and convert the content to a UTF-8 encoded []byte.

Supported charsets

  • Unicode: UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE

  • Simplified Chinese: GB2312, GBK, GB18030 (includes GB2312 and GBK)

  • Traditional Chinese: Big5

  • Japanese: EUC-JP, Shift JIS, ISO-2022-JP

  • Korean: EUC-KR

  • Russian: Windows 1251

  • Others: ISO-8859-1

Example

package main

import (
    "fmt"
    "os"

    "github.com/HeapStackTree/charset"
)

func ReadAndConvertFile(path string, charsetName string) (contentInUtf8 []byte, res *charset.Result, err error) {
    res = &charset.Result{
        Charset:     "unknown",
        Language:    "unknown",
        Confidence:  0,
        Convertible: false,
    }

    content, err := os.ReadFile(path)
    if err != nil {
        return
    }
    if charsetName == "" {
        contentInUtf8, res, err = charset.DetectAndConvertToUtf8(content)
    } else {
        contentInUtf8, err = charset.ToUtf8WithCharsetName(content, charsetName)
        if err == nil {
            res.Charset = charsetName
            res.Confidence = 100
            res.Convertible = true
        }
    }
    return
}

func main() {
    path := "tests/GB2312/_mozilla_bug171813_text.html"

    // Pass a charset name if you are sure about it; pass "" to auto-detect.
    // For encodings this package cannot convert, use GNU libiconv
    // (or libiconv for Windows) instead.
    content, res, err := ReadAndConvertFile(path, "")
    if err != nil {
        fmt.Println("error:", err)
        return
    }

    // Skip the leading ASCII part so the output starts at the first non-ASCII byte.
    var nonASCIIStart int
    for i, v := range content {
        if v >= 0x80 {
            nonASCIIStart = i
            break
        }
    }

    fmt.Printf("Path: %s\nCharset: %s\nLanguage: %s\nConfidence: %d\nConvertible: %t\nContent: %s\n", path, res.Charset, res.Language, res.Confidence, res.Convertible, content[nonASCIIStart:])
    // Output should be:
    // Path: tests/GB2312/_mozilla_bug171813_text.html
    // Charset: GB-18030
    // Language: zh
    // Confidence: 100
    // Convertible: true
    // Content: 搜狐在线</b></font></a></div> ...
}

See the godoc for the usage of other functions such as GetDecoderFromCharsetName(charsetName string), ToUtf8WithCharsetName(content []byte, charsetName string) ...

Documentation

Functions

func IsValidBig5

func IsValidBig5(content []byte) bool

Check whether content is valid under the Big5 rules. Reference: https://zh.wikipedia.org/wiki/Big5
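
To make "valid under the Big5 rules" concrete, here is a minimal sketch of a structural check. It is not the package's implementation: looksLikeBig5 is a hypothetical helper, and it assumes the commonly cited byte ranges (lead byte 0x81-0xFE, trail byte 0x40-0x7E or 0xA1-0xFE) without consulting a mapping table, so it can accept byte pairs that have no assigned character.

```go
package main

import "fmt"

// looksLikeBig5 is a hypothetical helper, not the package's IsValidBig5.
// It checks byte structure only: ASCII bytes stand alone, and every lead
// byte in 0x81-0xFE must be followed by a trail byte in 0x40-0x7E or
// 0xA1-0xFE.
func looksLikeBig5(content []byte) bool {
	for i := 0; i < len(content); i++ {
		b := content[i]
		if b < 0x80 {
			continue // single-byte ASCII
		}
		if b < 0x81 || b > 0xFE {
			return false // 0x80 and 0xFF are not valid lead bytes
		}
		if i+1 >= len(content) {
			return false // truncated two-byte sequence
		}
		t := content[i+1]
		if !((t >= 0x40 && t <= 0x7E) || (t >= 0xA1 && t <= 0xFE)) {
			return false // trail byte outside both allowed ranges
		}
		i++ // consume the trail byte
	}
	return true
}

func main() {
	fmt.Println(looksLikeBig5([]byte{0xA4, 0xA4, 0xA4, 0xE5})) // two complete byte pairs
	fmt.Println(looksLikeBig5([]byte{0xA4}))                   // lead byte with no trail byte
}
```

A range check like this cannot tell Big5 apart from GBK on its own, because the two byte ranges overlap heavily; a validity check is best read as a filter, with the statistical detector settling the ambiguity.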

func IsValidGB18030

func IsValidGB18030(content []byte) bool

Check whether content is valid under the GB18030 rules. Reference: https://zh.wikipedia.org/wiki/GB_18030

func IsValidGBK

func IsValidGBK(content []byte) bool

Check whether content is valid under the GBK rules. Reference: https://zh.wikipedia.org/wiki/GBK
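
The relationship between the GBK and GB18030 checks can be sketched as follows. This is an illustration under stated assumptions, not the package's code: looksLikeGB and its helpers are hypothetical names, and the byte ranges are the commonly cited ones (two-byte sequences with lead 0x81-0xFE and trail 0x40-0xFE except 0x7F; GB18030 additionally allows four-byte sequences whose second and fourth bytes are 0x30-0x39).

```go
package main

import "fmt"

// isLead reports whether b can start a multi-byte sequence.
func isLead(b byte) bool { return b >= 0x81 && b <= 0xFE }

// isGBKTrail reports whether b can be the second byte of a two-byte sequence.
func isGBKTrail(b byte) bool { return b >= 0x40 && b <= 0xFE && b != 0x7F }

// isDigitByte reports whether b is in 0x30-0x39, the range used by the
// second and fourth bytes of a four-byte GB18030 sequence.
func isDigitByte(b byte) bool { return b >= 0x30 && b <= 0x39 }

// looksLikeGB is a hypothetical structural check. With allowFourByte set to
// false it accepts only GBK-shaped input; with true it also accepts the
// four-byte GB18030 form.
func looksLikeGB(content []byte, allowFourByte bool) bool {
	for i := 0; i < len(content); {
		b := content[i]
		rest := len(content) - i
		switch {
		case b < 0x80:
			i++ // single-byte ASCII
		case !isLead(b):
			return false // 0x80 and 0xFF cannot start a sequence
		case rest >= 2 && isGBKTrail(content[i+1]):
			i += 2
		case allowFourByte && rest >= 4 && isDigitByte(content[i+1]) &&
			isLead(content[i+2]) && isDigitByte(content[i+3]):
			i += 4
		default:
			return false
		}
	}
	return true
}

func main() {
	twoByte := []byte{0xD6, 0xD0, 0xCE, 0xC4}  // two two-byte sequences
	fourByte := []byte{0x81, 0x30, 0x81, 0x30} // one four-byte sequence
	fmt.Println(looksLikeGB(twoByte, false), looksLikeGB(twoByte, true))   // true true
	fmt.Println(looksLikeGB(fourByte, false), looksLikeGB(fourByte, true)) // false true
}
```

Because the two-byte trail range starts at 0x40 and the four-byte digit range ends at 0x39, the second byte alone decides which form a sequence takes.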

func IsValidUTF16

func IsValidUTF16(content []byte) (isUTF16 bool, BE bool, LE bool)

Check whether content is valid under the UTF-16 rules. Reference: https://zh.wikipedia.org/wiki/UTF-16

Returns: isUTF16 bool, BE bool, LE bool

BE: true if content is valid under the UTF-16-BE rules, false if not. LE: true if content is valid under the UTF-16-LE rules, false if not.
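
The three return values are easier to read with a worked example. The sketch below is an assumption about what such a check involves, not the package's implementation: validUTF16 and checkUTF16 are hypothetical names, validity here means "even length and correctly paired surrogates", and the first result is assumed to be true when either byte order validates.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// validUTF16 reports whether content is well-formed UTF-16 in the given byte
// order: an even number of bytes, with every high surrogate (0xD800-0xDBFF)
// immediately followed by a low surrogate (0xDC00-0xDFFF).
func validUTF16(content []byte, order binary.ByteOrder) bool {
	if len(content)%2 != 0 {
		return false
	}
	for i := 0; i < len(content); i += 2 {
		u := order.Uint16(content[i:])
		switch {
		case u >= 0xD800 && u <= 0xDBFF:
			if i+3 >= len(content) {
				return false // high surrogate at the end of input
			}
			next := order.Uint16(content[i+2:])
			if next < 0xDC00 || next > 0xDFFF {
				return false // high surrogate not followed by a low one
			}
			i += 2 // skip the low surrogate
		case u >= 0xDC00 && u <= 0xDFFF:
			return false // low surrogate without a preceding high one
		}
	}
	return true
}

// checkUTF16 mirrors the shape of IsValidUTF16's results (an assumption).
func checkUTF16(content []byte) (isUTF16, be, le bool) {
	be = validUTF16(content, binary.BigEndian)
	le = validUTF16(content, binary.LittleEndian)
	return be || le, be, le
}

func main() {
	ok, be, le := checkUTF16([]byte{0x00, 0x68, 0x00, 0x69}) // "hi" in UTF-16-BE
	fmt.Println(ok, be, le)                                  // true true true
	ok, be, le = checkUTF16([]byte{0xDC, 0x00, 0x00, 0x41})
	fmt.Println(ok, be, le) // true false true
}
```

The first call shows why BE and LE are reported separately: most byte sequences are structurally valid in both orders, so validity alone rarely identifies the byte order.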

func IsValidUTF16BE

func IsValidUTF16BE(content []byte) bool

Check whether content is valid under UTF-16-BE rule, reference: https://zh.wikipedia.org/wiki/UTF-16

func IsValidUTF16LE

func IsValidUTF16LE(content []byte) bool

Check whether content is valid under the UTF-16-LE rules. Reference: https://zh.wikipedia.org/wiki/UTF-16. This function assumes content is little-endian and then validates it with the same method used for UTF-16-BE.

func IsValidUTF8

func IsValidUTF8(content []byte) bool

Check whether content is valid under the UTF-8 rules.
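
For comparison, the Go standard library already provides a UTF-8 validity check. The sketch below uses it to show what passes and what fails; whether IsValidUTF8 behaves identically on every input is an assumption, and isUTF8 is a hypothetical wrapper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isUTF8 is a hypothetical wrapper around the standard library check.
func isUTF8(content []byte) bool {
	return utf8.Valid(content)
}

func main() {
	fmt.Println(isUTF8([]byte("搜狐在线")))     // true: Go string literals are UTF-8
	fmt.Println(isUTF8([]byte{0xCB, 0xD1})) // false: 0xCB needs a continuation byte in 0x80-0xBF
}
```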

func ToUtf8WithCharsetName added in v0.0.3

func ToUtf8WithCharsetName(content []byte, charsetName string) ([]byte, error)

Convert content to a UTF-8 encoded []byte, decoding it with the charset identified by charsetName.

func ToUtf8WithDecoder

func ToUtf8WithDecoder(content []byte, d Decoder) ([]byte, error)

Convert content to a UTF-8 encoded []byte using the given Decoder.

func ToUtf8WithEncoding

func ToUtf8WithEncoding(content []byte, e encoding.Encoding) ([]byte, error)

Convert content to a UTF-8 encoded []byte using the given encoding.Encoding.
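
The three ToUtf8With... functions differ only in how the source charset is specified; each returns UTF-8 bytes. As a rough intuition for what "decoding to UTF-8" means, the sketch below hand-decodes ISO-8859-1, the simplest charset in the supported list, using only the standard library. latin1ToUTF8 is a hypothetical helper and is not how the package performs conversion.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 is a hypothetical helper. In ISO-8859-1 every byte value is
// the Unicode code point of the same number, so decoding amounts to
// re-encoding each byte as a rune in UTF-8.
func latin1ToUTF8(content []byte) []byte {
	out := make([]byte, 0, len(content))
	for _, b := range content {
		out = utf8.AppendRune(out, rune(b)) // requires Go 1.18 or later
	}
	return out
}

func main() {
	latin1 := []byte{0x63, 0x61, 0x66, 0xE9} // "café" in ISO-8859-1
	fmt.Printf("%s\n", latin1ToUTF8(latin1)) // café
}
```

Multi-byte charsets such as GBK or Shift JIS need mapping tables rather than a one-line rule, which is why the package delegates to a Decoder.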

func UnicodeRuneToUtf8 added in v0.0.3

func UnicodeRuneToUtf8(unicode rune) (utf8codes []byte)
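
UnicodeRuneToUtf8 has no doc comment; judging by its signature, it presumably returns the UTF-8 encoding of a single rune. Under that assumption, a standard library equivalent might look like the sketch below, where runeToUTF8 is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToUTF8 is a hypothetical stand-in that encodes one rune as UTF-8.
// Invalid runes are encoded as U+FFFD, which is utf8.EncodeRune's behavior.
func runeToUTF8(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	fmt.Printf("% X\n", runeToUTF8('A'))  // 41
	fmt.Printf("% X\n", runeToUTF8('搜')) // E6 90 9C
}
```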

Types

type Decoder added in v0.0.3

type Decoder interface {
	transform.Transformer
}

Decoder is an interface that embeds transform.Transformer, so any transform.Transformer can be used as a Decoder.

func GetDecoderFromCharsetName added in v0.0.3

func GetDecoderFromCharsetName(charsetName string) (decoder Decoder, err error)

Get a Decoder from a charset name. Returns errors.New("No matched decoder!") if there is no matching decoder.

type Result

type Result struct {
	// IANA name of the detected charset.
	Charset string
	// IANA name of the detected language. It may be empty for some charsets.
	Language string
	// Confidence of the Result. Scale from 1 to 100. The bigger, the more confident.
	Confidence int
	// a Decoder which can convert the Result.Charset to utf-8, default encoding.Nop.NewDecoder() which won't try to convert the charset.
	Decoder transform.Transformer
	// Whether the charset can be converted by this package
	Convertible bool
}

Result contains all the information that charset detector gives.

func DetectAll

func DetectAll(content []byte) (results []*Result, err error)

DetectAll returns all chardet.Results which have non-zero Confidence. The Results are sorted by Confidence in descending order.

It behaves like chardet.NewTextDetector().DetectAll(content) from saintfish/chardet, but also saves the matching Decoder in each result.

func DetectAndConvertToUtf8

func DetectAndConvertToUtf8(content []byte) (convertedContent []byte, res *Result, err error)

Detect and convert content to UTF-8 encoded content.

func DetectEncoding

func DetectEncoding(content []byte) (result *Result, err error)

DetectEncoding returns the Result with the highest Confidence.
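
The relationship between DetectAll and DetectEncoding, as described above, can be sketched with plain data. This is a minimal illustration and not the package's code: candidate is a hypothetical stand-in for Result, and rank and best are hypothetical helpers that mirror "drop zero-confidence results, sort descending" and "take the top one".

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, trimmed-down stand-in for charset.Result.
type candidate struct {
	Charset    string
	Confidence int
}

// rank keeps candidates with non-zero confidence and sorts them from most to
// least confident, which is the ordering DetectAll is documented to return.
func rank(cs []candidate) []candidate {
	out := make([]candidate, 0, len(cs))
	for _, c := range cs {
		if c.Confidence > 0 {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Confidence > out[j].Confidence
	})
	return out
}

// best returns the top-ranked candidate, analogous to DetectEncoding.
// The boolean is false when no candidate has non-zero confidence.
func best(cs []candidate) (candidate, bool) {
	ranked := rank(cs)
	if len(ranked) == 0 {
		return candidate{}, false
	}
	return ranked[0], true
}

func main() {
	guesses := []candidate{
		{"Big5", 10},
		{"GB-18030", 100},
		{"EUC-KR", 0},
	}
	top, ok := best(guesses)
	fmt.Println(top.Charset, top.Confidence, ok) // GB-18030 100 true
}
```

In real use, iterating over DetectAll's results is useful when the top guess turns out not to be convertible and a lower-ranked one is worth trying.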
