chardet: Go character encoding detector


Introduction

This is a Go port of Python's chardet library. Much respect and appreciation to the original authors for their excellent work.

chardet is a character encoding detector library written in Go. It helps you automatically detect the character encoding of text content.

Installation

To install chardet, use go get:

go get github.com/joshtechnologygroup/chardet

Supported Encodings & Languages

Supported encodings:

  • Ascii
  • UTF-8
  • UTF-8-SIG
  • UTF-16
  • UTF-16LE
  • UTF-16BE
  • UTF-32
  • UTF-32BE
  • UTF-32LE
  • GB2312
  • HZ-GB-2312
  • SHIFT_JIS
  • Big5
  • Johab
  • KOI8-R
  • TIS-620
  • MacCyrillic
  • MacRoman
  • EUC-TW
  • EUC-KR
  • EUC-JP
  • CP932
  • CP949
  • Windows-1250
  • Windows-1251
  • Windows-1252
  • Windows-1253
  • Windows-1254
  • Windows-1255
  • Windows-1256
  • Windows-1257
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • ISO-8859-13
  • ISO-2022-CN
  • ISO-2022-JP
  • ISO-2022-KR
  • X-ISO-10646-UCS-4-3412
  • X-ISO-10646-UCS-4-2143
  • IBM855
  • IBM866

Supported languages:

  • Chinese
  • Japanese
  • Korean
  • Hebrew
  • Russian
  • Greek
  • Bulgarian
  • Thai
  • Turkish

Usage

Basic Usage

The simplest way to use chardet is with the Detect function:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	data := []byte("Your text data here...")
	result := chardet.Detect(data)
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Encoding:Ascii Confidence:1 Language:}
}

Advanced Usage

For handling large amounts of text, you can use the detector incrementally. This allows the detector to stop as soon as it reaches sufficient confidence in its result.

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	// Create a detector instance
	detector := chardet.NewUniversalDetector(0)
	// Process text in chunks
	chunk1 := []byte("First chunk of text...")
	chunk2 := []byte("Second chunk of text...")
	detector.Feed(chunk1)
	detector.Feed(chunk2)
	// Get the result
	result := detector.GetResult()
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Encoding:Ascii Confidence:1 Language:}
}

Processing Multiple Files

You can reuse the same detector instance for multiple files by using the Reset() method:

package main

import (
	"fmt"
	"os"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	detector := chardet.NewUniversalDetector(0)
	files := []string{"file1.txt", "file2.txt"}
	for _, file := range files {
		detector.Reset()
		data, err := os.ReadFile(file)
		if err != nil {
			continue
		}
		detector.Feed(data)
		result := detector.GetResult()
		fmt.Printf("File %s encoding: %+v\n", file, result)
	}
}

License

chardet is licensed under the MIT License, 100% free and open-source, forever.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func EscDetector

func EscDetector(buf []byte) bool

EscDetector checks if the buffer contains escape sequences commonly used in certain character encodings
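
A minimal sketch of calling the predicate on a buffer that contains ISO-2022-style escape sequences; the payload bytes are only illustrative:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	// 0x1B is the ASCII ESC byte; ESC $ B and ESC ( B are the designator
	// sequences used by ISO-2022-JP.
	sample := []byte{0x1B, '$', 'B', 'a', 'b', 'c', 0x1B, '(', 'B'}
	fmt.Println(chardet.EscDetector(sample))                     // expected: true
	fmt.Println(chardet.EscDetector([]byte("plain ASCII text"))) // expected: false
}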

func HighByteDetector

func HighByteDetector(buf []byte) bool

HighByteDetector checks if the buffer contains any bytes with values >= 0x80
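
A quick sketch contrasting pure ASCII with UTF-8 text whose multi-byte sequences contain bytes >= 0x80:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	ascii := []byte("plain ASCII text")
	accented := []byte("héllo") // 'é' is encoded as the bytes 0xC3 0xA9 in UTF-8
	fmt.Println(chardet.HighByteDetector(ascii))    // expected: false, all bytes are below 0x80
	fmt.Println(chardet.HighByteDetector(accented)) // expected: true, contains bytes >= 0x80
}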

func WinByteDetector

func WinByteDetector(buf []byte) bool

WinByteDetector checks if the buffer contains Windows-specific byte values in the range 0x80-0x9F
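
The 0x80-0x9F range is where Windows code pages such as Windows-1252 place printable characters (curly quotes, dashes) that the ISO-8859 family reserves for control codes, so this check helps tell the two families apart. A minimal sketch:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	// 0x93 and 0x94 are the curly double quotes in Windows-1252.
	text := []byte{'s', 'h', 'e', ' ', 's', 'a', 'i', 'd', ' ', 0x93, 'h', 'i', 0x94}
	fmt.Println(chardet.WinByteDetector(text)) // expected: true, bytes fall in 0x80-0x9F
}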

Types

type Result

type Result struct {
	// Encoding is the detected character encoding name
	Encoding string `json:"encoding,omitempty"`
	// Confidence indicates how confident the detector is about the result (0.0-1.0)
	Confidence float64 `json:"confidence,omitempty"`
	// Language represents the detected language (if applicable)
	Language string `json:"language,omitempty"`
}

Result represents the character encoding detection result
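
Because the fields carry json tags with omitempty, a Result can be serialized directly; a small sketch (the printed shape is illustrative):

package main

import (
	"encoding/json"
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	res := chardet.Detect([]byte("plain ASCII text"))
	out, _ := json.Marshal(res)
	fmt.Println(string(out)) // e.g. {"encoding":"Ascii","confidence":1}
}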

func Detect

func Detect(buf []byte) Result

Detect the encoding of the given byte string.

func DetectAll

func DetectAll(buf []byte) []Result

DetectAll the possible encodings of the given byte string.
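
DetectAll is useful when you want to inspect the runner-up candidates rather than only the top result; a minimal sketch using the Result fields defined above:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	data := []byte("Quelques octets de texte à analyser...")
	for _, candidate := range chardet.DetectAll(data) {
		fmt.Printf("%s (%s): %.2f\n", candidate.Encoding, candidate.Language, candidate.Confidence)
	}
}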

type UniversalDetector

type UniversalDetector struct {
	// MinimumThreshold is the minimum confidence threshold for detection
	MinimumThreshold float64
	// IsoWinMap maps ISO encodings to Windows encodings
	IsoWinMap map[string]string
	// contains filtered or unexported fields
}

UniversalDetector implements universal character encoding detection

func NewUniversalDetector

func NewUniversalDetector(filter consts.LangFilter) *UniversalDetector

NewUniversalDetector creates a new UniversalDetector instance with the specified language filter

func (*UniversalDetector) Feed

func (u *UniversalDetector) Feed(buf []byte)

Feed processes a chunk of bytes for character encoding detection. It analyzes the input data and updates the internal state accordingly.

func (*UniversalDetector) GetResult

func (u *UniversalDetector) GetResult() Result

GetResult returns the final character encoding detection result. If detection is not complete, it finalizes the detection process.

func (*UniversalDetector) Reset

func (u *UniversalDetector) Reset()

Reset resets the detector to its initial state
