chardet: Go character encoding detector


Introduction

This is a Go port of Python's chardet library. Much respect and appreciation to the original authors for their excellent work.

chardet is a character encoding detector library written in Go. It helps you automatically detect the character encoding of text content.

Installation

To install chardet, use go get:

go get github.com/joshtechnologygroup/chardet

Supported Encodings & Languages

Supported encodings:

  • Ascii
  • UTF-8
  • UTF-8-SIG
  • UTF-16
  • UTF-16LE
  • UTF-16BE
  • UTF-32
  • UTF-32BE
  • UTF-32LE
  • GB2312
  • HZ-GB-2312
  • SHIFT_JIS
  • Big5
  • Johab
  • KOI8-R
  • TIS-620
  • MacCyrillic
  • MacRoman
  • EUC-TW
  • EUC-KR
  • EUC-JP
  • CP932
  • CP949
  • Windows-1250
  • Windows-1251
  • Windows-1252
  • Windows-1253
  • Windows-1254
  • Windows-1255
  • Windows-1256
  • Windows-1257
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • ISO-8859-13
  • ISO-2022-CN
  • ISO-2022-JP
  • ISO-2022-KR
  • X-ISO-10646-UCS-4-3412
  • X-ISO-10646-UCS-4-2143
  • IBM855
  • IBM866

Supported languages:

  • Chinese
  • Japanese
  • Korean
  • Hebrew
  • Russian
  • Greek
  • Bulgarian
  • Thai
  • Turkish

Usage

Basic Usage

The simplest way to use chardet is with the Detect function:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	data := []byte("Your text data here...")
	result := chardet.Detect(data)
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Encoding:Ascii Confidence:1 Language:}
}

Advanced Usage

For handling large amounts of text, you can use the detector incrementally. This allows the detector to stop as soon as it reaches sufficient confidence in its result.

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	// Create a detector instance
	detector := chardet.NewUniversalDetector(0)
	// Process text in chunks
	chunk1 := []byte("First chunk of text...")
	chunk2 := []byte("Second chunk of text...")
	detector.Feed(chunk1)
	detector.Feed(chunk2)
	// Get the result
	result := detector.GetResult()
	fmt.Printf("Detected result: %+v\n", result)
	// Output: Detected result: {Encoding:Ascii Confidence:1 Language:}
}

Processing Multiple Files

You can reuse the same detector instance for multiple files by using the Reset() method:

package main

import (
	"fmt"
	"os"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	detector := chardet.NewUniversalDetector(0)
	files := []string{"file1.txt", "file2.txt"}
	for _, file := range files {
		detector.Reset()
		data, err := os.ReadFile(file)
		if err != nil {
			continue
		}
		detector.Feed(data)
		result := detector.GetResult()
		fmt.Printf("File %s encoding: %+v\n", file, result)
	}
}

License

chardet is licensed under the MIT License, 100% free and open-source, forever.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func EscDetector

func EscDetector(buf []byte) bool

EscDetector checks if the buffer contains escape sequences commonly used in certain character encodings
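
A minimal sketch of calling the predicate on a buffer that contains ISO-2022-style escape sequences; the payload bytes are only illustrative:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	// 0x1B is the ASCII ESC byte; ESC $ B and ESC ( B are the designator
	// sequences used by ISO-2022-JP.
	sample := []byte{0x1B, '$', 'B', 'a', 'b', 'c', 0x1B, '(', 'B'}
	fmt.Println(chardet.EscDetector(sample))                     // expected: true
	fmt.Println(chardet.EscDetector([]byte("plain ASCII text"))) // expected: false
}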

func HighByteDetector

func HighByteDetector(buf []byte) bool

HighByteDetector checks if the buffer contains any bytes with values >= 0x80
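
A quick sketch contrasting pure ASCII with UTF-8 text whose multi-byte sequences contain bytes >= 0x80:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	ascii := []byte("plain ASCII text")
	accented := []byte("héllo") // 'é' is encoded as the bytes 0xC3 0xA9 in UTF-8
	fmt.Println(chardet.HighByteDetector(ascii))    // expected: false, all bytes are below 0x80
	fmt.Println(chardet.HighByteDetector(accented)) // expected: true, contains bytes >= 0x80
}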

func WinByteDetector

func WinByteDetector(buf []byte) bool

WinByteDetector checks if the buffer contains Windows-specific byte values in the range 0x80-0x9F
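
The 0x80-0x9F range is where Windows code pages such as Windows-1252 place printable characters (curly quotes, dashes) that the ISO-8859 family reserves for control codes, so this check helps tell the two families apart. A minimal sketch:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	// 0x93 and 0x94 are the curly double quotes in Windows-1252.
	text := []byte{'s', 'h', 'e', ' ', 's', 'a', 'i', 'd', ' ', 0x93, 'h', 'i', 0x94}
	fmt.Println(chardet.WinByteDetector(text)) // expected: true, bytes fall in 0x80-0x9F
}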

Types

type Result

type Result struct {
	// Encoding is the detected character encoding name
	Encoding string `json:"encoding,omitempty"`
	// Confidence indicates how confident the detector is about the result (0.0-1.0)
	Confidence float64 `json:"confidence,omitempty"`
	// Language represents the detected language (if applicable)
	Language string `json:"language,omitempty"`
}

Result represents the character encoding detection result
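
Because the fields carry json tags with omitempty, a Result can be serialized directly; a small sketch (the printed shape is illustrative):

package main

import (
	"encoding/json"
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	res := chardet.Detect([]byte("plain ASCII text"))
	out, _ := json.Marshal(res)
	fmt.Println(string(out)) // e.g. {"encoding":"Ascii","confidence":1}
}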

func Detect

func Detect(buf []byte) Result

Detect the encoding of the given byte string.

func DetectAll

func DetectAll(buf []byte) []Result

DetectAll the possible encodings of the given byte string.
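
DetectAll is useful when you want to inspect the runner-up candidates rather than only the top result; a minimal sketch using the Result fields defined above:

package main

import (
	"fmt"
	"github.com/joshtechnologygroup/chardet"
)

func main() {
	data := []byte("Quelques octets de texte à analyser...")
	for _, candidate := range chardet.DetectAll(data) {
		fmt.Printf("%s (%s): %.2f\n", candidate.Encoding, candidate.Language, candidate.Confidence)
	}
}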

type UniversalDetector

type UniversalDetector struct {
	// MinimumThreshold is the minimum confidence threshold for detection
	MinimumThreshold float64
	// IsoWinMap maps ISO encodings to Windows encodings
	IsoWinMap map[string]string
	// contains filtered or unexported fields
}

UniversalDetector implements universal character encoding detection

func NewUniversalDetector

func NewUniversalDetector(filter consts.LangFilter) *UniversalDetector

NewUniversalDetector creates a new UniversalDetector instance with the specified language filter

func (*UniversalDetector) Feed

func (u *UniversalDetector) Feed(buf []byte)

Feed processes a chunk of bytes for character encoding detection. It analyzes the input data and updates the internal state accordingly.

func (*UniversalDetector) GetResult

func (u *UniversalDetector) GetResult() Result

GetResult returns the final character encoding detection result. If detection is not complete, it finalizes the detection process.

func (*UniversalDetector) Reset

func (u *UniversalDetector) Reset()

Reset resets the detector to its initial state
