Documentation
Overview ¶
Package normalizer provides text normalization utilities for cleaning and standardizing text data.
The normalizer package is designed to work with structured data, automatically applying normalization functions to string fields within complex data structures like structs, slices, arrays, and maps. This is particularly useful when processing data from external sources that may contain inconsistent formatting, encoding, or special characters.
Core Concepts ¶
A Normalizer is a function that takes a string and returns a normalized string. The package provides several built-in normalizers and a mechanism to apply them recursively to complex data structures.
Built-in Normalizers ¶
The package includes several common text normalization functions:
- HTMLUnescape: Converts HTML entities to their corresponding characters
- UnicodeNFC: Normalizes Unicode text to NFC form (Canonical Decomposition followed by Canonical Composition)
- Punctuation: Replaces fancy/smart punctuation with standard ASCII equivalents
Usage Example ¶
The most common usage pattern is to define a set of normalizers and apply them to structured data:
normalizers := []normalizer.Normalizer{
    normalizer.HTMLUnescape,
    normalizer.UnicodeNFC,
    normalizer.Punctuation,
}

type Document struct {
    Title   string
    Content string
    Tags    []string
}

doc := &Document{
    Title:   "Hello &amp; Welcome",
    Content: "This is “smart” text with fancy punctuation…",
    Tags:    []string{"tag1", "tag2"},
}

normalizer.Apply(doc, normalizers...)
// doc.Title is now "Hello & Welcome"
// doc.Content is now `This is "smart" text with fancy punctuation...`
// doc.Tags remain unchanged
Custom Normalizers ¶
You can create custom normalizers by implementing the Normalizer function type:
func TrimWhitespace(s string) string {
    return strings.TrimSpace(s)
}

normalizers := []normalizer.Normalizer{
    TrimWhitespace,
    normalizer.HTMLUnescape,
}
Supported Data Types ¶
The Apply function works with the following data types:
- Strings: Direct normalization
- Structs: Recursively applies to all string fields
- Pointers: Dereferences and applies to the underlying value
- Interfaces: Applies to the underlying concrete value
- Slices/Arrays: Applies to each element
- Maps: Applies to string values and recursively to other values
Performance Considerations ¶
The Apply function uses reflection to traverse data structures, which has some performance overhead. For high-performance scenarios, consider applying normalizers directly to known string fields rather than using the generic Apply function.
Thread Safety ¶
All normalizer functions are stateless and thread-safe. The Apply function modifies data in-place and should not be called concurrently on the same data structure.
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Apply ¶
func Apply(v any, normalizers ...Normalizer)
Apply recursively applies the given normalizers to all string fields in the provided value. The function uses reflection to traverse complex data structures and applies normalizers to string fields found within structs, slices, arrays, maps, and pointers.
Apply modifies the input value in-place. If the value is nil, the function returns early. For maps, only string values are normalized directly; other value types are processed recursively.
Example:
type Person struct {
    Name string
    Bio  string
}

p := &Person{Name: "John &amp; Jane", Bio: "Hello “world”"}
normalizer.Apply(p, normalizer.HTMLUnescape, normalizer.Punctuation)
// p.Name is now "John & Jane", p.Bio is now `Hello "world"`
Example ¶
ExampleApply demonstrates basic usage of the Apply function
type Person struct {
    Name string
    Bio  string
}

p := &Person{
    Name: "John &amp; Jane",
    Bio:  "Hello " + string(rune(0x201C)) + "world" + string(rune(0x201D)) + " — nice to meet you" + string(rune(0x2026)),
}

normalizers := []Normalizer{
    HTMLUnescape,
    Punctuation,
}

Apply(p, normalizers...)
// p.Name is now "John & Jane"
// p.Bio is now `Hello "world" - nice to meet you...`
_ = p
Example (Complex) ¶
ExampleApply_complex demonstrates working with complex nested structures
type Comment struct {
    Author  string
    Content string
}

type Post struct {
    Title    string
    Content  string
    Comments []Comment
    Tags     map[string]string
}

post := &Post{
    Title:   "Post &amp; News",
    Content: "This is " + string(rune(0x201C)) + "smart" + string(rune(0x201D)) + " content" + string(rune(0x2026)),
    Comments: []Comment{
        {Author: "John", Content: "Great " + string(rune(0x201C)) + "post" + string(rune(0x201D)) + "!"},
        {Author: "Jane", Content: "I agree &lt; 3"},
    },
    Tags: map[string]string{
        "category": "Tech &amp; News",
        "status":   "Published",
    },
}

normalizers := []Normalizer{
    HTMLUnescape,
    Punctuation,
}

Apply(post, normalizers...)
// All string fields in post and its nested structures are normalized
_ = post
Example (Custom) ¶
ExampleApply_custom demonstrates using custom normalizers
// Define custom normalizers
TrimSpace := func(s string) string {
return strings.TrimSpace(s)
}
ToTitle := func(s string) string {
return strings.Title(s)
}
type Article struct {
Title string
Content string
}
article := &Article{
Title: " hello world ",
Content: " this is content ",
}
normalizers := []Normalizer{
TrimSpace,
ToTitle,
}
Apply(article, normalizers...)
// article.Title is now "Hello World"
// article.Content is now "This Is Content"
_ = article
func HTMLUnescape ¶
HTMLUnescape converts HTML entities to their corresponding characters. This function uses the standard library's html.UnescapeString to decode HTML entities like &amp;, &lt;, &gt;, &quot;, etc.
Example:
HTMLUnescape("Hello &amp; Welcome")  // Returns "Hello & Welcome"
HTMLUnescape("Price &lt; $100")      // Returns "Price < $100"
Example ¶
ExampleHTMLUnescape demonstrates basic usage of HTMLUnescape
text := "Hello &amp; Welcome to our &lt;website&gt;"
normalized := HTMLUnescape(text)
fmt.Println(normalized)
Output: Hello & Welcome to our <website>
Example (Advanced) ¶
ExampleHTMLUnescape_advanced demonstrates handling multiple entity types
text := "He said &quot;Hello &amp; goodbye&quot; &lt; 5 minutes"
normalized := HTMLUnescape(text)
fmt.Println(normalized)
Output: He said "Hello & goodbye" < 5 minutes
func Punctuation ¶
Punctuation replaces fancy/smart punctuation characters with standard ASCII equivalents. This normalizer converts various Unicode punctuation marks to their basic ASCII counterparts:
- Smart quotes (“ ”) → straight quotes (")
- Smart apostrophes (‘ ’) → straight apostrophes (')
- Em/en dashes (—–) → hyphens (-)
- Ellipsis (…) → three dots (...)
- Non-breaking space (\u00A0) → regular space
This is useful for standardizing text that may contain fancy punctuation from word processors or web content.
Example:
Punctuation("“Hello” — he said…") // Returns `"Hello" - he said...`
Example ¶
ExamplePunctuation demonstrates basic usage of Punctuation
text := string(rune(0x201C)) + "Hello" + string(rune(0x201D)) + " " + string(rune(0x2014)) + " he said" + string(rune(0x2026)) + " " + string(rune(0x2018)) + "Really?" + string(rune(0x2019))
normalized := Punctuation(text)
fmt.Println(normalized)
Output: "Hello" - he said... 'Really?'
Example (Mixed) ¶
ExamplePunctuation_mixed demonstrates handling various punctuation types
text := "Smart " + string(rune(0x201C)) + "quotes" + string(rune(0x201D)) + " and " + string(rune(0x2018)) + "apostrophes" + string(rune(0x2019)) + " " + string(rune(0x2014)) + " with dashes" + string(rune(0x2026)) + " and spaces" + string(rune(0x00A0)) + "here"
normalized := Punctuation(text)
fmt.Println(normalized)
Output: Smart "quotes" and 'apostrophes' - with dashes... and spaces here
func UnicodeNFC ¶
UnicodeNFC normalizes Unicode text to NFC (Canonical Decomposition followed by Canonical Composition) form. This function uses the golang.org/x/text/unicode/norm package to ensure consistent Unicode representation by decomposing characters and then recomposing them in canonical order.
NFC normalization is important for:
- Consistent string comparison and searching
- Preventing duplicate entries that differ only in Unicode representation
- Ensuring compatibility across different systems and platforms
Example:
UnicodeNFC("café") // Returns properly normalized "café"
UnicodeNFC("cafe\u0301") // Also returns "café" (normalized form)
Example ¶
ExampleUnicodeNFC demonstrates basic usage of UnicodeNFC
text := "cafe" + string(rune(0x0301)) // decomposed form
normalized := UnicodeNFC(text)
fmt.Println(normalized)
Output: café
Example (Mixed) ¶
ExampleUnicodeNFC_mixed demonstrates handling mixed normalized and decomposed text
text := "café cafe" + string(rune(0x0301)) + " naïve nai" + string(rune(0x0308)) + "ve"
normalized := UnicodeNFC(text)
fmt.Println(normalized)
Output: café café naïve naïve
Types ¶
type Normalizer ¶
Normalizer is a function that takes a string and returns a normalized string. Normalizers are stateless functions that transform text in a consistent way.