Documentation
Overview ¶
Package normalizer provides text normalization utilities for cleaning and standardizing text data.
The normalizer package is designed to work with structured data, automatically applying normalization functions to string fields within complex data structures like structs, slices, arrays, and maps. This is particularly useful when processing data from external sources that may contain inconsistent formatting, encoding, or special characters.
Core Concepts ¶
A Normalizer is a function that takes a string and returns a normalized string. The package provides several built-in normalizers and a mechanism to apply them recursively to complex data structures.
Built-in Normalizers ¶
The package includes several common text normalization functions:
- HTMLUnescape: Converts HTML entities to their corresponding characters
- UnicodeNFC: Normalizes Unicode text to NFC form (Canonical Decomposition followed by Canonical Composition)
- Punctuation: Replaces fancy/smart punctuation with standard ASCII equivalents
Usage Example ¶
The most common usage pattern is to define a set of normalizers and apply them to structured data:
normalizers := []normalizer.Normalizer{
    normalizer.HTMLUnescape,
    normalizer.UnicodeNFC,
    normalizer.Punctuation,
}

type Document struct {
    Title   string
    Content string
    Tags    []string
}

doc := &Document{
    Title:   "Hello &amp; Welcome",
    Content: "This is “smart” text with fancy punctuation…",
    Tags:    []string{"tag1", "tag2"},
}

normalizer.Apply(doc, normalizers...)
// doc.Title is now "Hello & Welcome"
// doc.Content is now `This is "smart" text with fancy punctuation...`
// doc.Tags remain unchanged
Custom Normalizers ¶
You can create custom normalizers by implementing the Normalizer function type:
func TrimWhitespace(s string) string {
    return strings.TrimSpace(s)
}

normalizers := []normalizer.Normalizer{
    TrimWhitespace,
    normalizer.HTMLUnescape,
}
Supported Data Types ¶
The Apply function works with the following data types:
- Strings: Direct normalization
- Structs: Recursively applies to all string fields
- Pointers: Dereferences and applies to the underlying value
- Interfaces: Applies to the underlying concrete value
- Slices/Arrays: Applies to each element
- Maps: Applies to string values and recursively to other values
Performance Considerations ¶
The Apply function uses reflection to traverse data structures, which has some performance overhead. For high-performance scenarios, consider applying normalizers directly to known string fields rather than using the generic Apply function.
Thread Safety ¶
All normalizer functions are stateless and thread-safe. The Apply function modifies data in-place and should not be called concurrently on the same data structure.
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Apply ¶
func Apply(v any, normalizers ...Normalizer)
Apply recursively applies the given normalizers to all string fields in the provided value. The function uses reflection to traverse complex data structures and applies normalizers to string fields found within structs, slices, arrays, maps, and pointers.
Apply modifies the input value in-place. If the value is nil, the function returns early. For maps, only string values are normalized directly; other value types are processed recursively.
Example:
type Person struct {
    Name string
    Bio  string
}

p := &Person{Name: "John &amp; Jane", Bio: "Hello “world”"}
normalizer.Apply(p, normalizer.HTMLUnescape, normalizer.Punctuation)
// p.Name is now "John & Jane", p.Bio is now `Hello "world"`
Example ¶
ExampleApply demonstrates basic usage of the Apply function
type Person struct {
    Name string
    Bio  string
}

p := &Person{
    Name: "John &amp; Jane",
    Bio:  "Hello " + string(rune(0x201C)) + "world" + string(rune(0x201D)) + " — nice to meet you" + string(rune(0x2026)),
}

normalizers := []Normalizer{
    HTMLUnescape,
    Punctuation,
}

Apply(p, normalizers...)
// p.Name is now "John & Jane"
// p.Bio is now `Hello "world" - nice to meet you...`
_ = p
Example (Complex) ¶
ExampleApply_complex demonstrates working with complex nested structures
type Comment struct {
    Author  string
    Content string
}

type Post struct {
    Title    string
    Content  string
    Comments []Comment
    Tags     map[string]string
}

post := &Post{
    Title:   "Post &amp; News",
    Content: "This is " + string(rune(0x201C)) + "smart" + string(rune(0x201D)) + " content" + string(rune(0x2026)),
    Comments: []Comment{
        {Author: "John", Content: "Great " + string(rune(0x201C)) + "post" + string(rune(0x201D)) + "!"},
        {Author: "Jane", Content: "I agree &lt; 3"},
    },
    Tags: map[string]string{
        "category": "Tech &amp; News",
        "status":   "Published",
    },
}

normalizers := []Normalizer{
    HTMLUnescape,
    Punctuation,
}

Apply(post, normalizers...)
// All string fields in post and its nested structures are normalized
_ = post
Example (Custom) ¶
ExampleApply_custom demonstrates using custom normalizers
// Define custom normalizers
TrimSpace := func(s string) string {
return strings.TrimSpace(s)
}
ToTitle := func(s string) string {
return strings.Title(s)
}
type Article struct {
Title string
Content string
}
article := &Article{
Title: " hello world ",
Content: " this is content ",
}
normalizers := []Normalizer{
TrimSpace,
ToTitle,
}
Apply(article, normalizers...)
// article.Title is now "Hello World"
// article.Content is now "This Is Content"
_ = article
func HTMLUnescape ¶
HTMLUnescape converts HTML entities to their corresponding characters. This function uses the standard library's html.UnescapeString to decode HTML entities like &amp;, &lt;, &gt;, &quot;, etc.
Example:
HTMLUnescape("Hello &amp; Welcome")  // Returns "Hello & Welcome"
HTMLUnescape("Price &lt; $100")      // Returns "Price < $100"
Example ¶
ExampleHTMLUnescape demonstrates basic usage of HTMLUnescape
text := "Hello &amp; Welcome to our &lt;website&gt;"
normalized := HTMLUnescape(text)
fmt.Println(normalized)
Output: Hello & Welcome to our <website>
Example (Advanced) ¶
ExampleHTMLUnescape_advanced demonstrates handling multiple entity types
text := "He said &quot;Hello &amp; goodbye&quot; &lt; 5 minutes"
normalized := HTMLUnescape(text)
fmt.Println(normalized)
Output: He said "Hello & goodbye" < 5 minutes
func Punctuation ¶
Punctuation replaces fancy/smart punctuation characters with standard ASCII equivalents. This normalizer converts various Unicode punctuation marks to their basic ASCII counterparts:
- Smart quotes (“ ”) → straight quotes (")
- Smart apostrophes (‘ ’) → straight apostrophes (')
- Em/en dashes (—–) → hyphens (-)
- Ellipsis (…) → three dots (...)
- Non-breaking space (\u00A0) → regular space
This is useful for standardizing text that may contain fancy punctuation from word processors or web content.
Example:
Punctuation("“Hello” — he said…") // Returns `"Hello" - he said...`
Example ¶
ExamplePunctuation demonstrates basic usage of Punctuation
text := string(rune(0x201C)) + "Hello" + string(rune(0x201D)) + " " + string(rune(0x2014)) + " he said" + string(rune(0x2026)) + " " + string(rune(0x2018)) + "Really?" + string(rune(0x2019))
normalized := Punctuation(text)
fmt.Println(normalized)
Output: "Hello" - he said... 'Really?'
Example (Mixed) ¶
ExamplePunctuation_mixed demonstrates handling various punctuation types
text := "Smart " + string(rune(0x201C)) + "quotes" + string(rune(0x201D)) + " and " + string(rune(0x2018)) + "apostrophes" + string(rune(0x2019)) + " " + string(rune(0x2014)) + " with dashes" + string(rune(0x2026)) + " and spaces" + string(rune(0x00A0)) + "here"
normalized := Punctuation(text)
fmt.Println(normalized)
Output: Smart "quotes" and 'apostrophes' - with dashes... and spaces here
func UnicodeNFC ¶
UnicodeNFC normalizes Unicode text to NFC (Canonical Decomposition followed by Canonical Composition) form. This function uses the golang.org/x/text/unicode/norm package to ensure consistent Unicode representation by decomposing characters and then recomposing them in canonical order.
NFC normalization is important for:
- Consistent string comparison and searching
- Preventing duplicate entries that differ only in Unicode representation
- Ensuring compatibility across different systems and platforms
Example:
UnicodeNFC("café") // Returns properly normalized "café"
UnicodeNFC("cafe\u0301") // Also returns "café" (normalized form)
Example ¶
ExampleUnicodeNFC demonstrates basic usage of UnicodeNFC
text := "cafe" + string(rune(0x0301)) // decomposed form
normalized := UnicodeNFC(text)
fmt.Println(normalized)
Output: café
Example (Mixed) ¶
ExampleUnicodeNFC_mixed demonstrates handling mixed normalized and decomposed text
text := "café cafe" + string(rune(0x0301)) + " naïve nai" + string(rune(0x0308)) + "ve"
normalized := UnicodeNFC(text)
fmt.Println(normalized)
Output: café café naïve naïve
Types ¶
type Normalizer ¶
Normalizer is a function that takes a string and returns a normalized string. Normalizers are stateless functions that transform text in a consistent way.