normalizer

package
v0.0.0-...-8b90e50
Published: Oct 12, 2025 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package normalizer provides text normalization utilities for cleaning and standardizing text data.

The normalizer package is designed to work with structured data, automatically applying normalization functions to string fields within complex data structures like structs, slices, arrays, and maps. This is particularly useful when processing data from external sources that may contain inconsistent formatting, encoding, or special characters.

Core Concepts

A Normalizer is a function that takes a string and returns a normalized string. The package provides several built-in normalizers and a mechanism to apply them recursively to complex data structures.

Built-in Normalizers

The package includes several common text normalization functions:

  • HTMLUnescape: Converts HTML entities to their corresponding characters
  • UnicodeNFC: Normalizes Unicode text to NFC form (canonical decomposition followed by canonical composition)
  • Punctuation: Replaces fancy/smart punctuation with standard ASCII equivalents

Usage Example

The most common usage pattern is to define a set of normalizers and apply them to structured data:

normalizers := []normalizer.Normalizer{
	normalizer.HTMLUnescape,
	normalizer.UnicodeNFC,
	normalizer.Punctuation,
}

type Document struct {
	Title   string
	Content string
	Tags    []string
}

doc := &Document{
	Title:   "Hello &amp; Welcome",
	Content: "This is “smart” text with fancy punctuation…",
	Tags:    []string{"tag1", "tag2"},
}

normalizer.Apply(doc, normalizers...)
// doc.Title is now "Hello & Welcome"
// doc.Content is now `This is "smart" text with fancy punctuation...`
// doc.Tags remain unchanged

Custom Normalizers

You can create custom normalizers by implementing the Normalizer function type:

func TrimWhitespace(s string) string {
	return strings.TrimSpace(s)
}

normalizers := []normalizer.Normalizer{
	TrimWhitespace,
	normalizer.HTMLUnescape,
}

Supported Data Types

The Apply function works with the following data types:

  • Strings: Direct normalization
  • Structs: Recursively applies to all string fields
  • Pointers: Dereferences and applies to the underlying value
  • Interfaces: Applies to the underlying concrete value
  • Slices/Arrays: Applies to each element
  • Maps: Applies to string values and recursively to other values

Performance Considerations

The Apply function uses reflection to traverse data structures, which has some performance overhead. For high-performance scenarios, consider applying normalizers directly to known string fields rather than using the generic Apply function.

Thread Safety

All normalizer functions are stateless and thread-safe. The Apply function modifies data in-place and should not be called concurrently on the same data structure.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func Apply

func Apply(v any, normalizers ...Normalizer)

Apply recursively applies the given normalizers to all string fields in the provided value. The function uses reflection to traverse complex data structures and applies normalizers to string fields found within structs, slices, arrays, maps, and pointers.

Apply modifies the input value in-place. If the value is nil, the function returns early. For maps, only string values are normalized directly; other value types are processed recursively.

Example:

type Person struct {
	Name string
	Bio  string
}

p := &Person{Name: "John &amp; Jane", Bio: "Hello “world”"}
normalizer.Apply(p, normalizer.HTMLUnescape, normalizer.Punctuation)
// p.Name is now "John & Jane"; p.Bio is now `Hello "world"`
Example

ExampleApply demonstrates basic usage of the Apply function

type Person struct {
	Name string
	Bio  string
}

p := &Person{
	Name: "John &amp; Jane",
	Bio:  "Hello “world” — nice to meet you…",
}

normalizers := []Normalizer{
	HTMLUnescape,
	Punctuation,
}
Apply(p, normalizers...)

// p.Name is now "John & Jane"
// p.Bio is now `Hello "world" - nice to meet you...`
_ = p
Example (Complex)

ExampleApply_complex demonstrates working with complex nested structures

type Comment struct {
	Author  string
	Content string
}

type Post struct {
	Title    string
	Content  string
	Comments []Comment
	Tags     map[string]string
}

post := &Post{
	Title:   "Post &amp; News",
	Content: "This is “smart” content…",
	Comments: []Comment{
		{Author: "John", Content: "Great “post”!"},
		{Author: "Jane", Content: "I agree &lt; 3"},
	},
	Tags: map[string]string{
		"category": "Tech &amp; News",
		"status":   "Published",
	},
}

normalizers := []Normalizer{
	HTMLUnescape,
	Punctuation,
}
Apply(post, normalizers...)

// All string fields in post and its nested structures are normalized
_ = post
Example (Custom)

ExampleApply_custom demonstrates using custom normalizers

// Define custom normalizers
TrimSpace := func(s string) string {
	return strings.TrimSpace(s)
}

ToTitle := func(s string) string {
	return strings.Title(s) // note: strings.Title is deprecated in newer Go; fine for this ASCII example
}

type Article struct {
	Title   string
	Content string
}

article := &Article{
	Title:   "  hello world  ",
	Content: "  this is content  ",
}

normalizers := []Normalizer{
	TrimSpace,
	ToTitle,
}
Apply(article, normalizers...)

// article.Title is now "Hello World"
// article.Content is now "This Is Content"
_ = article

func HTMLUnescape

func HTMLUnescape(s string) string

HTMLUnescape converts HTML entities to their corresponding characters. This function uses the standard library's html.UnescapeString to decode HTML entities such as &amp;, &lt;, &gt;, and &quot;.

Example:

HTMLUnescape("Hello &amp; Welcome") // Returns "Hello & Welcome"
HTMLUnescape("Price &lt; $100")     // Returns "Price < $100"
Example

ExampleHTMLUnescape demonstrates basic usage of HTMLUnescape

text := "Hello &amp; Welcome to our &lt;website&gt;"
normalized := HTMLUnescape(text)
fmt.Println(normalized)
Output:
Hello & Welcome to our <website>
Example (Advanced)

ExampleHTMLUnescape_advanced demonstrates handling multiple entity types

text := "He said &quot;Hello &amp; goodbye&quot; &lt; 5 minutes"
normalized := HTMLUnescape(text)
fmt.Println(normalized)
Output:
He said "Hello & goodbye" < 5 minutes

func Punctuation

func Punctuation(s string) string

Punctuation replaces fancy/smart punctuation characters with standard ASCII equivalents. This normalizer converts various Unicode punctuation marks to their basic ASCII counterparts:

  • Smart quotes (“ ”) → straight quotes (")
  • Smart apostrophes (‘ ’) → straight apostrophes (')
  • Em/en dashes (—–) → hyphens (-)
  • Ellipsis (…) → three dots (...)
  • Non-breaking space (\u00A0) → regular space

This is useful for standardizing text that may contain fancy punctuation from word processors or web content.

Example:

Punctuation("“Hello” — he said…") // Returns `"Hello" - he said...`
Example

ExamplePunctuation demonstrates basic usage of Punctuation

text := "“Hello” — he said… ‘Really?’"
normalized := Punctuation(text)
fmt.Println(normalized)
Output:
"Hello" - he said... 'Really?'
Example (Mixed)

ExamplePunctuation_mixed demonstrates handling various punctuation types

text := "Smart “quotes” and ‘apostrophes’ — with dashes… and spaces\u00A0here"
normalized := Punctuation(text)
fmt.Println(normalized)
Output:
Smart "quotes" and 'apostrophes' - with dashes... and spaces here

func UnicodeNFC

func UnicodeNFC(s string) string

UnicodeNFC normalizes Unicode text to NFC form (canonical decomposition followed by canonical composition). This function uses the golang.org/x/text/unicode/norm package to ensure consistent Unicode representation by decomposing characters and then recomposing them in canonical order.

NFC normalization is important for:

  • Consistent string comparison and searching
  • Preventing duplicate entries that differ only in Unicode representation
  • Ensuring compatibility across different systems and platforms
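The first point is easy to see with plain string comparison: without normalization, two visually identical strings compare unequal because their byte sequences differ.

```go
package main

import "fmt"

func main() {
	// The same visible text can have two Unicode representations:
	precomposed := "caf\u00E9" // é as a single code point (U+00E9)
	decomposed := "cafe\u0301" // e followed by a combining acute accent
	fmt.Println(precomposed == decomposed) // false
	// After NFC normalization (e.g. norm.NFC.String from
	// golang.org/x/text/unicode/norm, which this package uses),
	// both strings would compare equal.
}
```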

Example:

UnicodeNFC("café")       // Returns "café" (already composed)
UnicodeNFC("cafe\u0301") // Also returns "café" (composed form)
Example

ExampleUnicodeNFC demonstrates basic usage of UnicodeNFC

text := "cafe" + string(rune(0x0301)) // decomposed form
normalized := UnicodeNFC(text)
fmt.Println(normalized)
Output:
café
Example (Mixed)

ExampleUnicodeNFC_mixed demonstrates handling mixed normalized and decomposed text

text := "café cafe" + string(rune(0x0301)) + " naïve nai" + string(rune(0x0308)) + "ve"
normalized := UnicodeNFC(text)
fmt.Println(normalized)
Output:
café café naïve naïve

Types

type Normalizer

type Normalizer func(string) string

Normalizer is a function that takes a string and returns a normalized string. Normalizers are stateless functions that transform text in a consistent way.
