textnorm

package
v1.2.0 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 2, 2026 | License: MIT | Imports: 9 | Imported by: 0

Documentation

Overview

Package textnorm provides fluent, deterministic text normalization pipelines.

Use the core Pipeline together with the SplitTokens(), MapTokens(), FilterTokens(), and JoinTokens() helpers for explicit normalization flows. SearchPreset(), CanonicalPreset(), and DBSafePreset() provide thin reusable presets for the common search, canonical, and persistence-safe cases; WithWidthFold() is an option that enables width folding in any of them.

Benchmarks and fuzz targets live in this package so you can measure and harden the current implementation before any optimization work begins.

Examples:

textnorm.SearchPreset().Run("  Café, go!  ")
textnorm.CanonicalPreset().Run("  Hello, World!  ")
textnorm.DBSafePreset(textnorm.WithWidthFold()).Run("  Go\x00  ")

Streaming adapters are intentionally deferred until real usage proves they are worth the extra surface area.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Pipeline

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline holds an ordered list of normalization stages. The zero value is valid.

func CanonicalPreset

func CanonicalPreset(opts ...PresetOption) Pipeline

CanonicalPreset builds a general-purpose canonicalization pipeline.

func DBSafePreset

func DBSafePreset(opts ...PresetOption) Pipeline

DBSafePreset builds a persistence-safe normalization pipeline.

func New

func New() Pipeline

New returns a new empty pipeline.

func SearchPreset

func SearchPreset(opts ...PresetOption) Pipeline

SearchPreset builds a search-key pipeline.

func (Pipeline) CollapseWhitespace

func (p Pipeline) CollapseWhitespace() Pipeline

CollapseWhitespace appends a whitespace-collapsing stage.
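As a rough illustration of what a whitespace-collapsing stage could do, the sketch below uses only the standard library. The helper name collapseWhitespace is hypothetical, and its exact behavior (replacing each run of Unicode whitespace with one ASCII space, without trimming) is an assumption rather than something this documentation states.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseWhitespace is a hypothetical helper: it replaces every run of
// Unicode whitespace with a single ASCII space and copies other runes as-is.
// Leading and trailing runs are collapsed too, not removed.
func collapseWhitespace(s string) string {
	var b strings.Builder
	inSpace := false
	for _, r := range s {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ')
			}
			inSpace = true
			continue
		}
		inSpace = false
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseWhitespace("a \t\n b   c"))
}
```

Keeping collapsing separate from trimming is one possible design; it lets TrimSpace remain its own stage.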

func (Pipeline) FilterRunes

func (p Pipeline) FilterRunes(keep runes.Set) Pipeline

FilterRunes appends a rune-filtering stage.
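A minimal sketch of rune filtering, assuming the stage keeps runes contained in the set and drops everything else. To stay free of third-party imports, the sketch declares a local runeSet interface and a predicate adapter; both names, and filterRunes itself, are hypothetical stand-ins and not part of this package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// runeSet is a local stand-in for a set of runes, so the sketch needs no
// third-party import.
type runeSet interface {
	Contains(r rune) bool
}

// predicate adapts a plain function to runeSet.
type predicate func(rune) bool

func (p predicate) Contains(r rune) bool { return p(r) }

// filterRunes is a hypothetical helper that keeps only the runes in keep.
func filterRunes(s string, keep runeSet) string {
	return strings.Map(func(r rune) rune {
		if keep.Contains(r) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	lettersDigitsSpace := predicate(func(r rune) bool {
		return unicode.IsLetter(r) || unicode.IsDigit(r) || r == ' '
	})
	fmt.Println(filterRunes("Hello, World! 42", lettersDigitsSpace))
}
```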

func (Pipeline) FoldCase

func (p Pipeline) FoldCase() Pipeline

FoldCase appends a full Unicode case-folding stage.
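Case folding differs from lowercasing: folding maps all case variants of a letter to one representative so that comparisons ignore case. The sketch below shows only simple (one rune to one rune) folding with the standard library's unicode.SimpleFold. It is not full folding, which can change string length (for example, mapping "ß" to "ss"), and it is not this package's implementation. The helper name simpleFoldKey is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// simpleFoldKey is a hypothetical helper: it maps every rune to the smallest
// member of its unicode.SimpleFold orbit, so strings that differ only by
// simple case variants share one key. It does not perform full folding.
func simpleFoldKey(s string) string {
	return strings.Map(func(r rune) rune {
		lowest := r
		// SimpleFold cycles through the orbit and eventually returns to r.
		for f := unicode.SimpleFold(r); f != r; f = unicode.SimpleFold(f) {
			if f < lowest {
				lowest = f
			}
		}
		return lowest
	}, s)
}

func main() {
	kelvin := "\u212A" // KELVIN SIGN, a case variant of 'k' and 'K'
	fmt.Println(simpleFoldKey(kelvin) == simpleFoldKey("k"))
	fmt.Println(simpleFoldKey("Go") == simpleFoldKey("gO"))
}
```

The key produced here is only meant for comparison; it is not a display form.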

func (Pipeline) FoldWidth

func (p Pipeline) FoldWidth() Pipeline

FoldWidth appends an explicit width-folding stage.
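Width folding maps fullwidth and halfwidth character variants to a canonical form. The following is a deliberately narrow sketch covering only the fullwidth ASCII block and the ideographic space; the helper name foldWidth is hypothetical, and the package's actual coverage is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// foldWidth is a hypothetical helper that maps fullwidth ASCII variants
// (U+FF01..U+FF5E) to their ASCII counterparts and the ideographic space
// (U+3000) to a plain space. Halfwidth katakana and other width variants
// are left untouched in this sketch.
func foldWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 0xFF01 && r <= 0xFF5E:
			return r - 0xFEE0 // fixed offset between the two blocks
		case r == 0x3000:
			return ' '
		}
		return r
	}, s)
}

func main() {
	// Fullwidth "Go", an ideographic space, then fullwidth "123".
	fmt.Println(foldWidth("\uFF27\uFF4F\u3000\uFF11\uFF12\uFF13"))
}
```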

func (Pipeline) Lower

func (p Pipeline) Lower() Pipeline

Lower appends a Unicode-aware lowercasing stage.

func (Pipeline) MapRunes

func (p Pipeline) MapRunes(fn func(rune) rune) Pipeline

MapRunes appends a rune-mapping stage.
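A short sketch of a rune-mapping function of the shape MapRunes accepts, here straightening typographic quotes. The helper mapRunes is a hypothetical stand-in built on strings.Map. Note that strings.Map drops a rune when the mapping returns a negative value; whether this package does the same is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// mapRunes is a hypothetical helper that applies fn to every rune of s.
func mapRunes(s string, fn func(rune) rune) string {
	return strings.Map(fn, s)
}

// straightenQuotes maps typographic quotes to their ASCII equivalents and
// returns every other rune unchanged.
func straightenQuotes(r rune) rune {
	switch r {
	case '\u2018', '\u2019': // left and right single quotation marks
		return '\''
	case '\u201C', '\u201D': // left and right double quotation marks
		return '"'
	}
	return r
}

func main() {
	fmt.Println(mapRunes("\u201Cgo\u201D isn\u2019t slow", straightenQuotes))
}
```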

func (Pipeline) NormalizeUnicode

func (p Pipeline) NormalizeUnicode() Pipeline

NormalizeUnicode appends a Unicode normalization stage.

func (Pipeline) RemoveAccents

func (p Pipeline) RemoveAccents() Pipeline

RemoveAccents appends a diacritic-removal stage.
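A common way to remove diacritics is to decompose the text, drop the combining marks, and recompose. The sketch below shows only the middle step with the standard library, assuming the input is already decomposed; the helper name stripMarks is hypothetical and this is not necessarily how the package implements the stage.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarks is a hypothetical helper that drops nonspacing marks (category
// Mn). It only removes accents that are already separate combining runes;
// precomposed letters such as U+00E9 must be decomposed first, which the
// standard library alone cannot do.
func stripMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // drop the combining mark
		}
		return r
	}, s)
}

func main() {
	// "Cafe" followed by U+0301 COMBINING ACUTE ACCENT.
	fmt.Println(stripMarks("Cafe\u0301"))
}
```

Because of that limitation, ordering matters in a real pipeline: a decomposing normalization stage would need to run before a mark-stripping stage like this one.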

func (Pipeline) Run

func (p Pipeline) Run(input string) (string, error)

Run executes all stages in declaration order.
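The sketch below illustrates one way in-order execution with error propagation can work. The names stage, run, trimStage, and nonEmptyStage are hypothetical; stopping at the first failing stage and returning an empty string alongside the error are assumptions, since the documentation only states that stages run in declaration order.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// stage mirrors the documented Stage shape: func(string) (string, error).
type stage func(string) (string, error)

// run is a hypothetical helper that applies stages in order and stops at the
// first error. Returning "" on failure is an assumption of this sketch.
func run(input string, stages ...stage) (string, error) {
	out := input
	for i, st := range stages {
		var err error
		out, err = st(out)
		if err != nil {
			return "", fmt.Errorf("stage %d: %w", i, err)
		}
	}
	return out, nil
}

// trimStage removes leading and trailing whitespace and never fails.
func trimStage(s string) (string, error) { return strings.TrimSpace(s), nil }

// nonEmptyStage rejects empty input.
func nonEmptyStage(s string) (string, error) {
	if s == "" {
		return "", errors.New("empty input")
	}
	return s, nil
}

func main() {
	fmt.Println(run("  hi  ", trimStage, nonEmptyStage))
	fmt.Println(run("   ", trimStage, nonEmptyStage))
}
```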

func (Pipeline) SanitizeUTF8

func (p Pipeline) SanitizeUTF8() Pipeline

SanitizeUTF8 appends a UTF-8 and NUL sanitization stage.
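A minimal standard-library sketch of UTF-8 and NUL sanitization. The helper name sanitizeUTF8 is hypothetical, and the choice to replace invalid byte sequences with U+FFFD while removing NUL bytes outright is an assumption; the package may replace or drop either.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeUTF8 is a hypothetical helper: each run of invalid bytes becomes
// one U+FFFD and NUL bytes are removed. Whether the package replaces or
// drops either of them is an assumption of this sketch.
func sanitizeUTF8(s string) string {
	s = strings.ToValidUTF8(s, "\uFFFD")
	return strings.ReplaceAll(s, "\x00", "")
}

func main() {
	fmt.Printf("%q\n", sanitizeUTF8("Go\x00"))
	fmt.Printf("%q\n", sanitizeUTF8("a\xffb"))
}
```

Stripping NUL matters for persistence because some stores, PostgreSQL text columns among them, reject strings that contain it.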

func (Pipeline) SplitTokens

func (p Pipeline) SplitTokens() TokenPipeline

SplitTokens turns the current string pipeline into a token pipeline.

func (Pipeline) Then

func (p Pipeline) Then(stage Stage) Pipeline

Then returns a new pipeline with stage appended.
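Because Then returns a new pipeline, a shared base can be extended in several directions. The sketch below shows one way to make that safe: copy the stage slice on every append so derived pipelines never share a backing array. The types pipeline and stage and the helper suffix are hypothetical stand-ins; whether the package copies in this way is an assumption.

```go
package main

import "fmt"

type stage func(string) (string, error)

// pipeline is a hypothetical stand-in for the package's Pipeline type.
type pipeline struct {
	stages []stage
}

// then returns a new pipeline. It copies the slice so that two pipelines
// derived from the same base never overwrite each other's stages.
func (p pipeline) then(s stage) pipeline {
	next := make([]stage, len(p.stages), len(p.stages)+1)
	copy(next, p.stages)
	return pipeline{stages: append(next, s)}
}

// run applies the stages in order and stops at the first error.
func (p pipeline) run(in string) (string, error) {
	for _, st := range p.stages {
		var err error
		if in, err = st(in); err != nil {
			return "", err
		}
	}
	return in, nil
}

// suffix builds a stage that appends sfx; it exists only for the demo.
func suffix(sfx string) stage {
	return func(s string) (string, error) { return s + sfx, nil }
}

func main() {
	var base pipeline // the zero value is usable
	base = base.then(suffix("-base"))
	a := base.then(suffix("-a"))
	b := base.then(suffix("-b"))
	outA, _ := a.run("x")
	outB, _ := b.run("x")
	fmt.Println(outA, outB)
}
```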

func (Pipeline) TrimSpace

func (p Pipeline) TrimSpace() Pipeline

TrimSpace appends a trimming stage.

type PresetOption

type PresetOption func(*presetConfig)

PresetOption customizes preset builders.

func WithWidthFold

func WithWidthFold() PresetOption

WithWidthFold enables explicit width folding in a preset pipeline.
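PresetOption follows the functional-options pattern: each option is a function that mutates a private configuration struct before the preset assembles its stages. The sketch below mirrors that shape with hypothetical lowercase names; the widthFold field, the stage names, and their order are illustrative assumptions only.

```go
package main

import "fmt"

// presetConfig and presetOption are hypothetical stand-ins mirroring the
// documented shape `type PresetOption func(*presetConfig)`.
type presetConfig struct {
	widthFold bool
}

type presetOption func(*presetConfig)

// withWidthFold returns an option that switches width folding on.
func withWidthFold() presetOption {
	return func(c *presetConfig) { c.widthFold = true }
}

// buildStages lists the stage names a preset might assemble. The names and
// their order are illustrative, not the package's real preset contents.
func buildStages(opts ...presetOption) []string {
	var cfg presetConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	stages := []string{"sanitize"}
	if cfg.widthFold {
		stages = append(stages, "fold-width")
	}
	return append(stages, "trim")
}

func main() {
	fmt.Println(buildStages())
	fmt.Println(buildStages(withWidthFold()))
}
```

Keeping the configuration struct unexported means new options can be added later without changing any preset signature.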

type Stage

type Stage func(string) (string, error)

Stage transforms input text and may return an error.
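Since a stage is just a function, custom behavior can be written without touching the pipeline type. The sketch below shows two hypothetical helpers, lift and maxRunes, built against a local stage type that mirrors the documented signature. Neither helper exists in this package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// stage mirrors the documented Stage shape.
type stage func(string) (string, error)

// lift adapts an infallible string function to the stage shape.
func lift(fn func(string) string) stage {
	return func(s string) (string, error) { return fn(s), nil }
}

// maxRunes returns a validating stage that rejects input longer than n
// runes and passes shorter input through unchanged.
func maxRunes(n int) stage {
	return func(s string) (string, error) {
		if utf8.RuneCountInString(s) > n {
			return "", fmt.Errorf("input exceeds %d runes", n)
		}
		return s, nil
	}
}

func main() {
	upper := lift(strings.ToUpper)
	fmt.Println(upper("go"))
	fmt.Println(maxRunes(3)("gopher"))
}
```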

type TokenPipeline

type TokenPipeline struct {
	// contains filtered or unexported fields
}

TokenPipeline holds ordered token stages derived from a string pipeline.

func (TokenPipeline) FilterTokens

func (tp TokenPipeline) FilterTokens(fn func(string) bool) TokenPipeline

FilterTokens returns a new token pipeline that keeps only the tokens for which fn returns true.

func (TokenPipeline) JoinTokens

func (tp TokenPipeline) JoinTokens(sep string) Pipeline

JoinTokens joins token output back into a string pipeline.

func (TokenPipeline) MapTokens

func (tp TokenPipeline) MapTokens(fn func(string) string) TokenPipeline

MapTokens returns a new token pipeline that maps every token.

func (TokenPipeline) Run

func (tp TokenPipeline) Run(input string) ([]string, error)

Run executes the source pipeline, tokenizes its output, and applies the token stages in order.
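The split, map, filter, and join flow can be sketched with plain functions. Everything here is hypothetical: the helper names, the whitespace-based tokenization (the package's actual tokenization rule is not documented on this page), and the stop-word example.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize splits on Unicode whitespace. This rule is an assumption.
func tokenize(s string) []string { return strings.Fields(s) }

// mapTokens applies fn to every token.
func mapTokens(toks []string, fn func(string) string) []string {
	out := make([]string, 0, len(toks))
	for _, t := range toks {
		out = append(out, fn(t))
	}
	return out
}

// filterTokens keeps the tokens for which keep returns true.
func filterTokens(toks []string, keep func(string) bool) []string {
	out := make([]string, 0, len(toks))
	for _, t := range toks {
		if keep(t) {
			out = append(out, t)
		}
	}
	return out
}

// searchKey runs the source step (lowercasing), then tokenizes, maps,
// filters, and joins the tokens back into one string.
func searchKey(s string) string {
	toks := tokenize(strings.ToLower(s))
	toks = mapTokens(toks, func(t string) string { return strings.Trim(t, ".,!?") })
	toks = filterTokens(toks, func(t string) bool { return t != "" && t != "the" })
	return strings.Join(toks, " ")
}

func main() {
	fmt.Println(searchKey("  The quick, brown fox!  "))
}
```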

func (TokenPipeline) Then

func (tp TokenPipeline) Then(stage TokenStage) TokenPipeline

Then returns a new token pipeline with a stage appended.

type TokenStage

type TokenStage func([]string) ([]string, error)

TokenStage transforms token slices and may return an error.
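A token stage sees the whole slice, so it can do things a per-token map or filter cannot, such as removing duplicates. The sketch below defines a local tokenStage type mirroring the documented signature and a hypothetical dedupe stage; neither name is part of this package.

```go
package main

import "fmt"

// tokenStage mirrors the documented TokenStage shape.
type tokenStage func([]string) ([]string, error)

// dedupe is a hypothetical token stage that drops repeated tokens while
// keeping first-occurrence order. It never returns an error.
func dedupe(toks []string) ([]string, error) {
	seen := make(map[string]struct{}, len(toks))
	out := make([]string, 0, len(toks))
	for _, t := range toks {
		if _, ok := seen[t]; ok {
			continue
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out, nil
}

// Compile-time check that dedupe has the tokenStage shape.
var _ tokenStage = dedupe

func main() {
	fmt.Println(dedupe([]string{"go", "text", "go"}))
}
```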
