Documentation ¶
Overview ¶
Package textnorm provides fluent, deterministic text normalization pipelines.
Use the core Pipeline type, together with the SplitTokens(), MapTokens(), FilterTokens(), and JoinTokens() helpers, for explicit normalization flows. SearchPreset(), CanonicalPreset(), and DBSafePreset() provide thin, reusable presets for the common search, canonical, and persistence-safe cases; WithWidthFold() customizes them.
Benchmarks and fuzz targets live in this package so you can measure and harden the current implementation before any optimization work begins.
Examples:
textnorm.SearchPreset().Run(" Café, go! ")
textnorm.CanonicalPreset().Run(" Hello, World! ")
textnorm.DBSafePreset(textnorm.WithWidthFold()).Run(" Go\x00 ")
Streaming adapters are intentionally deferred until real usage proves they are worth the extra surface area.
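The fluent, value-semantics style described above can be sketched with a minimal stand-in. This is an independent illustration, not the package's implementation: a stage is assumed here to be func(string) (string, error), and only TrimSpace and Lower are modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// stage is a stand-in for one normalization step (an assumption;
// textnorm's real Stage definition is not shown in this doc).
type stage func(string) (string, error)

// pipeline mirrors the fluent, immutable style of textnorm.Pipeline:
// each method returns a new value, so the zero value is usable as-is.
type pipeline struct{ stages []stage }

func (p pipeline) then(s stage) pipeline {
	// Copy-on-append keeps earlier pipeline values independent.
	out := make([]stage, len(p.stages), len(p.stages)+1)
	copy(out, p.stages)
	return pipeline{stages: append(out, s)}
}

func (p pipeline) trimSpace() pipeline {
	return p.then(func(s string) (string, error) { return strings.TrimSpace(s), nil })
}

func (p pipeline) lower() pipeline {
	return p.then(func(s string) (string, error) { return strings.ToLower(s), nil })
}

func (p pipeline) run(input string) (string, error) {
	for _, s := range p.stages {
		var err error
		if input, err = s(input); err != nil {
			return "", err
		}
	}
	return input, nil
}

func main() {
	out, err := pipeline{}.trimSpace().lower().run("  Hello, World!  ")
	fmt.Println(out, err) // prints: hello, world! <nil>
}
```

Because every method returns a fresh value, partially built pipelines can be shared and extended in different directions without interference.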
Index ¶
- type Pipeline
- func (p Pipeline) CollapseWhitespace() Pipeline
- func (p Pipeline) FilterRunes(keep runes.Set) Pipeline
- func (p Pipeline) FoldCase() Pipeline
- func (p Pipeline) FoldWidth() Pipeline
- func (p Pipeline) Lower() Pipeline
- func (p Pipeline) MapRunes(fn func(rune) rune) Pipeline
- func (p Pipeline) NormalizeUnicode() Pipeline
- func (p Pipeline) RemoveAccents() Pipeline
- func (p Pipeline) Run(input string) (string, error)
- func (p Pipeline) SanitizeUTF8() Pipeline
- func (p Pipeline) SplitTokens() TokenPipeline
- func (p Pipeline) Then(stage Stage) Pipeline
- func (p Pipeline) TrimSpace() Pipeline
- type PresetOption
- type Stage
- type TokenPipeline
- func (tp TokenPipeline) FilterTokens(fn func(string) bool) TokenPipeline
- func (tp TokenPipeline) JoinTokens(sep string) Pipeline
- func (tp TokenPipeline) MapTokens(fn func(string) string) TokenPipeline
- func (tp TokenPipeline) Run(input string) ([]string, error)
- func (tp TokenPipeline) Then(stage TokenStage) TokenPipeline
- type TokenStage
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Pipeline ¶
type Pipeline struct {
// contains filtered or unexported fields
}
Pipeline holds an ordered list of normalization stages. The zero value is valid.
func CanonicalPreset ¶
func CanonicalPreset(opts ...PresetOption) Pipeline
CanonicalPreset builds a general-purpose canonicalization pipeline.
func DBSafePreset ¶
func DBSafePreset(opts ...PresetOption) Pipeline
DBSafePreset builds a persistence-safe normalization pipeline.
func SearchPreset ¶
func SearchPreset(opts ...PresetOption) Pipeline
SearchPreset builds a search-key pipeline.
func (Pipeline) CollapseWhitespace ¶
func (p Pipeline) CollapseWhitespace() Pipeline
CollapseWhitespace appends a whitespace-collapsing stage.
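One plausible implementation of such a stage, sketched with the standard library (the package's actual stage may differ): strings.Fields splits on any run of Unicode whitespace, so rejoining with a single space collapses tabs, newlines, and repeated spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseWhitespace replaces every run of Unicode whitespace with a
// single space. Note that strings.Fields also drops leading and
// trailing whitespace as a side effect.
func collapseWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", collapseWhitespace(" a\t b\n\nc ")) // prints: "a b c"
}
```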
func (Pipeline) FilterRunes ¶
func (p Pipeline) FilterRunes(keep runes.Set) Pipeline
FilterRunes appends a rune-filtering stage.
func (Pipeline) NormalizeUnicode ¶
func (p Pipeline) NormalizeUnicode() Pipeline
NormalizeUnicode appends a Unicode normalization stage.
func (Pipeline) RemoveAccents ¶
func (p Pipeline) RemoveAccents() Pipeline
RemoveAccents appends a diacritic-removal stage.
func (Pipeline) SanitizeUTF8 ¶
func (p Pipeline) SanitizeUTF8() Pipeline
SanitizeUTF8 appends a UTF-8 and NUL sanitization stage.
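A sketch of what such a stage likely does, using only the standard library (the package's exact behavior is not specified here): replace invalid UTF-8 sequences with U+FFFD and drop NUL bytes, which some datastores reject in text values.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeUTF8 repairs invalid UTF-8 and strips NUL bytes.
// strings.ToValidUTF8 replaces each run of invalid bytes with the
// replacement string; NUL is valid UTF-8, so it is removed separately.
func sanitizeUTF8(s string) string {
	s = strings.ToValidUTF8(s, "\uFFFD")
	return strings.ReplaceAll(s, "\x00", "")
}

func main() {
	fmt.Printf("%q\n", sanitizeUTF8("Go\x00od\xff"))
}
```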
func (Pipeline) SplitTokens ¶
func (p Pipeline) SplitTokens() TokenPipeline
SplitTokens turns the current string pipeline into a token pipeline.
type PresetOption ¶
type PresetOption func(*presetConfig)
PresetOption customizes preset builders.
func WithWidthFold ¶
func WithWidthFold() PresetOption
WithWidthFold enables explicit width folding in a preset pipeline.
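PresetOption follows Go's functional-options pattern. A minimal sketch of how WithWidthFold plugs into a preset builder; presetConfig's real fields are unexported and undocumented, so the widthFold field here is an assumption for illustration.

```go
package main

import "fmt"

// presetConfig stands in for the package's unexported config struct;
// the widthFold field is assumed for illustration only.
type presetConfig struct{ widthFold bool }

// presetOption mirrors the PresetOption shape: a function that
// mutates the config while it is being built.
type presetOption func(*presetConfig)

// withWidthFold mirrors WithWidthFold: it returns an option that
// enables the flag when a preset builder applies it.
func withWidthFold() presetOption {
	return func(c *presetConfig) { c.widthFold = true }
}

// buildConfig shows how a preset builder would apply its options:
// start from the zero-value defaults, then apply each option in order.
func buildConfig(opts ...presetOption) presetConfig {
	var c presetConfig
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Println(buildConfig(withWidthFold()).widthFold) // prints: true
}
```

This pattern is why presets stay thin: new knobs can be added as options without changing any preset's signature.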
type TokenPipeline ¶
type TokenPipeline struct {
// contains filtered or unexported fields
}
TokenPipeline holds ordered token stages derived from a string pipeline.
func (TokenPipeline) FilterTokens ¶
func (tp TokenPipeline) FilterTokens(fn func(string) bool) TokenPipeline
FilterTokens returns a new token pipeline that keeps matching tokens.
func (TokenPipeline) JoinTokens ¶
func (tp TokenPipeline) JoinTokens(sep string) Pipeline
JoinTokens joins token output back into a string pipeline.
func (TokenPipeline) MapTokens ¶
func (tp TokenPipeline) MapTokens(fn func(string) string) TokenPipeline
MapTokens returns a new token pipeline that maps every token.
func (TokenPipeline) Run ¶
func (tp TokenPipeline) Run(input string) ([]string, error)
Run executes the source string pipeline, tokenizes the result, and applies the token stages in order.
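The string-pipeline-then-token-stages flow can be sketched as follows. This is a stand-in, not the package's code: the tokenizer is assumed to split on whitespace, and TokenStage is assumed to be func([]string) ([]string, error), matching its doc comment ("transforms token slices and may return an error").

```go
package main

import (
	"fmt"
	"strings"
)

// tokenStage is an assumed shape for TokenStage; the real definition
// is not shown in this documentation.
type tokenStage func([]string) ([]string, error)

// dropShort is an example token stage: it keeps only tokens longer
// than two bytes and never fails.
func dropShort(ts []string) ([]string, error) {
	var out []string
	for _, t := range ts {
		if len(t) > 2 {
			out = append(out, t)
		}
	}
	return out, nil
}

// runTokens sketches the Run flow: normalize the string first (here a
// fixed trim+lowercase stands in for the source pipeline), split it
// into whitespace-separated tokens, then apply each token stage.
func runTokens(input string, stages ...tokenStage) ([]string, error) {
	tokens := strings.Fields(strings.ToLower(strings.TrimSpace(input)))
	for _, st := range stages {
		var err error
		if tokens, err = st(tokens); err != nil {
			return nil, err
		}
	}
	return tokens, nil
}

func main() {
	out, err := runTokens("  Go is FUN  ", dropShort)
	fmt.Println(out, err) // prints: [fun] <nil>
}
```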
func (TokenPipeline) Then ¶
func (tp TokenPipeline) Then(stage TokenStage) TokenPipeline
Then returns a new token pipeline with a stage appended.
type TokenStage ¶
TokenStage transforms token slices and may return an error.