Documentation ¶
Overview ¶
Package stringutil provides string manipulation utilities.
The package is organized into several categories:
Search and Indexing ¶
Functions for finding substrings and patterns:
indices := stringutil.AllIndexes("banana", "an") // [1, 3]
ok := stringutil.HasAnyPrefix(s, "http://", "https://")
ok := stringutil.ContainsAll(s, "foo", "bar")
Transformation ¶
Functions for transforming strings:
reversed := stringutil.Reverse("hello") // "olleh"
truncated := stringutil.Truncate(s, 100, "...")
padded := stringutil.PadLeft("42", 5, '0') // "00042"
Validation ¶
Functions for checking string properties:
if stringutil.IsNumeric(s) { ... }
if stringutil.IsAlpha(s) { ... }
if stringutil.IsPalindrome(s, true) { ... }
Similarity (see similarity.go) ¶
Algorithms for measuring string similarity:
distance := stringutil.LevenshteinDistance("kitten", "sitting") // 3
score := stringutil.JaroWinklerSimilarity("martha", "marhta", 0.1) // ~0.96
coefficient := stringutil.DiceCoefficient("night", "nacht")
All functions are designed to handle empty strings and other edge cases gracefully.
Index ¶
- func AllIndexes(s, substr string) []int
- func Between(s, start, end string) (string, bool)
- func BetweenAll(s, start, end string) []string
- func CamelCase(s string) string
- func Capitalize(s string) string
- func CleanString(input string, options ...CleanOption) (string, error)
- func CommonPrefix(strs ...string) string
- func CommonSuffix(strs ...string) string
- func ContainsAll(s string, substrs ...string) bool
- func ContainsAny(s string, substrs ...string) bool
- func CosineSimilarity(s1, s2 string, n int) float64
- func CountLines(s string) int
- func DamerauLevenshteinDistance(s1, s2 string) int
- func Dedent(s string) string
- func DiceCoefficient(s1, s2 string) float64
- func HammingDistance(s1, s2 string) int
- func HasAnyPrefix(s string, prefixes ...string) bool
- func HasAnySuffix(s string, suffixes ...string) bool
- func Indent(s, prefix string) string
- func IsASCII(s string) bool
- func IsAlpha(s string) bool
- func IsAlphanumeric(s string) bool
- func IsBlank(s string) bool
- func IsEmpty(s string) bool
- func IsLower(s string) bool
- func IsNumeric(s string) bool
- func IsPalindrome(s string, normalize bool) bool
- func IsPrintable(s string) bool
- func IsUpper(s string) bool
- func JaroSimilarity(s1, s2 string) float64
- func JaroWinklerSimilarity(s1, s2 string, prefixScale float64) float64
- func Join(elems []string, sep string) string
- func KebabCase(s string) string
- func LevenshteinDistance(s1, s2 string) int
- func LevenshteinSimilarity(s1, s2 string) float64
- func Lines(s string) []string
- func LongestCommonSubsequence(s1, s2 string) int
- func LongestCommonSubstring(s1, s2 string) string
- func NormalizeUnicode(s string) (string, error)
- func NormalizeWhitespace(s string) string
- func NthRune(s string, n int) (rune, bool)
- func PadCenter(s string, length int, padChar rune) string
- func PadLeft(s string, length int, padChar rune) string
- func PadRight(s string, length int, padChar rune) string
- func PascalCase(s string) string
- func RemoveAccents(s string) (string, error)
- func RemoveAll(s string, substrs ...string) string
- func RemoveNonPrintable(s string) string
- func Repeat(s string, n int) string
- func Reverse(s string) string
- func RuneCount(s string) int
- func SafeSlice(s string, start, end int) string
- func SanitizeUTF8(s string) string
- func Slugify(s string) (string, error)
- func SnakeCase(s string) string
- func SplitAfter(s, sep string) []string
- func SplitAndTrim(s, sep string) []string
- func SplitN(s, sep string, n int) []string
- func StripHTMLEntities(s string) string
- func StripTags(s string) string
- func SwapCase(s string) string
- func Title(s string) string
- func ToASCII(s string) (string, error)
- func Truncate(s string, maxLen int, suffix string) string
- func TruncateRunes(s string, maxLen int) string
- func TruncateWords(s string, maxLen int, suffix string) string
- func Words(s string) []string
- func Wrap(s string, width int) string
- type CleanOption
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AllIndexes ¶
AllIndexes returns all starting positions of substr in s. Returns nil if substr is empty or s doesn't contain substr.
Example:
indices := AllIndexes("banana", "an")
// indices = [1, 3]
func Between ¶
Between extracts the substring between start and end markers. Returns empty string and false if markers not found in proper order.
Example:
Between("[hello]", "[", "]") // "hello", true
func BetweenAll ¶
BetweenAll extracts all substrings between start and end markers.
Example:
BetweenAll("a[1]b[2]c[3]", "[", "]") // ["1", "2", "3"]
func CamelCase ¶
CamelCase converts s to camelCase.
Example:
CamelCase("hello_world") // "helloWorld"
CamelCase("hello-world") // "helloWorld"
func Capitalize ¶
Capitalize returns s with the first character uppercased and the rest lowercased.
Example:
Capitalize("hELLO") // "Hello"
func CleanString ¶ added in v1.1.0
func CleanString(input string, options ...CleanOption) (string, error)
CleanString applies the specified cleaning options to the input string. Options are applied in a fixed order for consistency and correctness:
- HTML stripping (remove tags, decode entities)
- Unicode normalization (NFKD + diacritic removal)
- Database sanitization (UTF-8 validation, NUL removal, truncation)
This order ensures that:
- HTML entities are decoded before Unicode normalization processes the text
- Database constraints (length, encoding) are applied last on the final result
Returns the cleaned string. Error is returned only for transformer failures (extremely unlikely with valid Go strings).
If no options are provided, the input string is returned unchanged.
Example:
// Apply all three cleaning modes:
result, err := CleanString(
"<p>Héllo &amp; Wörld</p>",
WithHTMLStrip(),
WithUnicodeNorm(),
WithDBSanitize(50),
)
// result = "Hello & World"
// Apply only Unicode normalization:
result, _ := CleanString("café", WithUnicodeNorm())
// result = "cafe"
func CommonPrefix ¶
CommonPrefix returns the longest common prefix of the given strings. Returns empty string if no common prefix or fewer than 2 strings.
Example:
CommonPrefix("interstellar", "internet", "internal") // "inter"
func CommonSuffix ¶
CommonSuffix returns the longest common suffix of the given strings.
func ContainsAll ¶
ContainsAll reports whether s contains all of the given substrings.
func ContainsAny ¶
ContainsAny reports whether s contains any of the given substrings.
Example:
if ContainsAny(text, "error", "fail", "warning") { ... }
func CosineSimilarity ¶
CosineSimilarity computes the cosine similarity of two strings based on their character n-gram vectors. Returns a value between 0 and 1.
This is useful for comparing longer texts.
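The n-gram approach described above can be sketched as follows (a minimal rune-based illustration, not the package's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// ngramCounts builds a frequency map of character n-grams (rune-based).
func ngramCounts(s string, n int) map[string]float64 {
	counts := make(map[string]float64)
	r := []rune(s)
	for i := 0; i+n <= len(r); i++ {
		counts[string(r[i:i+n])]++
	}
	return counts
}

// cosineSimilarity computes dot(v1, v2) / (|v1| * |v2|) over the
// n-gram count vectors of the two strings.
func cosineSimilarity(s1, s2 string, n int) float64 {
	v1, v2 := ngramCounts(s1, n), ngramCounts(s2, n)
	var dot, norm1, norm2 float64
	for g, c := range v1 {
		dot += c * v2[g]
		norm1 += c * c
	}
	for _, c := range v2 {
		norm2 += c * c
	}
	if norm1 == 0 || norm2 == 0 {
		return 0
	}
	return dot / (math.Sqrt(norm1) * math.Sqrt(norm2))
}

func main() {
	// "night" and "nacht" share only the bigram "ht": 1/(2*2) = 0.25.
	fmt.Printf("%.2f\n", cosineSimilarity("night", "nacht", 2)) // 0.25
}
```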
func CountLines ¶
CountLines returns the number of lines in s. An empty string returns 0; a string without newlines returns 1.
func DamerauLevenshteinDistance ¶
DamerauLevenshteinDistance extends Levenshtein to include transpositions (swapping two adjacent characters) as a single edit operation.
Example:
DamerauLevenshteinDistance("ca", "ac") // 1 (transposition)
LevenshteinDistance("ca", "ac") // 2 (delete + insert)
func Dedent ¶
Dedent removes common leading whitespace from all lines.
Example:
Dedent(" a\n b\n c") // "a\nb\nc"
func DiceCoefficient ¶
DiceCoefficient returns the Sørensen–Dice coefficient comparing bigrams. Returns a value between 0 and 1, where 1 means identical sets of bigrams.
This metric is useful for comparing short strings or when order matters less.
Example:
DiceCoefficient("night", "nacht") // ~0.25
func HammingDistance ¶
HammingDistance returns the number of positions where corresponding characters differ. Only defined for strings of equal length. Returns -1 if strings have different lengths.
Example:
HammingDistance("karolin", "kathrin") // 3
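A rune-based sketch of this contract (comparing code points, not bytes, which is an assumption about the package's behavior):

```go
package main

import "fmt"

// hammingDistance counts positions where the runes differ; it returns
// -1 for strings of unequal rune length, per the documented contract.
func hammingDistance(s1, s2 string) int {
	r1, r2 := []rune(s1), []rune(s2)
	if len(r1) != len(r2) {
		return -1
	}
	d := 0
	for i := range r1 {
		if r1[i] != r2[i] {
			d++
		}
	}
	return d
}

func main() {
	fmt.Println(hammingDistance("karolin", "kathrin")) // 3
}
```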
func HasAnyPrefix ¶
HasAnyPrefix reports whether s starts with any of the given prefixes.
Example:
if HasAnyPrefix(url, "http://", "https://") { ... }
func HasAnySuffix ¶
HasAnySuffix reports whether s ends with any of the given suffixes.
func Indent ¶
Indent adds prefix to the beginning of each line in s.
Example:
Indent("a\nb\nc", " ") // " a\n b\n c"
func IsAlphanumeric ¶
IsAlphanumeric reports whether s contains only letters and digits.
func IsLower ¶
IsLower reports whether all letters in s are lowercase. Returns true for strings with no letters.
func IsPalindrome ¶
IsPalindrome reports whether s reads the same forwards and backwards. When normalize is true, the comparison ignores case, whitespace, and punctuation; otherwise it is exact and case-sensitive.
Example:
IsPalindrome("racecar", false) // true
IsPalindrome("A man a plan a canal Panama", true) // true (normalized)
func IsPrintable ¶
IsPrintable reports whether s contains only printable characters.
func IsUpper ¶
IsUpper reports whether all letters in s are uppercase. Returns true for strings with no letters.
func JaroSimilarity ¶
JaroSimilarity returns the Jaro similarity between two strings. Returns a value between 0 (completely different) and 1 (identical).
The algorithm considers:
- Number of matching characters
- Number of transpositions
Example:
JaroSimilarity("martha", "marhta") // ~0.944
func JaroWinklerSimilarity ¶
JaroWinklerSimilarity returns the Jaro-Winkler similarity between two strings. This is an extension of Jaro that gives more weight to strings with a common prefix.
The prefixScale parameter (0 to 0.25) determines how much weight to give to the common prefix. Standard value is 0.1.
Example:
JaroWinklerSimilarity("martha", "marhta", 0.1) // ~0.961
func KebabCase ¶
KebabCase converts s to kebab-case.
Example:
KebabCase("HelloWorld") // "hello-world"
func LevenshteinDistance ¶
LevenshteinDistance returns the minimum number of single-character edits (insertions, deletions, substitutions) required to change s1 into s2.
Time complexity: O(len(s1) * len(s2)).
Space complexity: O(min(len(s1), len(s2))).
Example:
LevenshteinDistance("kitten", "sitting") // 3
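The stated space bound comes from the classic two-row dynamic program, sketched below (an illustration of the technique, not necessarily the package's exact code):

```go
package main

import "fmt"

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshteinDistance keeps only two DP rows. prev[j] is the distance
// between the first i-1 runes of r1 and the first j runes of r2; the
// rows are sized to the shorter string, giving O(min(len(s1), len(s2)))
// space.
func levenshteinDistance(s1, s2 string) int {
	r1, r2 := []rune(s1), []rune(s2)
	if len(r1) < len(r2) {
		r1, r2 = r2, r1 // iterate over the longer string, keep rows short
	}
	prev := make([]int, len(r2)+1)
	curr := make([]int, len(r2)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(r1); i++ {
		curr[0] = i
		for j := 1; j <= len(r2); j++ {
			cost := 1
			if r1[i-1] == r2[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(r2)]
}

func main() {
	fmt.Println(levenshteinDistance("kitten", "sitting")) // 3
}
```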
func LevenshteinSimilarity ¶
LevenshteinSimilarity returns a similarity score between 0 and 1 based on Levenshtein distance. 1 means identical strings.
Example:
LevenshteinSimilarity("hello", "hallo") // ~0.8
func Lines ¶
Lines splits s into lines. Unlike strings.Split, it handles both \n and \r\n line endings.
Example:
lines := Lines("a\nb\nc") // ["a", "b", "c"]
func LongestCommonSubsequence ¶
LongestCommonSubsequence returns the length of the longest common subsequence. A subsequence is a sequence that can be derived by deleting some elements without changing the order of remaining elements.
Example:
LongestCommonSubsequence("ABCDGH", "AEDFHR") // 3 ("ADH")
func LongestCommonSubstring ¶
LongestCommonSubstring returns the longest common contiguous substring.
Example:
LongestCommonSubstring("ABABC", "BABCA") // "BABC"
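The standard DP for this problem tracks, for each pair of positions, the length of the common run ending there; a two-row sketch (an illustration, not the package's actual implementation):

```go
package main

import "fmt"

// longestCommonSubstring: curr[j] is the length of the common substring
// ending at r1[i-1] and r2[j-1]. We remember where the longest run ends
// in r1 and slice it out at the end.
func longestCommonSubstring(s1, s2 string) string {
	r1, r2 := []rune(s1), []rune(s2)
	prev := make([]int, len(r2)+1)
	curr := make([]int, len(r2)+1)
	best, bestEnd := 0, 0
	for i := 1; i <= len(r1); i++ {
		for j := 1; j <= len(r2); j++ {
			if r1[i-1] == r2[j-1] {
				curr[j] = prev[j-1] + 1
				if curr[j] > best {
					best, bestEnd = curr[j], i
				}
			} else {
				curr[j] = 0
			}
		}
		prev, curr = curr, prev
		for j := range curr {
			curr[j] = 0
		}
	}
	return string(r1[bestEnd-best : bestEnd])
}

func main() {
	fmt.Println(longestCommonSubstring("ABABC", "BABCA")) // "BABC"
}
```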
func NormalizeUnicode ¶ added in v1.1.0
NormalizeUnicode applies NFKD normalization and removes combining marks (diacritics) from the input string.
The process:
- NFKD (Compatibility Decomposition): decomposes characters into their base form + combining marks (e.g., "é" → "e" + combining acute accent)
- Remove combining marks: strips all Unicode Mn (Mark, Nonspacing) characters
- Recompose to NFC for consistent output
Characters that are not letters or numbers are preserved as-is (spaces, punctuation, etc.).
Example:
NormalizeUnicode("café résumé") // "cafe resume", nil
NormalizeUnicode("naïve") // "naive", nil
NormalizeUnicode("Ångström") // "Angstrom", nil
func NormalizeWhitespace ¶ added in v1.1.0
NormalizeWhitespace collapses all consecutive whitespace characters (spaces, tabs, newlines) into a single space, and trims leading/trailing whitespace.
This is useful for cleaning user input or text extracted from HTML where whitespace may be irregular.
Example:
NormalizeWhitespace(" Hello World \n\t ") // "Hello World"
NormalizeWhitespace("\t\n") // ""
func NthRune ¶
NthRune returns the rune at rune position n (0-indexed). Returns (0, false) if n is out of bounds.
The position is counted in runes, not bytes, so multi-byte UTF-8 strings are handled correctly.
func PadCenter ¶
PadCenter centers s by adding padChar on both sides. If an odd amount of padding is needed, the extra character goes on the right.
Example:
PadCenter("hello", 11, '*') // "***hello***"
func PadLeft ¶
PadLeft pads s on the left with padChar to reach the target length. If s is already >= length, returns s unchanged.
Example:
PadLeft("42", 5, '0') // "00042"
func PadRight ¶
PadRight pads s on the right with padChar to reach the target length.
Example:
PadRight("42", 5, '0') // "42000"
func PascalCase ¶
PascalCase converts s to PascalCase.
Example:
PascalCase("hello_world") // "HelloWorld"
func RemoveAccents ¶ added in v1.1.0
RemoveAccents is an alias for NormalizeUnicode that removes diacritical marks from characters. This is a common operation name used in many string-processing libraries.
Example:
RemoveAccents("café") // "cafe", nil
RemoveAccents("über") // "uber", nil
func RemoveAll ¶
RemoveAll removes all occurrences of the given substrings from s.
Example:
clean := RemoveAll("hello world", "l", "o") // "he wrd"
func RemoveNonPrintable ¶ added in v1.1.0
RemoveNonPrintable removes all non-printable characters from s, except for common whitespace (space, tab, newline, carriage return).
This is useful for cleaning user input that may contain control characters, zero-width characters, or other invisible Unicode characters.
Example:
RemoveNonPrintable("Hello\x07World") // "HelloWorld" (bell character removed)
RemoveNonPrintable("Hello\tWorld\n") // "Hello\tWorld\n" (whitespace preserved)
func Repeat ¶
Repeat returns s repeated n times. If n <= 0, returns empty string.
Example:
Repeat("ab", 3) // "ababab"
func Reverse ¶
Reverse returns s with its characters in reverse order. Correctly handles multi-byte UTF-8 characters.
Example:
rev := Reverse("hello") // "olleh"
rev = Reverse("日本語") // "語本日"
func RuneCount ¶
RuneCount returns the number of runes (Unicode code points) in s. This differs from len(s), which returns bytes.
Example:
RuneCount("日本語") // 3
len("日本語") // 9 (bytes)
func SafeSlice ¶
SafeSlice safely slices s by rune indices, returning an empty string for invalid ranges. Useful when working with user input where indices might be out of bounds.
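A minimal sketch of the documented contract (rejecting, rather than clamping, out-of-range indices is an assumption consistent with "empty string for invalid ranges"):

```go
package main

import "fmt"

// safeSlice slices by rune indices and returns "" instead of panicking
// when the range is negative, reversed, or past the end of the string.
func safeSlice(s string, start, end int) string {
	r := []rune(s)
	if start < 0 || end > len(r) || start > end {
		return ""
	}
	return string(r[start:end])
}

func main() {
	fmt.Println(safeSlice("日本語", 0, 2))  // "日本"
	fmt.Println(safeSlice("hello", 3, 99)) // ""
}
```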
func SanitizeUTF8 ¶ added in v1.1.0
SanitizeUTF8 ensures the string contains only valid UTF-8 and removes NUL bytes (\x00) which are problematic for most databases.
Invalid UTF-8 byte sequences are replaced with U+FFFD (Unicode replacement character), following Go's standard behavior.
Example:
SanitizeUTF8("Hello\x00World") // "HelloWorld"
SanitizeUTF8("Hello\xffWorld") // "Hello\uFFFDWorld"
SanitizeUTF8("Valid UTF-8 string") // "Valid UTF-8 string" (unchanged)
func Slugify ¶ added in v1.1.0
Slugify converts a string to a URL-friendly slug. It normalizes Unicode, lowercases, replaces non-alphanumeric characters with hyphens, collapses multiple hyphens, and trims leading/trailing hyphens.
Example:
Slugify("Hello, World!") // "hello-world", nil
Slugify("Café Résumé") // "cafe-resume", nil
Slugify(" Multiple Spaces ") // "multiple-spaces", nil
func SnakeCase ¶
SnakeCase converts s to snake_case.
Example:
SnakeCase("HelloWorld") // "hello_world"
SnakeCase("helloWorld") // "hello_world"
func SplitAfter ¶
SplitAfter splits s after each instance of sep. Wrapper around strings.SplitAfter for consistency.
func SplitAndTrim ¶ added in v1.1.0
SplitAndTrim splits s by sep, trims whitespace from each token, and drops any tokens that are empty after trimming.
This is one of the most common string-processing patterns in Go services: parsing comma-separated config values, CSV-like user input, and HTTP header lists all call for split + trim + discard-blanks.
Example:
SplitAndTrim(" a , b , , c ", ",") // ["a", "b", "c"]
SplitAndTrim("", ",") // nil
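The pattern is simple enough to sketch in a few lines of stdlib Go (an illustration of the documented behavior, not the package's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitAndTrim: split, trim each token, drop blanks. Returning a nil
// slice (not an empty one) when nothing survives matches the documented
// behavior for "".
func splitAndTrim(s, sep string) []string {
	var out []string
	for _, part := range strings.Split(s, sep) {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Println(splitAndTrim(" a , b , , c ", ",")) // [a b c]
}
```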
func SplitN ¶
SplitN splits s by sep into at most n parts. If n <= 0, returns all parts (same as strings.Split). Wrapper around strings.SplitN for consistency.
func StripHTMLEntities ¶ added in v1.1.0
StripHTMLEntities removes all HTML/XML tags and decodes HTML entities.
Tag removal handles:
- Standard tags: <p>, <br/>, <div class="x">
- Self-closing tags: <br />, <img />
- Script and style tags: the tags are removed but their content is kept (for full script removal, use a proper HTML parser)
Entity decoding handles (via html.UnescapeString):
- Named entities: &amp; &lt; &gt; &quot; &apos; &copy; etc.
- Decimal numeric: &#169; &#8212;
- Hex numeric: &#xA9; &#x2014;
Example:
StripHTMLEntities("<p>Hello &amp; World</p>") // "Hello & World"
StripHTMLEntities("Price: &euro;10") // "Price: €10"
StripHTMLEntities("5 &gt; 3 &amp;&amp; 2 &lt; 4") // "5 > 3 && 2 < 4"
func StripTags ¶
StripTags removes HTML/XML tags from s. This is a simple implementation that may not handle all edge cases.
Example:
StripTags("<p>Hello <b>World</b></p>") // "Hello World"
func SwapCase ¶
SwapCase swaps the case of each letter in s.
Example:
SwapCase("Hello World") // "hELLO wORLD"
func Title ¶
Title returns s with the first character of each word uppercased.
Example:
Title("hello world") // "Hello World"
func ToASCII ¶ added in v1.1.0
ToASCII converts a Unicode string to its closest ASCII representation by removing diacritics, replacing non-letter/non-number characters with spaces, and collapsing whitespace.
This is useful for generating slugs, search keys, or filenames from Unicode input.
Example:
ToASCII("Héllo, Wörld!") // "Hello World", nil
ToASCII("café résumé") // "cafe resume", nil
func Truncate ¶
Truncate shortens s to maxLen characters, appending suffix if truncated. The total length including suffix will not exceed maxLen.
Example:
Truncate("Hello World", 8, "...") // "Hello..."
func TruncateRunes ¶ added in v1.1.0
TruncateRunes truncates s to at most maxLen runes. Unlike byte-level truncation, this is Unicode-safe and will never split a multi-byte character.
If maxLen <= 0, returns empty string. If s has fewer runes than maxLen, returns s unchanged.
Example:
TruncateRunes("Hello, 世界!", 8) // "Hello, 世" (Unicode-safe, not "Hello, \xe4")
TruncateRunes("café", 3) // "caf"
func TruncateWords ¶
TruncateWords truncates s to at most maxLen characters, appending suffix if truncated. It breaks at a word boundary rather than mid-word where possible.
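One plausible implementation of this behavior, sketched byte-based for simplicity (the package's exact boundary rules and rune handling are assumptions):

```go
package main

import (
	"fmt"
	"strings"
)

// truncateWords cuts s to maxLen minus the suffix length, then backs up
// to the last space so no word is split, and appends the suffix.
func truncateWords(s string, maxLen int, suffix string) string {
	if len(s) <= maxLen {
		return s
	}
	cut := maxLen - len(suffix)
	if cut <= 0 {
		return suffix
	}
	truncated := s[:cut]
	if i := strings.LastIndex(truncated, " "); i > 0 {
		truncated = truncated[:i] // back up to the word boundary
	}
	return strings.TrimRight(truncated, " ") + suffix
}

func main() {
	fmt.Println(truncateWords("the quick brown fox", 15, "...")) // "the quick..."
}
```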
Types ¶
type CleanOption ¶ added in v1.1.0
type CleanOption func(*cleanConfig)
CleanOption configures a cleaning step for CleanString. Options are applied in a fixed, safe order regardless of the order they are passed:
- HTML stripping (first, to remove markup before text processing)
- Unicode normalization (second, to normalize the text content)
- Database sanitization (last, to enforce encoding/length constraints)
func WithDBSanitize ¶ added in v1.1.0
func WithDBSanitize(maxLen int) CleanOption
WithDBSanitize enables database sanitization:
- Replaces invalid UTF-8 sequences with U+FFFD (replacement character)
- Removes NUL bytes (\x00), which break PostgreSQL, MySQL, and other databases
- Optionally truncates to maxLen runes (0 = no truncation)
Truncation is rune-aware: it will never cut a multi-byte character in half.
Note: This does NOT escape SQL. Use parameterized queries for SQL injection prevention. This function handles encoding-level sanitization only.
Example:
result, _ := CleanString("Hello\x00World", WithDBSanitize(0))
// result = "HelloWorld"
result, _ := CleanString("Hello World", WithDBSanitize(5))
// result = "Hello"
func WithHTMLStrip ¶ added in v1.1.0
func WithHTMLStrip() CleanOption
WithHTMLStrip enables HTML tag removal and entity decoding. All HTML/XML tags are stripped, and HTML entities are decoded to their Unicode equivalents (e.g., &amp; → &, &lt; → <, &nbsp; → space, &copy; → ©).
Uses Go's standard html.UnescapeString for entity decoding, which handles all named HTML entities as well as decimal (&#123;) and hex (&#x7B;) numeric entities.
Example:
result, _ := CleanString("<p>Hello &amp; World</p>", WithHTMLStrip())
// result = "Hello & World"
func WithUnicodeNorm ¶ added in v1.1.0
func WithUnicodeNorm() CleanOption
WithUnicodeNorm enables Unicode normalization: NFKD decomposition followed by removal of combining marks (diacritics). This converts characters like "é" → "e", "ñ" → "n", "ü" → "u".
This is useful for search indexing, comparison, and ensuring ASCII-compatible text from Unicode input.
Example:
result, _ := CleanString("café résumé", WithUnicodeNorm())
// result = "cafe resume"