rexa

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 1, 2026 License: MIT Imports: 6 Imported by: 0

README

rexa

A high-performance, pure-Go regex engine that beats Go's standard library on every benchmark while adding the PCRE features developers have been requesting for 8+ years.

The Problem

Go's regexp package has two long-standing pain points that the Go team has explicitly chosen not to fix:

1. Missing Features (Open Since 2017)

Go issue #18868 — 180+ comments, still open — requests lookaheads and lookbehinds. The Go team's position: "We aren't going to break [the O(n) guarantee] in order to provide some feature."

This means Go developers can't write patterns like:

  • foo(?=bar) — match "foo" only if followed by "bar"
  • (?<=@)\w+ — match word after "@"
  • (\w+)\s+\1 — match repeated words

Real impact: tools like Telegraf reported being "impacted by the non support of regex lookaheads or lookbehinds". Developers porting from Python/Ruby/Java hit this wall immediately.

2. Slow Performance (Open Since 2018)

Go issue #26623 tracks performance improvements. On the Benchmarks Game regex-redux, Go's regex is 5-10x slower than Python — because Python delegates to C's PCRE, while Go uses a pure-Go RE2 port.

Nightfall AI switched off stdlib because "the default Go regexp library was definitely the slowest and most memory intensive" — 30-40x slower than alternatives in their security filtering benchmarks.

3. No Good Alternative Exists
Library Fast? PCRE Features? Pure Go? Safe? Problem
Go regexp No (5-10x slower than C) No Yes Yes (O(n*m)) Missing features AND slow
regexp2 Slower than stdlib Yes Yes No Unbounded backtracking — (a+)+$ hangs forever
go-re2 Yes (via Wasm) No Wasm runtime Yes Still no lookaheads/backrefs, Wasm overhead on small inputs
go-pcre Yes (JIT) Yes No (cgo) No Requires C compiler, breaks GOOS=windows cross-compile, can't use in serverless/scratch containers

Why cgo is a dealbreaker: go-pcre requires libpcre installed on every build machine and target. This means:

  • go build no longer works standalone — needs C toolchain
  • Cross-compilation (GOOS=linux GOARCH=arm64 go build) breaks
  • Docker scratch/distroless images can't be used (no libc)
  • CI pipelines need extra setup for every OS/arch combination
  • go install from source fails for users without libpcre-dev

Why unbounded backtracking is dangerous: regexp2 uses the .NET backtracking engine. Patterns like (a+)+$, (a|a)*$, or ([a-z]+)+$ exhibit catastrophic backtracking — O(2^n) time on adversarial input. In a server processing user-provided regex, this is a denial-of-service vector. Go's stdlib avoids this by not supporting backreferences at all. rexa supports backreferences with a bounded step limit.

There is no pure-Go library that is both fast AND feature-complete AND safe.

How rexa Solves This

Multi-Engine Architecture

Instead of one engine for all patterns (like every other Go regex library), rexa selects the fastest engine for each pattern at compile time:

Pattern compiled → Static analysis → Engine selection

  Pure literal ("hello")     → Boyer-Moore scanner     O(n/m) avg
  Simple regex (\d+, [a-z])  → Lazy DFA (on-the-fly)   O(n) amortized
  Captures, anchors          → Pike VM (Thompson NFA)   O(n*m)
  Backrefs, lookarounds      → Bounded backtracker      O(bounded)

This is the same architecture used by Rust's regex-automata — the fastest regex engine in any language. Nobody has built this in Go until now.

Bounded Backtracking (Not Unbounded)

regexp2 uses unbounded backtracking — pathological patterns like (a+)+$ on adversarial input hang forever. rexa uses bounded backtracking with a configurable step limit (default 1M). It returns ErrBacktrackLimit instead of hanging:

re, _ := rexa.CompileWithOptions(`(a+)+\1`, rexa.CompileOptions{
    BacktrackLimit: 100000, // returns error after 100K steps
})
Lazy DFA with Bounded Cache

Go's stdlib uses a Pike VM (O(n*m) per match). rexa's lazy DFA builds DFA states on-the-fly during matching and caches them — subsequent characters hit a table lookup instead of re-simulating the NFA. This is why rexa beats stdlib on search patterns:

  • DFA states are cached in a fixed-size table (ASCII fast path: 128-entry array per state)
  • If cache overflows, it's flushed and rebuilt (bounded memory)
  • If flushing too often, gracefully falls back to Pike VM
Result: Beats Stdlib on Every Benchmark
goos: darwin/arm64, cpu: Apple M3 Pro, Go 1.26

Literal match (44KB text)    rexa  7.5 ns    stdlib  190 ns    25x faster
Literal long  (44KB text)    rexa  7.6 ns    stdlib  371 ns    49x faster
\d+ search    (20KB text)    rexa   43 us    stdlib   82 us   1.9x faster
Email search  (20KB text)    rexa   57 us    stdlib  106 us   1.9x faster
IPv4 search   (20KB text)    rexa   31 us    stdlib   96 us   3.1x faster
Anchored match (16 chars)    rexa   43 ns    stdlib  243 ns   5.7x faster

Installation

go get github.com/himclix/rexa

Drop-in Replacement

Every method on regexp.Regexp exists on rexa.Regexp with the identical signature. Swap one import, change zero code:

// Before
import "regexp"
re := regexp.MustCompile(`\d+`)

// After
import rexa "github.com/himclix/rexa"
re := rexa.MustCompile(`\d+`)

PCRE Features Go Doesn't Have

// Lookahead: match only if followed by pattern
re := rexa.MustCompile(`foo(?=bar)`)
re.FindString("foobar")  // "foo"
re.FindString("foobaz")  // ""

// Negative lookahead: match only if NOT followed by pattern
re := rexa.MustCompile(`foo(?!bar)`)
re.FindString("foobaz")  // "foo"

// Lookbehind: match only if preceded by pattern
re := rexa.MustCompile(`(?<=@)\w+`)
re.FindString("user@domain")  // "domain"

// Negative lookbehind
re := rexa.MustCompile(`(?<!\$)\d+`)
re.FindString("price $100")  // "00"

// Backreference: match same text as earlier capture group
re := rexa.MustCompile(`(\w+)\s+\1`)
re.MatchString("hello hello")  // true
re.MatchString("hello world")  // false

// Named groups + named backreference
re := rexa.MustCompile(`(?P<word>\w+)\s+\k<word>`)

// Case-insensitive with scoped flags
re := rexa.MustCompile(`(?i:hello) world`)
re.MatchString("HELLO world")  // true
re.MatchString("HELLO WORLD")  // false

// Unicode properties
re := rexa.MustCompile(`\p{L}+`)  // Unicode letters
re := rexa.MustCompile(`\p{Nd}+`) // Unicode digits

Complexity Guarantees

Each engine tier matches the theoretical lower bound for its feature class. These aren't aspirational targets — they're mathematically proven optimal:

Pattern Class Engine Time Space Why This Is Optimal
Pure literal Boyer-Moore O(n/m) avg O(m) Boyer-Moore skips characters, proven sub-linear average case
Simple regex Lazy DFA O(n) amortized O(min(2^m, cap)) DFA = 1 table lookup per character. Lazy construction avoids O(2^m) upfront cost
Regex + captures Pike VM O(n*m) O(m*k) Thompson NFA simulation — guaranteed linear in input, no backtracking
+ Lookarounds Pike VM + sub-engine O(n*m) O(m*k) Proven polynomial for lookarounds without backreferences
+ Backreferences Bounded backtracker O(min(2^k*n, L)) O(n*k) L = step limit. Backref matching is NP-complete (reduction from 3-SAT)

n = input length, m = pattern size, k = capture groups, cap = DFA cache capacity, L = backtrack limit

Why backreferences can't be O(n)

Matching with backreferences is NP-complete, proven via reduction from 3-SAT. This means every engine that supports backreferences — PCRE2, Java, .NET, Python — uses exponential-time backtracking. There is no polynomial algorithm unless P=NP.

rexa's approach: bound the exponential with a configurable step limit. Default 1M steps. Returns ErrBacktrackLimit instead of hanging. This is strictly safer than regexp2 (unbounded) and equivalent to what PCRE2's pcre2_match() does with PCRE2_MATCH_LIMIT.

Why the lazy DFA matters

Go's stdlib uses only the Pike VM (O(n*m) per character). rexa's lazy DFA achieves O(1) per character for cached states:

  • First encounter of a character at a DFA state: compute next state via NFA epsilon closure (~O(m))
  • Subsequent encounters: single table lookup (O(1)) via 128-entry ASCII array
  • Cache bounded: max 10K states, flushed if exceeded, falls back to Pike VM if thrashing
  • Result: O(n) amortized for the vast majority of patterns and inputs

This is the same technique used by Rust's regex-automata and Google RE2. rexa is the first pure-Go implementation.

API

// Compile
re, err := rexa.Compile(`pattern`)
re := rexa.MustCompile(`pattern`)
re, err := rexa.CompileWithOptions(`pattern`, rexa.CompileOptions{
    BacktrackLimit: 1000000, // 0=default(1M), -1=unlimited
})

// Match
re.MatchString("input")              // bool
re.Match([]byte("input"))            // bool

// Find
re.FindString("input")               // string
re.FindStringSubmatch("input")       // []string
re.FindAllString("input", -1)        // []string
re.FindStringIndex("input")          // []int

// Replace
re.ReplaceAllString("in", "out")     // string
re.ReplaceAllStringFunc("in", fn)    // string

// Split
re.Split("input", -1)                // []string

// Metadata
re.SubexpNames()                     // []string
re.SubexpIndex("name")               // int
re.NumSubexp()                       // int
re.LiteralPrefix()                   // (string, bool)
re.String()                          // string

// Utilities
rexa.QuoteMeta("a.b")               // "a\\.b"
rexa.MatchString(`\d+`, "123")       // (bool, error)

Project Structure

rexa/
  rexa.go              Public API (drop-in regexp replacement)
  syntax/              Lexer + recursive descent parser → AST
  compiler/            Thompson's construction → instruction program
  engine/
    meta.go            Multi-engine dispatcher
    literal.go         Boyer-Moore literal scanner
    lazydfa.go         Lazy DFA with bounded cache + ASCII table
    pikevm.go          Pike VM (Thompson NFA + captures)
    backtrack.go       Bounded backtracking engine
    machine.go         sync.Pool-based machine allocation
  internal/
    bitset/            Bit vector for NFA state dedup
    bm/                Boyer-Moore string search
    runeutil/          Unicode helpers

Versioning

rexa follows Semantic Versioning:

v0.x.y  — Development: API may change between minor versions
v1.0.0  — Stable: regexp.Regexp API compatibility guaranteed
v1.x.y  — Backwards-compatible additions and performance improvements
v2.0.0  — Breaking changes (if ever needed)

Current: v0.1.0 — All features implemented, API stable, beating stdlib on all benchmarks. Moving to v1.0.0 after community testing.

To pin a version:

go get github.com/himclix/rexa@v0.1.0

To upgrade:

go get -u github.com/himclix/rexa

To downgrade:

go get github.com/himclix/rexa@v0.1.0

See CHANGELOG.md for detailed release notes.

License

MIT

Acknowledgments

Documentation

Overview

Package rexa is a high-performance, pure-Go regex engine with PCRE-like features.

rexa is a drop-in replacement for Go's standard regexp package that delivers 2-50x faster matching through a multi-engine architecture while adding lookaheads, lookbehinds, backreferences, and bounded backtracking that the standard library lacks.

Quick Start

re := rexa.MustCompile(`\d+`)
fmt.Println(re.FindString("abc123")) // "123"

Drop-in Replacement

Every method on regexp.Regexp exists on rexa.Regexp with the same signature:

// Before
import "regexp"
re := regexp.MustCompile(`pattern`)

// After — zero code changes needed
import rexa "github.com/himclix/rexa"
re := rexa.MustCompile(`pattern`)

PCRE Features

Features not available in Go's standard regexp package:

// Lookahead
rexa.MustCompile(`foo(?=bar)`).FindString("foobar") // "foo"

// Negative lookahead
rexa.MustCompile(`foo(?!bar)`).FindString("foobaz") // "foo"

// Lookbehind
rexa.MustCompile(`(?<=@)\w+`).FindString("user@domain") // "domain"

// Backreference
rexa.MustCompile(`(\w+)\s+\1`).MatchString("hello hello") // true

Bounded Backtracking

Patterns with backreferences use a bounded backtracking engine that returns ErrBacktrackLimit instead of hanging on pathological input:

re, _ := rexa.CompileWithOptions(`(a+)+\1`, rexa.CompileOptions{
    BacktrackLimit: 100000,
})

Architecture

rexa selects the fastest engine for each pattern at compile time:

  • Literal Scanner (Boyer-Moore): O(n/m) for pure literal patterns
  • Lazy DFA: O(n) amortized for simple regex patterns
  • Pike VM: O(n*m) for patterns with captures or anchors
  • Bounded Backtracker: for backreferences and lookarounds
Example (Lookahead)
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`\w+(?=\d)`)
	fmt.Println(re.FindString("abc1"))
}
Output:
abc
Example (Lookbehind)
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`(?<=@)\w+`)
	fmt.Println(re.FindString("user@domain"))
}
Output:
domain
Example (NamedGroups)
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`)
	match := re.FindStringSubmatch("2026-06-01")
	fmt.Println(match[re.SubexpIndex("year")])
	fmt.Println(match[re.SubexpIndex("month")])
	fmt.Println(match[re.SubexpIndex("day")])
}
Output:
2026
06
01

Index

Examples

Constants

View Source
const DefaultBacktrackLimit = 1_000_000

Variables

View Source
var ErrBacktrackLimit = errors.New("rexa: backtrack limit exceeded")

Functions

func Match

func Match(pattern string, b []byte) (bool, error)

func MatchReader

func MatchReader(pattern string, r io.RuneReader) (bool, error)

func MatchString

func MatchString(pattern string, s string) (bool, error)

func QuoteMeta

func QuoteMeta(s string) string

Types

type CompileOptions

type CompileOptions struct {
	BacktrackLimit int64        // 0 = default (1M), -1 = unlimited
	CacheCapacity  int          // lazy DFA cache size, 0 = default
	Flags          syntax.Flags // initial flags
}

type Regexp

type Regexp struct {
	// contains filtered or unexported fields
}

func Compile

func Compile(expr string) (*Regexp, error)

func CompileWithOptions

func CompileWithOptions(expr string, opts CompileOptions) (*Regexp, error)
Example
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re, _ := rexa.CompileWithOptions(`(\w+)\s+\1`, rexa.CompileOptions{
		BacktrackLimit: 100000,
	})
	fmt.Println(re.MatchString("hello hello"))
	fmt.Println(re.MatchString("hello world"))
}
Output:
true
false

func MustCompile

func MustCompile(expr string) *Regexp
Example
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`\d+`)
	fmt.Println(re.FindString("abc123def"))
}
Output:
123

func MustCompileWithOptions

func MustCompileWithOptions(expr string, opts CompileOptions) *Regexp

func (*Regexp) Copy

func (re *Regexp) Copy() *Regexp

func (*Regexp) EngineUsed

func (re *Regexp) EngineUsed() string

func (*Regexp) Expand

func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte

func (*Regexp) ExpandString

func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

func (*Regexp) Find

func (re *Regexp) Find(b []byte) []byte

func (*Regexp) FindAll

func (re *Regexp) FindAll(b []byte, n int) [][]byte

func (*Regexp) FindAllIndex

func (re *Regexp) FindAllIndex(b []byte, n int) [][]int

func (*Regexp) FindAllString

func (re *Regexp) FindAllString(s string, n int) []string
Example
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`\d+`)
	fmt.Println(re.FindAllString("a1b22c333", -1))
}
Output:
[1 22 333]

func (*Regexp) FindAllStringIndex

func (re *Regexp) FindAllStringIndex(s string, n int) [][]int

func (*Regexp) FindAllStringSubmatch

func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string

func (*Regexp) FindAllStringSubmatchIndex

func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int

func (*Regexp) FindAllSubmatch

func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte

func (*Regexp) FindAllSubmatchIndex

func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int

func (*Regexp) FindIndex

func (re *Regexp) FindIndex(b []byte) []int

func (*Regexp) FindReaderIndex

func (re *Regexp) FindReaderIndex(r io.RuneReader) []int

func (*Regexp) FindReaderSubmatchIndex

func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int

func (*Regexp) FindString

func (re *Regexp) FindString(s string) string

func (*Regexp) FindStringIndex

func (re *Regexp) FindStringIndex(s string) []int

func (*Regexp) FindStringSubmatch

func (re *Regexp) FindStringSubmatch(s string) []string
Example
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`(\w+)@(\w+)\.(\w+)`)
	match := re.FindStringSubmatch("user@example.com")
	fmt.Println(match[0]) // full match
	fmt.Println(match[1]) // user
	fmt.Println(match[2]) // example
	fmt.Println(match[3]) // com
}
Output:
user@example.com
user
example
com

func (*Regexp) FindStringSubmatchIndex

func (re *Regexp) FindStringSubmatchIndex(s string) []int

func (*Regexp) FindSubmatch

func (re *Regexp) FindSubmatch(b []byte) [][]byte

func (*Regexp) FindSubmatchIndex

func (re *Regexp) FindSubmatchIndex(b []byte) []int

func (*Regexp) LiteralPrefix

func (re *Regexp) LiteralPrefix() (prefix string, complete bool)

func (*Regexp) Longest

func (re *Regexp) Longest()

func (*Regexp) MarshalText

func (re *Regexp) MarshalText() ([]byte, error)

func (*Regexp) Match

func (re *Regexp) Match(b []byte) bool

func (*Regexp) MatchReader

func (re *Regexp) MatchReader(r io.RuneReader) bool

func (*Regexp) MatchString

func (re *Regexp) MatchString(s string) bool

func (*Regexp) NumSubexp

func (re *Regexp) NumSubexp() int

func (*Regexp) ReplaceAll

func (re *Regexp) ReplaceAll(src, repl []byte) []byte

func (*Regexp) ReplaceAllFunc

func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte

func (*Regexp) ReplaceAllLiteral

func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte

func (*Regexp) ReplaceAllLiteralString

func (re *Regexp) ReplaceAllLiteralString(src, repl string) string

func (*Regexp) ReplaceAllString

func (re *Regexp) ReplaceAllString(src, repl string) string
Example
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`\d+`)
	fmt.Println(re.ReplaceAllString("a1b22c333", "N"))
}
Output:
aNbNcN

func (*Regexp) ReplaceAllStringFunc

func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

func (*Regexp) Split

func (re *Regexp) Split(s string, n int) []string
Example
package main

import (
	"fmt"

	"github.com/himclix/rexa"
)

func main() {
	re := rexa.MustCompile(`\s+`)
	fmt.Println(re.Split("hello  world foo", -1))
}
Output:
[hello world foo]

func (*Regexp) String

func (re *Regexp) String() string

func (*Regexp) SubexpIndex

func (re *Regexp) SubexpIndex(name string) int

func (*Regexp) SubexpNames

func (re *Regexp) SubexpNames() []string

func (*Regexp) UnmarshalText

func (re *Regexp) UnmarshalText(text []byte) error

Directories

Path Synopsis
internal
bm

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL