regexp2

v1.4.0
Published: Oct 7, 2020 License: MIT Imports: 10 Imported by: 608

README

Regexp2 is a feature-rich RegExp engine for Go. It doesn't have the linear-time guarantees of the built-in regexp package, but it allows backtracking and is compatible with Perl5 and .NET. You'll likely be better off with the RE2 engine from the regexp package and should only use this if you need to write very complex patterns or require compatibility with .NET.

Basis of the engine

The engine is ported from the .NET framework's System.Text.RegularExpressions.Regex engine. That engine was open sourced in 2015 under the MIT license. There are some fundamental differences between .NET strings and Go strings that required a bit of borrowing from the Go standard library regexp engine as well. I cleaned up a couple of the dirtier bits during the port (regexcharclass.cs was terrible), but the parse tree, the code emitted, and therefore the patterns matched should be identical.

Installing

This is a go-gettable library, so install is easy:

go get github.com/dlclark/regexp2/...

Usage

Usage is similar to the Go regexp package. Just like in regexp, you start by converting a regex into a state machine via the Compile or MustCompile functions. They ultimately do the same thing, but MustCompile will panic if the regex is invalid, whereas Compile returns an error. You can then use the returned Regexp struct to find matches repeatedly. A Regexp struct is safe to use across goroutines.

re := regexp2.MustCompile(`Your pattern`, 0)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

The only error that the *Match* methods should return is a Timeout if you set the re.MatchTimeout field. Any other error is a bug in the regexp2 package. If you need more details about capture groups in a match then use the FindStringMatch method, like so:

if m, _ := re.FindStringMatch(`Something to match`); m != nil {
    // the whole match is always group 0
    fmt.Printf("Group 0: %v\n", m.String())

    // you can get all the groups too
    gps := m.Groups()

    // a group can be captured multiple times, so each cap is separately addressable
    fmt.Printf("Group 1, first capture: %v\n", gps[1].Captures[0].String())
    fmt.Printf("Group 1, second capture: %v\n", gps[1].Captures[1].String())
}

Group 0 is an automatically-assigned group that encompasses the whole pattern, and it is embedded in the Match. This means that m.String() is the same as m.Group.String() and m.Groups()[0].String().

The last capture is embedded in each group, so g.String() will return the same thing as g.Capture.String() and g.Captures[len(g.Captures)-1].String().

Compare regexp and regexp2

| Category | regexp | regexp2 |
| --- | --- | --- |
| Catastrophic backtracking possible | no, linear execution time guarantees | yes, if your pattern is at risk you can use the re.MatchTimeout field |
| Python-style capture groups (?P<name>re) | yes | no (yes in RE2 compat mode) |
| .NET-style capture groups (?<name>re) or (?'name're) | no | yes |
| comments (?#comment) | no | yes |
| branch numbering reset (?\|a\|b) | no | no |
| possessive match (?>re) | no | yes |
| positive lookahead (?=re) | no | yes |
| negative lookahead (?!re) | no | yes |
| positive lookbehind (?<=re) | no | yes |
| negative lookbehind (?<!re) | no | yes |
| back reference \1 | no | yes |
| named back reference \k'name' | no | yes |
| named ascii character class [[:foo:]] | yes | no (yes in RE2 compat mode) |
| conditionals (?(expr)yes\|no) | no | yes |
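Some of the "no" entries in the regexp column can be observed directly, because the standard library rejects unsupported syntax at compile time. The sketch below uses only the standard regexp package; the helper name stdlibAccepts and the sample patterns are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// stdlibAccepts (hypothetical helper) reports whether the standard library
// regexp package can compile the given pattern.
func stdlibAccepts(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?P<name>a)`, // Python-style named group
		`a(?=b)`,      // positive lookahead
		`(?<!a)b`,     // negative lookbehind
		`(a)\1`,       // back reference
	}
	for _, p := range patterns {
		fmt.Printf("%-12s accepted by regexp: %v\n", p, stdlibAccepts(p))
	}
}
```

Patterns that fail here are the ones that would need regexp2.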

RE2 compatibility mode

The default behavior of regexp2 is to match the .NET regexp engine. The RE2 option changes the parsing to increase compatibility with RE2. Using the RE2 option when compiling a regexp will not take away any features, but it will change the following behaviors:

  • add support for named ascii character classes (e.g. [[:foo:]])
  • add support for Python-style capture groups (e.g. (?P<name>re))
  • change singleline behavior for $ to only match end of string (like RE2) (see #24)

re := regexp2.MustCompile(`Your RE2-compatible pattern`, regexp2.RE2)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

This feature is a work in progress and I'm open to ideas for more things to put here (maybe more relaxed character escaping rules?).

Library features that I'm still working on

  • Regex split

Potential bugs

I've run a battery of tests against regexp2 from various sources and found the debug output matches the .NET engine, but .NET and Go handle strings very differently. I've attempted to handle these differences, but most of my testing deals with basic ASCII with a little bit of multi-byte Unicode. There's a chance that there are bugs in the string handling related to character sets with supplementary Unicode chars. Right-to-Left support is coded, but not well tested either.

Find a bug?

I'm open to new issues and pull requests with tests if you find something odd!

Documentation

Overview

Package regexp2 is a regexp package that has an interface similar to Go's standard library regexp engine but uses a more full-featured regex engine behind the scenes.

It doesn't have linear-time guarantees, but it allows backtracking and is compatible with Perl5 and .NET. You'll likely be better off with the RE2 engine from the regexp package and should only use this if you need to write very complex patterns or require compatibility with .NET.

Constants

const (
	None                    RegexOptions = 0x0
	IgnoreCase                           = 0x0001 // "i"
	Multiline                            = 0x0002 // "m"
	ExplicitCapture                      = 0x0004 // "n"
	Compiled                             = 0x0008 // "c"
	Singleline                           = 0x0010 // "s"
	IgnorePatternWhitespace              = 0x0020 // "x"
	RightToLeft                          = 0x0040 // "r"
	Debug                                = 0x0080 // "d"
	ECMAScript                           = 0x0100 // "e"
	RE2                                  = 0x0200 // RE2 (regexp package) compatibility mode
)

Variables

var DefaultMatchTimeout = time.Duration(math.MaxInt64)

Default timeout used when running regexp matches -- "forever"

Functions

func Escape

func Escape(input string) string

Escape adds backslashes to any special characters in the input string

func Unescape

func Unescape(input string) (string, error)

Unescape removes any backslashes from previously-escaped special characters in the input string

Types

type Capture

type Capture struct {

	// the position in the original string where the first character of
	// captured substring was found.
	Index int
	// the length of the captured substring.
	Length int
	// contains filtered or unexported fields
}

Capture is a single capture of text within the larger original string

func (*Capture) Runes

func (c *Capture) Runes() []rune

Runes returns the captured text as a rune slice

func (*Capture) String

func (c *Capture) String() string

String returns the captured text as a string

type Group

type Group struct {
	Capture // the last capture of this group is embedded for ease of use

	Name     string    // group name
	Captures []Capture // captures of this group
}

Group is an explicit or implicit (group 0) matched group within the pattern

type Match

type Match struct {
	Group // embedded group 0
	// contains filtered or unexported fields
}

Match is a single regex result match that contains groups and repeated captures:

	Match
	  - Groups
	    - Captures

func (*Match) GroupByName

func (m *Match) GroupByName(name string) *Group

GroupByName returns a group based on the name of the group, or nil if the group name does not exist

func (*Match) GroupByNumber

func (m *Match) GroupByNumber(num int) *Group

GroupByNumber returns a group based on the number of the group, or nil if the group number does not exist

func (*Match) GroupCount

func (m *Match) GroupCount() int

GroupCount returns the number of groups this match has matched

func (*Match) Groups

func (m *Match) Groups() []Group

Groups returns all the capture groups, starting with group 0 (the full match)

type MatchEvaluator added in v1.1.0

type MatchEvaluator func(Match) string

MatchEvaluator is a function that takes a match and returns a replacement string to be used

type RegexOptions

type RegexOptions int32

RegexOptions impact the runtime and parsing behavior for each specific regex. They are settable in code as well as in the regex pattern itself.

type Regexp

type Regexp struct {
	//timeout when trying to find matches
	MatchTimeout time.Duration
	// contains filtered or unexported fields
}

Regexp is the representation of a compiled regular expression. A Regexp is safe for concurrent use by multiple goroutines.

func Compile

func Compile(expr string, opt RegexOptions) (*Regexp, error)

Compile parses a regular expression and returns, if successful, a Regexp object that can be used to match against text.

func MustCompile

func MustCompile(str string, opt RegexOptions) *Regexp

MustCompile is like Compile but panics if the expression cannot be parsed. It simplifies safe initialization of global variables holding compiled regular expressions.

func (*Regexp) Debug

func (re *Regexp) Debug() bool

func (*Regexp) FindNextMatch

func (re *Regexp) FindNextMatch(m *Match) (*Match, error)

FindNextMatch returns the next match in the same input string as the match parameter. Will return nil if there is no next match or if given a nil match.

func (*Regexp) FindRunesMatch

func (re *Regexp) FindRunesMatch(r []rune) (*Match, error)

FindRunesMatch searches the input rune slice for a Regexp match

func (*Regexp) FindRunesMatchStartingAt

func (re *Regexp) FindRunesMatchStartingAt(r []rune, startAt int) (*Match, error)

FindRunesMatchStartingAt searches the input rune slice for a Regexp match starting at the startAt index

func (*Regexp) FindStringMatch

func (re *Regexp) FindStringMatch(s string) (*Match, error)

FindStringMatch searches the input string for a Regexp match

func (*Regexp) FindStringMatchStartingAt

func (re *Regexp) FindStringMatchStartingAt(s string, startAt int) (*Match, error)

FindStringMatchStartingAt searches the input string for a Regexp match starting at the startAt index

func (*Regexp) GetGroupNames

func (re *Regexp) GetGroupNames() []string

GetGroupNames returns the set of strings used to name capturing groups in the expression.

func (*Regexp) GetGroupNumbers

func (re *Regexp) GetGroupNumbers() []int

GetGroupNumbers returns the integer group numbers corresponding to a group name.

func (*Regexp) GroupNameFromNumber

func (re *Regexp) GroupNameFromNumber(i int) string

GroupNameFromNumber retrieves a group name that corresponds to a group number. It will return "" for an unknown group number. Unnamed groups automatically receive a name that is the decimal string equivalent of their number.

func (*Regexp) GroupNumberFromName

func (re *Regexp) GroupNumberFromName(name string) int

GroupNumberFromName returns a group number that corresponds to a group name. Returns -1 if the name is not a recognized group name. Numbered groups automatically get a group name that is the decimal string equivalent of its number.

func (*Regexp) MatchRunes

func (re *Regexp) MatchRunes(r []rune) (bool, error)

MatchRunes returns true if the runes match the regex. The error will be set if a timeout occurs.

func (*Regexp) MatchString

func (re *Regexp) MatchString(s string) (bool, error)

MatchString returns true if the string matches the regex. The error will be set if a timeout occurs.

func (*Regexp) Replace

func (re *Regexp) Replace(input, replacement string, startAt, count int) (string, error)

Replace searches the input string and replaces each match found with the replacement text. Count will limit the number of matches attempted and startAt will allow us to skip past possible matches at the start of the input (left or right depending on RightToLeft option). Set startAt and count to -1 to go through the whole string.

func (*Regexp) ReplaceFunc added in v1.1.0

func (re *Regexp) ReplaceFunc(input string, evaluator MatchEvaluator, startAt, count int) (string, error)

ReplaceFunc searches the input string and replaces each match found using the string from the evaluator. Count will limit the number of matches attempted and startAt will allow us to skip past possible matches at the start of the input (left or right depending on RightToLeft option). Set startAt and count to -1 to go through the whole string.

func (*Regexp) RightToLeft

func (re *Regexp) RightToLeft() bool

func (*Regexp) String

func (re *Regexp) String() string

String returns the source text used to compile the regular expression.
