regrams

package module
v0.0.0-...-813af43 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 10, 2023 License: MIT Imports: 9 Imported by: 0

README

regrams

regrams is a go package that implements an engine for text search by regular expression queries. Given a collection of documents that are indexed by trigrams and a regular expression, regrams can generate a boolean query for the trigram index that retrieves good candidates that might match the regular expression.

For example, regrams.MakeQuery("a(bc)+d") returns (abc) (bcb|bcd), which means that any document that matches the regular expression a(bc)+d must contain the trigram abc as well as one of bcb or bcd.

regrams is inspired by Google's codesearch repo and Russ Cox's corresponding blog post explaining the implementation. Instead of using the structure of the parsed regular expression to generate queries, however, regrams creates an NFA from the query with nodes weighted by the size of the reachable trigram sets and repeatedly extracts disjunctions from the NFA by identifying and removing minimum-weight graph cuts. I wrote a post on regular expression search via graph cuts that gives a lot more detail about the technique.

The easiest way to try out this package is to run the command-line wrapper included in this repo:

$ go run cmd/regrams/main.go abcd
(abc) (bcd)
$ go run cmd/regrams/main.go 'Time: [0-9]+ ms'
( ms) (: 0|: 1|: 2|: 3|: 4|: 5|: 6|: 7|: 8|: 9) (Tim) (e: ) (ime) (me:)
$ go run cmd/regrams/main.go '(abc*)+de'
(aba|abc|abd) (bab|bca|bcc|bcd|bde)
$ go run cmd/regrams/main.go '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)

Running go run cmd/regrams/main.go --help will give you more options.

Alternatively, you can try out examples in a browser at the Go Playground.

To use regrams as a package, import github.com/aaw/regrams, then call regrams.MakeQuery in your code to generate a trigram query from a string representing a regular expression. MakeQuery returns a slice-of-slices of strings, which should be interpreted as a boolean query in conjunctive normal form: the slice [["abc"], ["xyz","xyw"], ["xxx", "yyy", "zzz"]] represents the query abc AND (xyz OR xyw) AND (xxx OR yyy OR zzz).

Documentation

Overview

regrams parses regular expressions into trigram queries in conjunctive normal form. To build a search engine that accepts regular expression queries, index all text documents by trigrams, then use the MakeQuery method provided in this package to transform any regular expression query into a query over the indexed trigrams.

For example, MakeQuery("abc+(d|e)") returns [["abc"], ["bcc","bcd","bce"]], which represents the trigram query "(abc) AND (bcc OR bcd OR bce)"

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Only use this if you want to tune config values in query processing like how large a trigram set to consider, how large an NFA to consider, etc. If you use MakeQuery instead of creating an engine e and running e.MakeQuery, it'll populate all these config values with some reasonable starting points.

func NewEngine

func NewEngine(maxCharClassSize int, maxTrigramSetSize int, maxNFANodes int) Engine

func (Engine) MakeQuery

func (e Engine) MakeQuery(r string) (Query, error)

Make a regrams.Query from a string representation of a regexp.

type Query

type Query [][]string

A trigram query. These are always in conjunctive normal form: an AND of a bunch of ORs. The Query [["abc", "xbc"], ["xxx"]], for example, represents the trigram query (abc OR xbc) AND (xxx).

func MakeQuery

func MakeQuery(r string) (Query, error)

func (Query) String

func (q Query) String() string

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL