tokenizer

package
v0.0.0-...-b1a8da6
Published: Jul 3, 2020 License: Apache-2.0 Imports: 3 Imported by: 0

Documentation

Overview

Go port of https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb

In their words:

# Generic programming language tokenizer.
#
# Tokens are designed for use in the language bayes classifier.
# It strips any data strings or comments and preserves significant
# language symbols.

Index

Constants

This section is empty.

Variables

var (
	// Maximum input length for Tokenize()
	ByteLimit = 100000

	// NOTE(tso): these string slices are turned into their regexp slice counterparts
	// by this package's init() function.
	StartLineComments = []string{
		"\"",
		"%",
	}
	SingleLineComments = []string{
		"//",
		"--",
		"#",
	}
	MultiLineComments = [][]string{
		[]string{"/*", "*/"},
		[]string{"<!--", "-->"},
		[]string{"{-", "-}"},
		[]string{"(*", "*)"},
		[]string{`"""`, `"""`},
		[]string{"'''", "'''"},
		[]string{"#`(", ")"},
	}
	StartLineComment       []*regexp.Regexp
	BeginSingleLineComment []*regexp.Regexp
	BeginMultiLineComment  []*regexp.Regexp
	EndMultiLineComment    []*regexp.Regexp
	String                 = regexp.MustCompile(`[^\\]*(["'` + "`])")
	Shebang                = regexp.MustCompile(`#!.*$`)
	Number                 = regexp.MustCompile(`(0x[0-9a-f]([0-9a-f]|\.)*|\d(\d|\.)*)([uU][lL]{0,2}|([eE][-+]\d*)?[fFlL]*)`)
)
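
The init() that compiles these string slices into their regexp counterparts is not shown on this page. A minimal sketch of what that compilation could look like, assuming each literal marker is escaped with regexp.QuoteMeta and start-of-line markers are anchored with ^, is:

// Sketch only: the package's real init() is not displayed here.
// Assumes literal markers are escaped with regexp.QuoteMeta and
// start-of-line comment markers are anchored to the line start.
func init() {
	for _, s := range StartLineComments {
		StartLineComment = append(StartLineComment,
			regexp.MustCompile(`^`+regexp.QuoteMeta(s)))
	}
	for _, s := range SingleLineComments {
		BeginSingleLineComment = append(BeginSingleLineComment,
			regexp.MustCompile(regexp.QuoteMeta(s)))
	}
	for _, pair := range MultiLineComments {
		BeginMultiLineComment = append(BeginMultiLineComment,
			regexp.MustCompile(regexp.QuoteMeta(pair[0])))
		EndMultiLineComment = append(EndMultiLineComment,
			regexp.MustCompile(regexp.QuoteMeta(pair[1])))
	}
}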

Functions

func FindMultiLineComment

func FindMultiLineComment(token []byte) (matched bool, terminator *regexp.Regexp)

If the given token matches the start of a multi-line comment, FindMultiLineComment returns true and a regexp that matches the corresponding closing token; otherwise it returns false and nil.
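
A usage sketch follows; the module's import path is abbreviated on this page, so the path below is a placeholder, and the expected results follow from the MultiLineComments table above:

package main

import (
	"fmt"

	// Placeholder import path; the real module path is elided on this page.
	"example.com/tokenizer"
)

func main() {
	// "/*" appears in MultiLineComments, so a match is expected.
	matched, terminator := tokenizer.FindMultiLineComment([]byte("/*"))
	if matched {
		fmt.Println(terminator.MatchString("end of comment */")) // expected: true
	}

	// An ordinary word is not a comment opener.
	matched, _ = tokenizer.FindMultiLineComment([]byte("func"))
	fmt.Println(matched) // expected: false
}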

func Tokenize

func Tokenize(input []byte) (tokens []string)

Tokenize is a simple tokenizer that uses bufio.Scanner to process lines and individual words, matching them against regular expressions to filter out comments, strings, and numerals, in a manner very similar to GitHub's Linguist (see https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb).

The intention is merely to retrieve significant tokens from a piece of source code in order to identify the programming language using statistical analysis, and NOT to serve as any part of a compilation process whatsoever.

NOTE(tso): The tokens produced by this function may be of a dubious quality due to the approach taken. Feedback and alternate implementations welcome :)
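
A minimal usage sketch, again with a placeholder import path; the exact tokens returned depend on the implementation, but comments, string contents, and numerals should be absent from the result:

package main

import (
	"fmt"

	// Placeholder import path; the real module path is elided on this page.
	"example.com/tokenizer"
)

func main() {
	src := []byte(`#!/usr/bin/env ruby
# prints a greeting
puts "hello" # 42
`)
	// Shebangs, comments, string data, and numbers are stripped; what remains
	// are the significant language tokens used for classification.
	tokens := tokenizer.Tokenize(src)
	fmt.Println(tokens)
}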

Types

This section is empty.
