tokenizer

package
v0.0.0-...-b1a8da6
Published: Jul 3, 2020 License: Apache-2.0 Imports: 3 Imported by: 0

Documentation

Overview

Go port of https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb

In their words:

# Generic programming language tokenizer.
#
# Tokens are designed for use in the language bayes classifier.
# It strips any data strings or comments and preserves significant
# language symbols.

Index

Constants

This section is empty.

Variables

var (
	// Maximum input length for Tokenize()
	ByteLimit = 100000

	// NOTE(tso): these string slices are turned into their regexp slice counterparts
	// by this package's init() function.
	StartLineComments = []string{
		"\"",
		"%",
	}
	SingleLineComments = []string{
		"//",
		"--",
		"#",
	}
	MultiLineComments = [][]string{
		[]string{"/*", "*/"},
		[]string{"<!--", "-->"},
		[]string{"{-", "-}"},
		[]string{"(*", "*)"},
		[]string{`"""`, `"""`},
		[]string{"'''", "'''"},
		[]string{"#`(", ")"},
	}
	StartLineComment       []*regexp.Regexp
	BeginSingleLineComment []*regexp.Regexp
	BeginMultiLineComment  []*regexp.Regexp
	EndMultiLineComment    []*regexp.Regexp
	String                 = regexp.MustCompile(`[^\\]*(["'` + "`])")
	Shebang                = regexp.MustCompile(`#!.*$`)
	Number                 = regexp.MustCompile(`(0x[0-9a-f]([0-9a-f]|\.)*|\d(\d|\.)*)([uU][lL]{0,2}|([eE][-+]\d*)?[fFlL]*)`)
)
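
The init() that compiles these string slices into their regexp counterparts is not shown on this page. A minimal sketch of what that compilation could look like, assuming each literal marker is escaped with regexp.QuoteMeta and start-of-line markers are anchored with ^, is:

// Sketch only: the package's real init() is not displayed here.
// Assumes literal markers are escaped with regexp.QuoteMeta and
// start-of-line comment markers are anchored to the line start.
func init() {
	for _, s := range StartLineComments {
		StartLineComment = append(StartLineComment,
			regexp.MustCompile(`^`+regexp.QuoteMeta(s)))
	}
	for _, s := range SingleLineComments {
		BeginSingleLineComment = append(BeginSingleLineComment,
			regexp.MustCompile(regexp.QuoteMeta(s)))
	}
	for _, pair := range MultiLineComments {
		BeginMultiLineComment = append(BeginMultiLineComment,
			regexp.MustCompile(regexp.QuoteMeta(pair[0])))
		EndMultiLineComment = append(EndMultiLineComment,
			regexp.MustCompile(regexp.QuoteMeta(pair[1])))
	}
}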

Functions

func FindMultiLineComment

func FindMultiLineComment(token []byte) (matched bool, terminator *regexp.Regexp)

If the given token matches the start of a multi-line comment, FindMultiLineComment returns true and a regexp that matches the corresponding closing token; otherwise it returns false and nil.
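
A usage sketch follows; the module's import path is abbreviated on this page, so the path below is a placeholder, and the expected results follow from the MultiLineComments table above:

package main

import (
	"fmt"

	// Placeholder import path; the real module path is elided on this page.
	"example.com/tokenizer"
)

func main() {
	// "/*" appears in MultiLineComments, so a match is expected.
	matched, terminator := tokenizer.FindMultiLineComment([]byte("/*"))
	if matched {
		fmt.Println(terminator.MatchString("end of comment */")) // expected: true
	}

	// An ordinary word is not a comment opener.
	matched, _ = tokenizer.FindMultiLineComment([]byte("func"))
	fmt.Println(matched) // expected: false
}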

func Tokenize

func Tokenize(input []byte) (tokens []string)

Tokenize is a simple tokenizer that uses bufio.Scanner to process lines and individual words, matching them against regular expressions to filter out comments, strings, and numerals, in a manner very similar to GitHub's Linguist (see https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb).

The intention is merely to retrieve significant tokens from a piece of source code in order to identify the programming language using statistical analysis, and NOT to serve as any part of a compilation process whatsoever.

NOTE(tso): The tokens produced by this function may be of a dubious quality due to the approach taken. Feedback and alternate implementations welcome :)
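
A minimal usage sketch, again with a placeholder import path; the exact tokens returned depend on the implementation, but comments, string contents, and numerals should be absent from the result:

package main

import (
	"fmt"

	// Placeholder import path; the real module path is elided on this page.
	"example.com/tokenizer"
)

func main() {
	src := []byte(`#!/usr/bin/env ruby
# prints a greeting
puts "hello" # 42
`)
	// Shebangs, comments, string data, and numbers are stripped; what remains
	// are the significant language tokens used for classification.
	tokens := tokenizer.Tokenize(src)
	fmt.Println(tokens)
}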

Types

This section is empty.
