symbolizer

package module
v0.2.1
Published: Apr 18, 2023 License: Apache-2.0 Imports: 6 Imported by: 0

README

Symbolizer 🔣


A Go Package for Parsing Simple Symbols

Overview

This package is designed for parsing very simple symbols, not large files or multi-file directories. It exposes a Parser type, constructed with NewParser from the string input to be parsed along with optional ParserOption functions that modify its behaviour.

Installation
go get github.com/manishmeganathan/symbolizer
Token Model

The Token type in this package contains the TokenKind, the literal value as a string, as well as the start position of the token. TokenKind values are pseudo-runes: values above 0 represent unicode code points, while special tokens for literals and control are represented as negative variants with values extending below 0. The set can be extended with custom variants, but be mindful of collisions.

// TokenKind is an enum for representing token grouping/values.
// For unicode tokens, the TokenKind is equal to its code point value.
// For literals such as identifiers and numerics, the TokenKind values descend from 0.
// 
// Note: Custom TokenKind values can be used by external packages for keyword detection
// for special literals, but these values should be below -10 to prevent collisions
type TokenKind int32

const (
	TokenEoF TokenKind = -(iota + 1)
	TokenMalformed

	TokenIdent
	TokenNumber
	TokenString
	TokenBoolean
	TokenHexNumber
)

// Token represents a lexical Token.
// It may be either a lone unicode character or some literal value
type Token struct {
	Kind     TokenKind
	Literal  string
	Position int
}
Usage Examples
// TokenIterator describes a routine that leverages the Token inspection methods of Parser
// to view the current Token and check if the current Token is of a specific TokenKind.
func TokenIterator() {
    symbol := "map[string]string"
    parser := symbolizer.NewParser(symbol)
  
    // Check if the cursor has reached the end of the symbol
    for !parser.IsCursor(symbolizer.TokenEoF) { 
        // Print the current token
        fmt.Println(parser.Cursor()) 
        // Advance the cursor and ingest the next token
        parser.Advance()
    }

    // Output:
    // {<ident> map 0}
    // {<unicode:'['> [ 3}
    // {<ident> string 4}
    // {<unicode:']'> ] 10}
    // {<ident> string 11}
}
// TokenLookAhead describes a routine that leverages the Token look ahead methods of Parser
// to view the next Token without ingesting it. ExpectPeek can be used to move the parser
// if the next token is of a specific kind.
func TokenLookAhead() {
    symbol := "[32]string"
    parser := symbolizer.NewParser(symbol)

    // Print the current and next token
    fmt.Println(parser.Cursor())
    fmt.Println(parser.Peek())
    
    // Check if the next token is a numeric
    // If it is, then the parse cursor is moved forward
    if parser.ExpectPeek(symbolizer.TokenNumber) {
        fmt.Println("numeric encountered")
    }
    
    // Print the current and next token after the peek expectation
    fmt.Println(parser.Cursor())
    fmt.Println(parser.Peek())

    // Output:
    // {<unicode:'['> [ 0}
    // {<num> 32 1}
    // numeric encountered
    // {<num> 32 1}
    // {<unicode:']'> ] 3}
}
// CustomSymbols describes a routine that injects some custom keywords and token kinds
// into the parser, which can then be used to inspect tokens just as would regular TokenKind variants.
func CustomSymbols() {
    symbol := "map[string]string"
	
    // Define a custom token kind enum
    type MyTokenKind int32
    const Datatype = -10
    
    // Defines a mapping of identifier to custom token kinds
    keywords := map[string]symbolizer.TokenKind{"map": Datatype, "string": Datatype}
    // Create a Parser with a ParserOption that injects the custom keywords
    parser := symbolizer.NewParser(symbol, symbolizer.Keywords(keywords))
    
    // Check if the cursor has reached the end of the symbol
    for !parser.IsCursor(symbolizer.TokenEoF) {
        // Print the current token
        fmt.Println(parser.Cursor())
        // Advance the cursor and ingest the next token
        parser.Advance()
    }

    // Output:
    // {<custom:-10> map 0}
    // {<unicode:'['> [ 3}
    // {<custom:-10> string 4}
    // {<unicode:']'> ] 10}
    // {<custom:-10> string 11}
}
// SymbolSplit describes a routine that splits a symbol into sub-strings (sub-symbols) based on
// some delimiter rune. Use a whitespace-ignorant Parser if whitespace should be ignored while splitting.
func SymbolSplit() {
    symbol := "23, 56, 8902342"

    // Create a Parser with a ParserOption to ignore whitespaces
    parser := symbolizer.NewParser(symbol, symbolizer.IgnoreWhitespaces())
    // Split the parser contents with the comma delimiter
    components := parser.Split(',')

    // Print all the components
    for _, component := range components {
        fmt.Println(component)
    }

    // Output: 
    // 23
    // 56
    // 8902342
}
// SymbolUnwrap describes a routine that unwraps an inner sub-string (inner symbol) from within some 
// Enclosure, which is defined as a pair of unicode characters (cannot be the same). Unwrap can also
// handle nested contents of the same enclosure and works to resolve each opening with a closing.
func SymbolUnwrap() {
    symbol := "(outer[inner])"
    parser := symbolizer.NewParser(symbol)

    // Unwrap the symbol from within a set of parentheses
    unwrapped, err := parser.Unwrap(symbolizer.EnclosureParens())
    if err != nil {
        panic(err)
    }

    // Print the unwrapped symbol
    fmt.Println(unwrapped)

    // Output: 
    // outer[inner]
}
Notes:

This package is still a work in progress and can be heavily extended for a lot of different use cases. If you are using this package and need some new functionality, please open an issue or a pull request.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Enclosure

type Enclosure struct {
	// contains filtered or unexported fields
}

Enclosure is a tuple of unicode code points that indicate start and stop pairs. They cannot be the same.

func EnclosureAngle

func EnclosureAngle() Enclosure

EnclosureAngle returns an Enclosure set for Angle Brackets '<>'

func EnclosureCurly

func EnclosureCurly() Enclosure

EnclosureCurly returns an Enclosure set for Curly Brackets '{}'

func EnclosureParens

func EnclosureParens() Enclosure

EnclosureParens returns an Enclosure set for Parenthesis '()'

func EnclosureSquare

func EnclosureSquare() Enclosure

EnclosureSquare returns an Enclosure set for Square Brackets '[]'

func NewEnclosure

func NewEnclosure(start, stop rune) (Enclosure, error)

NewEnclosure generates a new Enclosure set and returns it. Throws an error if the start and stop code points are identical

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser is a symbol parser that parses a given string input and handles operations like unwrapping enclosed data or splitting by a given delimiter

func NewParser

func NewParser(input string, opts ...ParserOption) *Parser

NewParser generates a new Parser for a given input string and some options that modify the parser behaviour such as ignoring whitespaces or using custom keywords

func (*Parser) Advance

func (parser *Parser) Advance()

Advance moves the parser's cursor and peek tokens

func (*Parser) Cursor

func (parser *Parser) Cursor() Token

Cursor returns the current Token

func (*Parser) ExpectPeek

func (parser *Parser) ExpectPeek(t TokenKind) bool

ExpectPeek advances the cursor if the next token is of the specified TokenKind. If it is not the same type, the parser does not advance. The returned boolean indicates if the parser was advanced.

func (*Parser) IsCursor

func (parser *Parser) IsCursor(t TokenKind) bool

IsCursor checks if the current token is of the specified TokenKind.

func (*Parser) IsPeek

func (parser *Parser) IsPeek(t TokenKind) bool

IsPeek checks if the next token is of the specified TokenKind. This look ahead is performed without moving the parser's cursor

func (*Parser) Peek

func (parser *Parser) Peek() Token

Peek looks ahead and returns the next Token without advancing the parser

func (*Parser) Split

func (parser *Parser) Split(delimiter TokenKind) (splits []string)

Split attempts to split the remaining contents of the parser into a set of strings separated by the given delimiting TokenKind. This process exhausts the parser consuming all the tokens within it.

func (*Parser) Unparsed

func (parser *Parser) Unparsed() string

Unparsed returns the remaining unparsed data in the parser as a string

func (*Parser) Unwrap

func (parser *Parser) Unwrap(enc Enclosure) (string, error)

Unwrap attempts to unravel a substring enclosed between two characters described with an Enclosure. When calling Unwrap, the parse cursor must be on the opening character of the given Enclosure. Returns an error if the opening character is not found or if the symbol terminates before the closing character.

Note: Unwrap will resolve nested enclosures attempting to match one opening character with one closing character until it fully resolves.

type ParserOption

type ParserOption func(config *parseConfig)

ParserOption represents an option to modify the Parser behaviour. It must be provided with the constructor for Parser.

func IgnoreWhitespaces

func IgnoreWhitespaces() ParserOption

IgnoreWhitespaces returns a ParserOption that specifies the Parser to ignore unicode characters with the whitespace property (' ', '\t', '\n', '\r', etc). They are consumed instead of generating Tokens for them.

func Keywords

func Keywords(keywords map[string]TokenKind) ParserOption

Keywords returns a ParserOption that can be used to provide the Parser with a set of special keywords mapped to some custom TokenKind value. If the Parser encounters identifiers that match any of the given keywords, it returns a Token with the given kind and the actual literal encountered. Any default keywords are overwritten if specified in the custom set.

Note: Use TokenKind values less than -10 for custom Token classes. -10 to -1 are reserved for standard token classes while 0 and above correspond to unicode code points.

type Token

type Token struct {
	Kind     TokenKind
	Literal  string
	Position int
}

Token represents a lexical Token. It may be either a lone unicode character or some literal value

func EOFToken

func EOFToken(pos int) Token

EOFToken returns an End of File Token

func UnicodeToken

func UnicodeToken(char rune, pos int) Token

UnicodeToken returns a Token for a given rune character. The TokenKind of the returned Token has the same value as its unicode code point.

func (Token) Value added in v0.2.0

func (token Token) Value() (any, error)

Value returns an object value for the Token.

If the Token is kind TokenString -> string (literal is returned as is)
If the Token is kind TokenBoolean -> bool (parsed with strconv.ParseBool)
If the Token is kind TokenNumber -> uint64/int64 (parsed with strconv depending on whether a negative sign is present)
If the Token is kind TokenHexNumber -> []byte (decoded with hex.DecodeString after trimming the 0x)

All other Token kinds return an error when converted to a value.

type TokenKind

type TokenKind int32

TokenKind is an enum for representing token grouping/values. For unicode tokens, the TokenKind is equal to its code point value. For literals such as identifiers and numerics, the TokenKind values descend from 0.

Note: Custom TokenKind values can be used by external packages for keyword detection for special literals, but these values should be below -10 to prevent collisions

const (
	TokenEoF TokenKind = -(iota + 1)
	TokenMalformed

	TokenIdent
	TokenNumber
	TokenString
	TokenBoolean
	TokenHexNumber
)

func (TokenKind) CanValue added in v0.2.0

func (kind TokenKind) CanValue() bool

CanValue returns whether the TokenKind can be converted into a value

func (TokenKind) String

func (kind TokenKind) String() string

String implements the Stringer interface for TokenKind
