tokenizer

package
v1.10.2 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 13, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

README

SQL Tokenizer Package

Overview

The tokenizer package provides a high-performance, zero-copy SQL lexical analyzer that converts SQL text into tokens. It supports multiple SQL dialects with full Unicode support and comprehensive operator recognition.

Key Features

  • Zero-Copy Operation: Works directly on input bytes without string allocation
  • Unicode Support: Full UTF-8 support for international SQL (8+ languages tested)
  • Multi-Dialect: PostgreSQL, MySQL, SQL Server, Oracle, SQLite operators and syntax
  • Object Pooling: 60-80% memory reduction through instance reuse
  • Position Tracking: Precise line/column information for error reporting
  • DOS Protection: Token limits and input size validation
  • Thread-Safe: All pool operations are race-free

Performance

  • Throughput: 8M tokens/second sustained
  • Latency: Sub-microsecond tokenization for typical queries
  • Memory: Minimal allocations with zero-copy design
  • Concurrency: Validated race-free with 20,000+ concurrent operations

Usage

Basic Tokenization
package main

import (
    "fmt"

    "github.com/ajitpratap0/GoSQLX/pkg/sql/tokenizer"
)

func main() {
    // Get tokenizer from pool
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)  // ALWAYS return to pool

    // Tokenize SQL
    sql := []byte("SELECT * FROM users WHERE active = true")
    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        // Handle tokenization error
    }

    // Process tokens
    for _, tok := range tokens {
        fmt.Printf("%s at line %d, col %d\n",
            tok.Token.Value,
            tok.Start.Line,
            tok.Start.Column)
    }
}
Batch Processing
func ProcessMultipleQueries(queries []string) {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    for _, query := range queries {
        tokens, err := tkz.Tokenize([]byte(query))
        if err != nil {
            continue
        }

        // Process tokens
        // ...

        tkz.Reset()  // Reset between uses
    }
}
Concurrent Tokenization
func ConcurrentTokenization(queries []string) {
    var wg sync.WaitGroup

    for _, query := range queries {
        wg.Add(1)
        go func(sql string) {
            defer wg.Done()

            // Each goroutine gets its own tokenizer
            tkz := tokenizer.GetTokenizer()
            defer tokenizer.PutTokenizer(tkz)

            tokens, _ := tkz.Tokenize([]byte(sql))
            _ = tokens // Process tokens...
        }(query)
    }

    wg.Wait()
}

Token Types

Keywords
SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, HAVING, LIMIT, OFFSET,
INSERT, UPDATE, DELETE, CREATE, ALTER, DROP, WITH, UNION, EXCEPT, INTERSECT, etc.
Identifiers
  • Standard: user_id, TableName, column123
  • Quoted: "column name" (SQL standard)
  • Backtick: `column` (MySQL)
  • Bracket: [column] (SQL Server)
  • Unicode: "名前", "имя", "الاسم" (international)
Literals
  • Numbers: 42, 3.14, 1.5e10, 0xFF
  • Strings: 'hello', 'it''s' (escaped quotes)
  • Booleans: TRUE, FALSE
  • NULL: NULL
Operators
  • Comparison: =, <>, !=, <, >, <=, >=
  • Arithmetic: +, -, *, /, %
  • Logical: AND, OR, NOT
  • PostgreSQL: @>, <@, ->, ->>, #>, ?, ||
  • Pattern: LIKE, ILIKE, SIMILAR TO

Dialect-Specific Features

PostgreSQL
-- Array operators
SELECT * FROM users WHERE tags @> ARRAY['admin']

-- JSON operators
SELECT data->>'email' FROM users

-- String concatenation
SELECT first_name || ' ' || last_name FROM users
MySQL
-- Backtick identifiers
SELECT `user_id` FROM `users`

-- Double pipe as OR
SELECT * FROM users WHERE status = 1 || status = 2
SQL Server
-- Bracket identifiers
SELECT [User ID] FROM [User Table]

-- String concatenation with +
SELECT FirstName + ' ' + LastName FROM Users

Architecture

Core Files
  • tokenizer.go: Main tokenizer logic
  • string_literal.go: String parsing with escape sequence handling
  • unicode.go: Unicode identifier and quote normalization
  • position.go: Position tracking (line, column, byte offset)
  • pool.go: Object pool management
  • buffer.go: Internal buffer pool for performance
  • error.go: Structured error types
Tokenization Pipeline
Input bytes → Position tracking → Character scanning → Token recognition → Output tokens

Error Handling

Detailed Error Information
tokens, err := tkz.Tokenize(sqlBytes)
if err != nil {
    if tokErr, ok := err.(*tokenizer.Error); ok {
        fmt.Printf("Error at line %d, column %d: %s\n",
            tokErr.Location.Line,
            tokErr.Location.Column,
            tokErr.Message)
    }
}
Common Error Types
  • Unterminated String: Missing closing quote
  • Invalid Number: Malformed numeric literal
  • Invalid Character: Unexpected character in input
  • Invalid Escape: Unknown escape sequence in string

DOS Protection

Token Limit
// Default: 1,000,000 tokens per query (MaxTokens)
// Prevents memory exhaustion from malicious input
Input Size Validation
// Configurable maximum input size
// Default: 10MB per query

Unicode Support

Supported Scripts
  • Latin: English, Spanish, French, German, etc.
  • Cyrillic: Russian, Ukrainian, Bulgarian, etc.
  • CJK: Chinese, Japanese, Korean
  • Arabic: Arabic, Persian, Urdu
  • Devanagari: Hindi, Sanskrit
  • Greek, Hebrew, Thai, and more
Example
sql := `
    SELECT "名前" AS name,
           "возраст" AS age,
           "البريد_الإلكتروني" AS email
    FROM "المستخدمون"
    WHERE "نشط" = true
`
tokens, _ := tkz.Tokenize([]byte(sql))

Testing

Run tokenizer tests:

# All tests
go test -v ./pkg/sql/tokenizer/

# With race detection (MANDATORY during development)
go test -race ./pkg/sql/tokenizer/

# Specific features
go test -v -run TestTokenizer_Unicode ./pkg/sql/tokenizer/
go test -v -run TestTokenizer_PostgreSQL ./pkg/sql/tokenizer/

# Performance benchmarks
go test -bench=BenchmarkTokenizer -benchmem ./pkg/sql/tokenizer/

# Fuzz testing
go test -fuzz=FuzzTokenizer -fuzztime=30s ./pkg/sql/tokenizer/

Best Practices

1. Always Use Object Pool
// GOOD: Use pool
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

// BAD: Direct instantiation
tkz := &tokenizer.Tokenizer{}  // Skips keyword initialization and pool benefits; use New() or GetTokenizer()
2. Reset Between Uses
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

for _, query := range queries {
    tokens, _ := tkz.Tokenize([]byte(query))
    // ... process tokens
    tkz.Reset()  // Reset state for next query
}
3. Use Byte Slices
// GOOD: Zero-copy when the SQL is already a byte slice
tokens, _ := tkz.Tokenize(sqlBytes)

// LESS EFFICIENT: Converting a string allocates a copy of the input
tokens, _ := tkz.Tokenize([]byte(sqlString))

Common Pitfalls

❌ Forgetting to Return to Pool
// BAD: Memory leak
tkz := tokenizer.GetTokenizer()
tokens, _ := tkz.Tokenize(sql)
// tkz never returned to pool
✅ Correct Pattern
// GOOD: Automatic cleanup
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.Tokenize(sql)
❌ Reusing Without Reset
// BAD: State contamination
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tkz.Tokenize(sql1)  // First use
tkz.Tokenize(sql2)  // State from sql1 still present!
✅ Correct Pattern
// GOOD: Reset between uses
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tkz.Tokenize(sql1)
tkz.Reset()  // Clear state
tkz.Tokenize(sql2)

Performance Tips

1. Minimize Allocations

The tokenizer is designed for zero-copy operation. To maximize performance:

  • Pass []byte directly (avoid string conversions)
  • Reuse tokenizer instances via the pool
  • Process tokens immediately (avoid copying token slices)
2. Batch Processing

For multiple queries, reuse a single tokenizer:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

for _, query := range queries {
    tokens, _ := tkz.Tokenize([]byte(query))
    // Process immediately
    tkz.Reset()
}
3. Concurrent Processing

Each goroutine should get its own tokenizer:

// Each goroutine gets its own instance from pool
go func() {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)
    // ... tokenize and process
}()
Related Packages

  • parser: Consumes tokens to build AST
  • keywords: Keyword recognition and categorization
  • models: Token type definitions
  • metrics: Performance monitoring integration

Version History

  • v1.5.0: Enhanced Unicode support, DOS protection hardening
  • v1.4.0: Production validation, 8M tokens/sec sustained
  • v1.3.0: PostgreSQL operator support expanded
  • v1.2.0: Multi-dialect operator recognition
  • v1.0.0: Initial release with zero-copy design

Documentation

Overview

Package tokenizer provides high-performance SQL tokenization with zero-copy operations and comprehensive Unicode support.

The primary entry points are Tokenize (convert raw SQL bytes to []models.TokenWithSpan), GetTokenizer and PutTokenizer (pool-based instance management for optimal memory efficiency), and TokenizeContext (tokenization with context cancellation support). The tokenizer operates directly on input byte slices without allocating intermediate strings, achieving 8M+ tokens/sec throughput with full UTF-8 support.

The tokenizer package converts raw SQL text into a stream of tokens (lexical analysis) with precise position tracking for error reporting. It is designed for production use with enterprise-grade performance, thread safety, and memory efficiency.

Architecture

The tokenizer uses a zero-copy design that operates directly on input byte slices without string allocations, achieving 8M+ tokens/sec throughput. It includes:

  • Zero-copy byte slice operations for minimal memory allocations
  • Object pooling (GetTokenizer/PutTokenizer) for instance reuse
  • Buffer pooling for internal string operations
  • Position tracking (line/column) for precise error reporting
  • Unicode support for international SQL queries
  • DoS protection with input size and token count limits

Performance Characteristics

The tokenizer is production-validated with the following characteristics:

  • Throughput: 8M+ tokens/sec sustained
  • Memory: Zero-copy operations minimize allocations
  • Thread Safety: Race-free with sync.Pool for object reuse
  • Latency: Sub-microsecond per token on average
  • Pool Hit Rate: 95%+ in production workloads

Thread Safety

The tokenizer is thread-safe when using the pooling API:

  • GetTokenizer() and PutTokenizer() are safe for concurrent use
  • Individual Tokenizer instances are NOT safe for concurrent use
  • Always use one Tokenizer instance per goroutine

Token Types

The tokenizer produces tokens for all SQL elements:

  • Keywords: SELECT, FROM, WHERE, JOIN, etc.
  • Identifiers: table names, column names, aliases
  • Literals: strings ('text'), numbers (123, 45.67, 1e10)
  • Operators: =, <>, +, -, *, /, ||, etc.
  • Punctuation: (, ), [, ], ,, ;, .
  • PostgreSQL JSON operators: ->, ->>, #>, #>>, @>, <@, ?, ?|, ?&, #-
  • Comments: -- line comments and /* block comments */

PostgreSQL Extensions (v1.6.0)

The tokenizer supports PostgreSQL-specific operators:

  • JSON/JSONB operators: -> ->> #> #>> @> <@ ? ?| ?& #-
  • Array operators: && (overlap)
  • Text search: @@ (full text search)
  • Cast operator: :: (double colon)
  • Parameters: @variable (SQL Server style)

Unicode Support

Full Unicode support for international SQL processing:

  • UTF-8 decoding with proper rune handling
  • Unicode quotes: “ ” ‘ ’ « » (normalized to ASCII)
  • Unicode identifiers: letters, digits, combining marks
  • Multi-byte character support in strings and identifiers
  • Proper line/column tracking across Unicode boundaries

DoS Protection

Built-in protection against denial-of-service attacks:

  • MaxInputSize: 10MB input limit (configurable)
  • MaxTokens: 1M token limit per query (configurable)
  • Context support: TokenizeContext() for cancellation
  • Panic recovery: Structured errors on unexpected panics

Object Pooling

Use GetTokenizer/PutTokenizer for optimal performance:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)  // MANDATORY - returns to pool

tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
    return err
}
// Use tokens...

Benefits:

  • 60-80% reduction in allocations
  • 95%+ pool hit rate in production
  • Automatic state reset on return to pool

Basic Usage

Simple tokenization without pooling:

tkz, err := tokenizer.New()
if err != nil {
    return err
}

sql := "SELECT id, name FROM users WHERE active = true"
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
    return err
}

for _, tok := range tokens {
    fmt.Printf("Token: %s (type: %v) at line %d, col %d\n",
        tok.Token.Value, tok.Token.Type, tok.Start.Line, tok.Start.Column)
}

Advanced Usage with Context

Tokenization with timeout and cancellation:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tokens, err := tkz.TokenizeContext(ctx, []byte(sql))
if errors.Is(err, context.DeadlineExceeded) {
    log.Printf("Tokenization timed out")
    return err
}

The context is checked every 100 tokens for responsive cancellation.

Error Handling

The tokenizer returns structured errors with position information:

tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
    // Errors include line/column information
    // Common errors: UnterminatedStringError, UnexpectedCharError,
    // InvalidNumberError, InputTooLargeError, TokenLimitReachedError
    log.Printf("Tokenization error: %v", err)
    return err
}

Position Tracking

Every token includes precise start/end positions:

for _, tokWithSpan := range tokens {
    fmt.Printf("Token '%s' at line %d, column %d-%d\n",
        tokWithSpan.Token.Value,
        tokWithSpan.Start.Line,
        tokWithSpan.Start.Column,
        tokWithSpan.End.Column)
}

Position information is 1-based (first line is 1, first column is 1).

String Literals

The tokenizer handles various string literal formats:

  • Single quotes: 'text', 'can''t' (doubled quotes for escaping)
  • Double quotes: "identifier" (SQL identifiers, not strings)
  • Backticks: `identifier` (MySQL-style identifiers)
  • Triple quotes: '''multiline''' and """multiline"""
  • Unicode quotes: ‘text’ “text” «text» (normalized)
  • Escape sequences: \n \r \t \\ \' \" \uXXXX

Number Formats

Supported number formats:

  • Integers: 123, 0, 999999
  • Decimals: 3.14, 0.5, 999.999
  • Scientific: 1e10, 2.5e-3, 1.23E+4

Comments

Comments are automatically skipped during tokenization:

  • Line comments: -- comment text (until newline)
  • Block comments: /* comment text */ (can span multiple lines)

Identifiers

Identifiers follow SQL standards with extensions:

  • Unquoted: letters, digits, underscore (cannot start with digit)
  • Quoted: "Any Text" (allows spaces, special chars, keywords)
  • Backticked: `Any Text` (MySQL compatibility)
  • Unicode: Full Unicode letter and digit support
  • Compound keywords: GROUP BY, ORDER BY, LEFT JOIN, etc.

Keyword Recognition

Keywords are recognized case-insensitively and mapped to specific token types:

  • DML: SELECT, INSERT, UPDATE, DELETE, MERGE
  • DDL: CREATE, ALTER, DROP, TRUNCATE
  • Joins: JOIN, LEFT JOIN, INNER JOIN, CROSS JOIN, etc.
  • CTEs: WITH, RECURSIVE, UNION, EXCEPT, INTERSECT
  • Grouping: GROUP BY, ROLLUP, CUBE, GROUPING SETS
  • Window: OVER, PARTITION BY, ROWS, RANGE, etc.
  • PostgreSQL: DISTINCT ON, FILTER, RETURNING, LATERAL

Memory Management

The tokenizer uses several strategies for memory efficiency:

  • Tokenizer pooling: Reuse instances with sync.Pool
  • Buffer pooling: Reuse byte buffers for string operations
  • Zero-copy: Operate on input slices without allocation
  • Slice reuse: Preserve capacity when resetting state
  • Metrics tracking: Monitor pool efficiency and memory usage

Integration with Parser

Typical integration pattern with the parser:

// Get tokenizer from pool
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

// Tokenize SQL
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
    return nil, err
}

// Parse tokens to AST
ast, err := parser.Parse(tokens)
if err != nil {
    return nil, err
}

Production Deployment

Best practices for production use:

  1. Always use GetTokenizer/PutTokenizer for pooling efficiency
  2. Use defer to ensure PutTokenizer is called even on errors
  3. Monitor metrics for pool hit rates and performance
  4. Configure DoS limits (MaxInputSize, MaxTokens) for your use case
  5. Use TokenizeContext for long-running operations
  6. Test with your actual SQL workload for realistic validation

Metrics and Monitoring

The tokenizer integrates with pkg/metrics for observability:

  • Tokenization duration and throughput
  • Pool get/put operations and hit rates
  • Error counts by category
  • Input size and token count distributions

Access metrics via the metrics package for production monitoring.

Validation Status

Production-ready with comprehensive validation:

  • Race detection: Zero race conditions (20,000+ concurrent operations tested)
  • Performance: 8M+ tokens/sec sustained throughput
  • Unicode: Full international support (8 languages validated)
  • Reliability: 95%+ success rate on real-world SQL queries
  • Memory: Zero leaks detected under extended load testing

Examples

See the tokenizer_test.go file for comprehensive examples including:

  • Basic tokenization
  • Unicode handling
  • PostgreSQL operators
  • Error cases
  • Performance benchmarks
  • Pool usage patterns

Index

Constants

View Source
const (
	// MaxInputSize is the maximum allowed input size in bytes (10MB default).
	//
	// This limit prevents denial-of-service (DoS) attacks via extremely large
	// SQL queries that could exhaust server memory. Queries exceeding this size
	// will return an InputTooLargeError.
	//
	// Rationale:
	//   - 10MB is sufficient for complex SQL queries with large IN clauses
	//   - Protects against malicious or accidental memory exhaustion
	//   - Can be increased if needed for legitimate large queries
	//
	// If your application requires larger queries, consider:
	//   - Breaking queries into smaller batches
	//   - Using prepared statements with parameter binding
	//   - Increasing the limit (but ensure adequate memory protection)
	MaxInputSize = 10 * 1024 * 1024 // 10MB

	// MaxTokens is the maximum number of tokens allowed in a single SQL query
	// (1M tokens default).
	//
	// This limit prevents denial-of-service (DoS) attacks via "token explosion"
	// where maliciously crafted or accidentally generated SQL creates an excessive
	// number of tokens, exhausting CPU and memory.
	//
	// Rationale:
	//   - 1M tokens is far beyond any reasonable SQL query size
	//   - Typical queries have 10-1000 tokens
	//   - Complex queries rarely exceed 10,000 tokens
	//   - Protects against pathological cases and attacks
	//
	// Example token counts:
	//   - Simple SELECT: ~10-50 tokens
	//   - Complex query with joins: ~100-500 tokens
	//   - Large IN clause with 1000 values: ~3000-4000 tokens
	//
	// If this limit is hit on a legitimate query, the query should likely
	// be redesigned for better performance and maintainability.
	MaxTokens = 1000000 // 1M tokens
)

Variables

This section is empty.

Functions

func PutTokenizer

func PutTokenizer(t *Tokenizer)

PutTokenizer returns a Tokenizer instance to the pool for reuse.

This must be called after you're done with a Tokenizer obtained from GetTokenizer() to enable instance reuse and prevent memory leaks.

The tokenizer is automatically reset before being returned to the pool, clearing all state including input references, positions, and debug loggers.

Thread Safety: Safe for concurrent calls from multiple goroutines.

Best Practice: Always use with defer immediately after GetTokenizer():

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)  // MANDATORY

Nil Safety: Safely ignores nil tokenizers (no-op).

Metrics: Records pool put operations for monitoring pool efficiency.

State Reset:

  • Input reference cleared (enables GC of SQL bytes)
  • Position tracking reset to initial state
  • Line tracking cleared but capacity preserved
  • Debug logger cleared
  • Keywords preserved (immutable configuration)

Types

type BufferPool

type BufferPool struct {
	// contains filtered or unexported fields
}

BufferPool manages a pool of reusable byte buffers for token content.

This pool is used for temporary byte slice operations during tokenization, such as accumulating identifier characters or building string literal content. It complements the bytes.Buffer pool used elsewhere in the tokenizer.

The pool is designed for byte slices rather than bytes.Buffer for cases where direct slice manipulation is more efficient than buffer operations.

Thread Safety: Safe for concurrent use across multiple goroutines.

Initial Capacity: Buffers are pre-allocated with 128 bytes capacity, suitable for most SQL tokens (identifiers, keywords, short string literals).

func NewBufferPool

func NewBufferPool() *BufferPool

NewBufferPool creates a new buffer pool with optimized initial capacity.

The pool pre-allocates byte slices with 128-byte capacity, which is sufficient for most SQL tokens without excessive memory waste.

Returns a BufferPool ready for use with Get/Put operations.

Example:

pool := NewBufferPool()
buf := pool.Get()
defer pool.Put(buf)
// Use buf for byte operations...

func (*BufferPool) Get

func (p *BufferPool) Get() []byte

Get retrieves a buffer from the pool.

The returned buffer has zero length but may have capacity >= 128 bytes from previous use. This allows efficient appending without reallocation for typical SQL tokens.

Thread Safety: Safe for concurrent calls.

The buffer must be returned to the pool via Put() when done to enable reuse.

Returns a byte slice ready for use (length 0, capacity >= 128).

func (*BufferPool) Grow

func (p *BufferPool) Grow(buf []byte, n int) []byte

Grow ensures the buffer has enough capacity for n additional bytes.

If the buffer doesn't have enough spare capacity, a new larger buffer is allocated with doubled capacity plus n bytes. The old buffer is returned to the pool.

Growth Strategy: new capacity = 2 * old capacity + n. This exponential growth with a minimum increment minimizes reallocations while preventing excessive memory waste.

Parameters:

  • buf: The current buffer
  • n: Number of additional bytes needed

Returns:

  • The original buffer if it has sufficient capacity
  • A new, larger buffer with contents copied if reallocation was needed

Example:

buf := pool.Get()
buf = pool.Grow(buf, 256)  // Ensure 256 bytes available
buf = append(buf, data...)  // Append without reallocation

func (*BufferPool) Put

func (p *BufferPool) Put(buf []byte)

Put returns a buffer to the pool for reuse.

The buffer's capacity is preserved, allowing it to be reused for similarly-sized operations without reallocation. Buffers with zero capacity are discarded.

Thread Safety: Safe for concurrent calls.

It's safe to call Put multiple times with the same buffer, though only the first call will be effective (subsequent calls operate on a reset buffer).

Parameters:

  • buf: The byte slice to return to the pool

type Error

type Error struct {
	Message  string          // Human-readable error message
	Location models.Location // Position where the error occurred (1-based)
}

Error represents a tokenization error with precise location information.

This type provides structured error reporting with line and column positions, making it easy for users to identify and fix SQL syntax issues.

Note: Modern code should use the errors from pkg/errors package instead, which provide more comprehensive error categorization and context. This type is maintained for backward compatibility.

Example:

if err != nil {
    if tokErr, ok := err.(*tokenizer.Error); ok {
        fmt.Printf("Tokenization failed at line %d, column %d: %s\n",
            tokErr.Location.Line, tokErr.Location.Column, tokErr.Message)
    }
}

func ErrorInvalidIdentifier

func ErrorInvalidIdentifier(value string, location models.Location) *Error

ErrorInvalidIdentifier creates an error for an invalid identifier.

This is used when an identifier has invalid syntax, such as:

  • Starting with a digit (when not quoted)
  • Containing invalid characters
  • Unterminated quoted identifier

Parameters:

  • value: The invalid identifier string
  • location: Position where the identifier started

Returns an Error describing the invalid identifier.

Example: "invalid identifier: 123abc at line 2, column 8"

func ErrorInvalidNumber

func ErrorInvalidNumber(value string, location models.Location) *Error

ErrorInvalidNumber creates an error for an invalid number format.

This is used when a number token has invalid syntax, such as:

  • Decimal point without digits: "123."
  • Exponent without digits: "123e"
  • Multiple decimal points: "12.34.56"

Parameters:

  • value: The invalid number string
  • location: Position where the number started

Returns an Error describing the invalid number format.

Example: "invalid number format: 123.e at line 1, column 10"

func ErrorInvalidOperator

func ErrorInvalidOperator(value string, location models.Location) *Error

ErrorInvalidOperator creates an error for an invalid operator.

This is used when an operator token has invalid syntax, such as:

  • Incomplete multi-character operators
  • Invalid operator combinations

Parameters:

  • value: The invalid operator string
  • location: Position where the operator started

Returns an Error describing the invalid operator.

Example: "invalid operator: <=> at line 1, column 20"

func ErrorUnexpectedChar

func ErrorUnexpectedChar(ch byte, location models.Location) *Error

ErrorUnexpectedChar creates an error for an unexpected character.

This is used when the tokenizer encounters a character that cannot start any valid token in the current context.

Parameters:

  • ch: The unexpected character (byte)
  • location: Position where the character was found

Returns an Error describing the unexpected character.

Example: "unexpected character: @ at line 2, column 5"

func ErrorUnterminatedString

func ErrorUnterminatedString(location models.Location) *Error

ErrorUnterminatedString creates an error for an unterminated string literal.

This occurs when a string literal (single or double quoted) is not properly closed before the end of the line or input.

Parameters:

  • location: Position where the string started

Returns an Error indicating the string was not terminated.

Example: "unterminated string literal at line 3, column 15"

func NewError

func NewError(message string, location models.Location) *Error

NewError creates a new tokenization error with a message and location.

Parameters:

  • message: Human-readable description of the error
  • location: Position in the input where the error occurred

Returns a pointer to an Error with the specified message and location.

func (*Error) Error

func (e *Error) Error() string

Error implements the error interface, returning a formatted error message with location information.

Format: "<message> at line <line>, column <column>"

Example output: "unterminated string literal at line 5, column 23"

type Position

type Position struct {
	Line   int // Current line number (1-based)
	Index  int // Current byte offset into input (0-based)
	Column int // Current column number (1-based)
	LastNL int // Byte offset of last newline (for efficient column calculation)
}

Position tracks the scanning cursor position during tokenization. It maintains both absolute byte offset and human-readable line/column coordinates for precise error reporting and token span tracking.

Coordinate System:

  • Line: 1-based (first line is line 1)
  • Column: 1-based (first column is column 1)
  • Index: 0-based byte offset into input (first byte is index 0)
  • LastNL: Byte offset of most recent newline (for column calculation)

Zero-Copy Design: Position operates on byte indices rather than rune indices for performance. UTF-8 decoding happens only when needed during character scanning.

Thread Safety: Position is not thread-safe. Each Tokenizer instance should have its own Position that is not shared across goroutines.

func NewPosition

func NewPosition(line, index int) Position

NewPosition creates a new Position with the specified line and byte index. The column is initialized to 1 (first column).

Parameters:

  • line: Line number (1-based, typically starts at 1)
  • index: Byte offset into input (0-based, typically starts at 0)

Returns a Position ready for use in tokenization.

func (*Position) AdvanceN

func (p *Position) AdvanceN(n int, lineStarts []int)

AdvanceN moves the position forward by n bytes and recalculates the line and column numbers using the provided line start indices.

This is used when jumping forward in the input (e.g., after skipping a comment block) where individual rune tracking would be inefficient.

Parameters:

  • n: Number of bytes to advance
  • lineStarts: Slice of byte offsets where each line starts (from tokenizer)

Performance: O(L) where L is the number of lines in lineStarts. For typical SQL queries with few lines, this is effectively O(1).

If n <= 0, this is a no-op.

func (*Position) AdvanceRune

func (p *Position) AdvanceRune(r rune, size int)

AdvanceRune moves the position forward by one UTF-8 rune. This updates the byte index, line number, and column number appropriately.

Newline Handling: When r is '\n', the line number increments and the column resets to 1.

Parameters:

  • r: The rune being consumed (used to detect newlines)
  • size: The byte size of the rune in UTF-8 encoding

Performance: O(1) operation, no string allocations.

Example:

r, size := utf8.DecodeRune(input[pos.Index:])
pos.AdvanceRune(r, size)  // Move past this rune

func (Position) Clone

func (p Position) Clone() Position

Clone creates a copy of this Position. The returned Position is independent and can be modified without affecting the original.

This is useful when you need to save a position (e.g., for backtracking during compound keyword parsing) and then potentially restore it.

Returns a new Position with identical values.

func (Position) Location

func (p Position) Location(t *Tokenizer) models.Location

Location converts this Position to a models.Location using the tokenizer's line tracking information for accurate column calculation.

This method uses the tokenizer's lineStarts slice to calculate the exact column position, accounting for variable-width UTF-8 characters and tabs.

Returns a models.Location with 1-based line and column numbers.

type StringLiteralReader

type StringLiteralReader struct {
	// contains filtered or unexported fields
}

StringLiteralReader handles reading of string literals with proper escape sequence handling

func NewStringLiteralReader

func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader

NewStringLiteralReader creates a new StringLiteralReader

func (*StringLiteralReader) ReadStringLiteral

func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)

ReadStringLiteral reads a string literal with proper escape sequence handling

type Tokenizer

type Tokenizer struct {
	Comments []models.Comment // Comments captured during tokenization
	// contains filtered or unexported fields
}

Tokenizer provides high-performance SQL tokenization with zero-copy operations. It converts raw SQL bytes into a stream of tokens with precise position tracking.

Features:

  • Zero-copy operations on input byte slices (no string allocations)
  • Precise line/column tracking for error reporting (1-based indexing)
  • Unicode support for international SQL queries
  • PostgreSQL operator support (JSON, array, text search operators)
  • DoS protection with input size and token count limits

Thread Safety:

  • Individual instances are NOT safe for concurrent use
  • Use GetTokenizer/PutTokenizer for safe pooling across goroutines
  • Each goroutine should use its own Tokenizer instance

Memory Management:

  • Reuses internal buffers to minimize allocations
  • Preserves slice capacity across Reset() calls
  • Integrates with sync.Pool for instance reuse

Usage:

// With pooling (recommended for production)
tkz := GetTokenizer()
defer PutTokenizer(tkz)
tokens, err := tkz.Tokenize([]byte(sql))

// Without pooling (simple usage)
tkz, _ := New()
tokens, err := tkz.Tokenize([]byte(sql))

func GetTokenizer

func GetTokenizer() *Tokenizer

GetTokenizer retrieves a Tokenizer instance from the pool.

This is the recommended way to obtain a Tokenizer for production use. The returned tokenizer is reset and ready for use.

Thread Safety: Safe for concurrent calls from multiple goroutines. Each call returns a separate instance.

Memory Management: Always pair with PutTokenizer() using defer to ensure the instance is returned to the pool, even if errors occur.

Metrics: Records pool get operations for monitoring pool efficiency.

Example:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)  // MANDATORY - ensures pool return

tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
    return err  // defer ensures PutTokenizer is called
}
// Process tokens...

Performance: 95%+ hit rate means most calls reuse existing instances rather than allocating new ones, providing significant performance benefits.

func New

func New() (*Tokenizer, error)

New creates a new Tokenizer with default configuration and keyword support. The returned tokenizer is ready to use for tokenizing SQL statements.

For production use, prefer GetTokenizer() which uses object pooling for better performance and reduced allocations.

Returns an error only if keyword initialization fails (extremely rare).

Example:

tkz, err := tokenizer.New()
if err != nil {
    return err
}
tokens, err := tkz.Tokenize([]byte("SELECT * FROM users"))

func NewWithDialect added in v1.9.3

func NewWithDialect(dialect keywords.SQLDialect) (*Tokenizer, error)

NewWithDialect creates a new Tokenizer configured for the given SQL dialect. Dialect-specific keywords are recognized based on the dialect parameter. If dialect is empty or unknown, defaults to DialectPostgreSQL.

func NewWithKeywords

func NewWithKeywords(kw *keywords.Keywords) (*Tokenizer, error)

NewWithKeywords initializes a Tokenizer with a custom keyword classifier. This allows you to customize keyword recognition for specific SQL dialects or to add custom keywords.

The keywords parameter must not be nil.

Returns an error if keywords is nil.

Example:

kw := keywords.NewKeywords()
// Customize keywords as needed...
tkz, err := tokenizer.NewWithKeywords(kw)
if err != nil {
    return err
}

func (*Tokenizer) Dialect added in v1.9.3

func (t *Tokenizer) Dialect() keywords.SQLDialect

Dialect returns the SQL dialect configured for this tokenizer.

func (*Tokenizer) Reset

func (t *Tokenizer) Reset()

Reset clears a Tokenizer's state for reuse while preserving allocated memory.

This method is called automatically by PutTokenizer() and generally should not be called directly by users. It's exposed for advanced use cases where you want to reuse a tokenizer instance without going through the pool.

Memory Optimization:

  • Clears input reference (allows GC of SQL bytes)
  • Resets position tracking to initial values
  • Preserves lineStarts slice capacity (avoids reallocation)
  • Clears debug logger reference

State After Reset:

  • pos: Line 1, Column 0, Index 0
  • lineStarts: reset to a single entry ([0]) with capacity preserved
  • input: nil (ready for new input)
  • keywords: Preserved (immutable, no need to reset)
  • logger: nil (must be set again if needed)

Performance: By preserving slice capacity, subsequent Tokenize() calls avoid reallocation of lineStarts for similarly-sized inputs.

func (*Tokenizer) SetDialect added in v1.9.3

func (t *Tokenizer) SetDialect(dialect keywords.SQLDialect)

SetDialect reconfigures the tokenizer for a different SQL dialect. This rebuilds the keyword set to include dialect-specific keywords.

func (*Tokenizer) SetLogger added in v1.9.3

func (t *Tokenizer) SetLogger(logger *slog.Logger)

SetLogger configures a structured logger for verbose tracing during tokenization. The logger receives slog.Debug messages for each token produced, which is useful for diagnosing tokenization issues or understanding token stream structure.

Pass nil to disable debug logging (the default).

Logging is guarded by slog.LevelDebug checks so there is no performance cost when the handler's minimum level is above Debug.

Example:

tkz := tokenizer.GetTokenizer()
tkz.SetLogger(slog.Default())
tokens, _ := tkz.Tokenize([]byte(sql))

To disable:

tkz.SetLogger(nil)

Thread Safety: The logger may be called from multiple goroutines if tokenizers are used concurrently. *slog.Logger is safe for concurrent use.

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)

Tokenize converts raw SQL bytes into a slice of tokens with position information.

This is the main entry point for tokenization. It performs zero-copy tokenization directly on the input byte slice and returns tokens with precise start/end positions.

Performance: 8M+ tokens/sec sustained throughput with zero-copy operations.

DoS Protection:

  • Input size limited to MaxInputSize (10MB default)
  • Token count limited to MaxTokens (1M default)
  • Returns errors if limits exceeded

Position Tracking:

  • All positions are 1-based (first line is 1, first column is 1)
  • Start position is inclusive, end position is exclusive
  • Position information preserved for all tokens including EOF

Error Handling:

  • Returns structured errors with precise position information
  • Common errors: UnterminatedStringError, UnexpectedCharError, InvalidNumberError
  • Errors include line/column location and context

Parameters:

  • input: Raw SQL bytes to tokenize (not modified, zero-copy reference)

Returns:

  • []models.TokenWithSpan: Slice of tokens with position spans (includes EOF token)
  • error: Tokenization error with position information, or nil on success

Example:

tkz := GetTokenizer()
defer PutTokenizer(tkz)

sql := "SELECT id, name FROM users WHERE active = true"
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
    if tkErr, ok := err.(errors.TokenizerError); ok {
        log.Printf("Tokenization error at line %d: %v",
            tkErr.Location.Line, err)
    }
    return err
}

for _, tok := range tokens {
    fmt.Printf("Token: %s (type: %v) at %d:%d\n",
        tok.Token.Value, tok.Token.Type,
        tok.Start.Line, tok.Start.Column)
}

PostgreSQL Operators (v1.6.0):

sql := "SELECT data->'field' FROM table WHERE config @> '{\"key\":\"value\"}'"
tokens, _ := tkz.Tokenize([]byte(sql))
// Produces tokens for: -> (JSON field access), @> (JSONB contains)

Unicode Support:

sql := "SELECT 名前 FROM ユーザー WHERE 'こんにちは'"
tokens, _ := tkz.Tokenize([]byte(sql))
// Correctly tokenizes Unicode identifiers and string literals

func (*Tokenizer) TokenizeContext added in v1.5.0

func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)

TokenizeContext processes the input and returns tokens with context support for cancellation. It checks the context at regular intervals (every 100 tokens) to enable fast cancellation. Returns context.Canceled or context.DeadlineExceeded when the context is cancelled.

This method is useful for:

  • Long-running tokenization operations that need to be cancellable
  • Implementing timeouts for tokenization
  • Graceful shutdown scenarios

Example:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tokens, err := tkz.TokenizeContext(ctx, []byte(sql))
if errors.Is(err, context.DeadlineExceeded) {
    // Handle timeout
}
