Documentation ¶
Overview ¶
Package tokenizer provides high-performance SQL tokenization with zero-copy operations and comprehensive Unicode support.
The primary entry points are Tokenize (convert raw SQL bytes to []models.TokenWithSpan), GetTokenizer and PutTokenizer (pool-based instance management for optimal memory efficiency), and TokenizeContext (tokenization with context cancellation support). The tokenizer operates directly on input byte slices without allocating intermediate strings, achieving 8M+ tokens/sec throughput with full UTF-8 support.
The tokenizer package converts raw SQL text into a stream of tokens (lexical analysis) with precise position tracking for error reporting. It is designed for production use with enterprise-grade performance, thread safety, and memory efficiency.
Architecture ¶
The tokenizer uses a zero-copy design that operates directly on input byte slices without string allocations, achieving 8M+ tokens/sec throughput. It includes:
- Zero-copy byte slice operations for minimal memory allocations
- Object pooling (GetTokenizer/PutTokenizer) for instance reuse
- Buffer pooling for internal string operations
- Position tracking (line/column) for precise error reporting
- Unicode support for international SQL queries
- DoS protection with input size and token count limits
Performance Characteristics ¶
The tokenizer is production-validated with the following characteristics:
- Throughput: 8M+ tokens/sec sustained
- Memory: Zero-copy operations minimize allocations
- Thread Safety: Race-free with sync.Pool for object reuse
- Latency: Sub-microsecond per token on average
- Pool Hit Rate: 95%+ in production workloads
Thread Safety ¶
The tokenizer is thread-safe when using the pooling API:
- GetTokenizer() and PutTokenizer() are safe for concurrent use
- Individual Tokenizer instances are NOT safe for concurrent use
- Always use one Tokenizer instance per goroutine
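The per-goroutine rule can be sketched with a simplified stand-in. The `scanner` type, `pool` variable, and `countBytes` helper below are illustrative stand-ins for the package's Tokenizer and GetTokenizer/PutTokenizer, not its actual code: the pool itself is shared safely, while each goroutine works on its own instance.

```go
package main

import (
	"fmt"
	"sync"
)

// scanner is a simplified stand-in for a Tokenizer: it holds per-call
// state, so a single instance must never be shared across goroutines.
type scanner struct {
	input []byte
}

func (s *scanner) reset() { s.input = nil }

// pool mirrors the GetTokenizer/PutTokenizer pattern: the pool is safe
// for concurrent use; each Get hands a goroutine its own instance.
var pool = sync.Pool{New: func() any { return new(scanner) }}

func countBytes(sql string) int {
	s := pool.Get().(*scanner) // one instance per goroutine
	defer func() {
		s.reset() // clear state before returning to the pool
		pool.Put(s)
	}()
	s.input = []byte(sql)
	return len(s.input)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { // each goroutine gets and returns its own scanner
			defer wg.Done()
			_ = countBytes("SELECT 1")
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```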
Token Types ¶
The tokenizer produces tokens for all SQL elements:
- Keywords: SELECT, FROM, WHERE, JOIN, etc.
- Identifiers: table names, column names, aliases
- Literals: strings ('text'), numbers (123, 45.67, 1e10)
- Operators: =, <>, +, -, *, /, ||, etc.
- Punctuation: ( ) [ ] , ; .
- PostgreSQL JSON operators: ->, ->>, #>, #>>, @>, <@, ?, ?|, ?&, #-
- Comments: -- line comments and /* block comments */
PostgreSQL Extensions (v1.6.0) ¶
The tokenizer supports PostgreSQL-specific operators:
- JSON/JSONB operators: -> ->> #> #>> @> <@ ? ?| ?& #-
- Array operators: && (overlap)
- Text search: @@ (full text search)
- Cast operator: :: (double colon)
- Parameters: @variable (SQL Server style)
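The operator list above implies a longest-match-first scan, so that ->> is never mis-read as -> followed by >. The sketch below illustrates that idea under the assumption of a simple prefix table ordered longest-first; `pgOps` and `matchOp` are illustrative, not the package's actual matching code.

```go
package main

import (
	"fmt"
	"strings"
)

// pgOps lists multi-character PostgreSQL operators longest-first so a
// greedy scan matches "->>" before "->" and "#>>" before "#>".
var pgOps = []string{"->>", "#>>", "->", "#>", "#-", "@>", "<@", "?|", "?&", "@@", "&&", "::", "?"}

// matchOp returns the operator at the start of s, or "" if none matches.
func matchOp(s string) string {
	for _, op := range pgOps { // longest candidates first
		if strings.HasPrefix(s, op) {
			return op
		}
	}
	return ""
}

func main() {
	fmt.Println(matchOp("->>'name'")) // ->>
	fmt.Println(matchOp("@> '{}'"))   // @>
}
```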
Unicode Support ¶
Full Unicode support for international SQL processing:
- UTF-8 decoding with proper rune handling
- Unicode quotes: “ ” ‘ ’ « » (normalized to ASCII)
- Unicode identifiers: letters, digits, combining marks
- Multi-byte character support in strings and identifiers
- Proper line/column tracking across Unicode boundaries
DoS Protection ¶
Built-in protection against denial-of-service attacks:
- MaxInputSize: 10MB input limit (configurable)
- MaxTokens: 1M token limit per query (configurable)
- Context support: TokenizeContext() for cancellation
- Panic recovery: Structured errors on unexpected panics
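The size guard works by rejecting oversized input before any tokenization begins. The following standalone sketch mirrors the MaxInputSize check (the `checkInput` helper and `errInputTooLarge` sentinel are assumptions for illustration, not the package's own error values):

```go
package main

import (
	"errors"
	"fmt"
)

// maxInputSize mirrors the package's MaxInputSize constant (10MB default).
const maxInputSize = 10 * 1024 * 1024

var errInputTooLarge = errors.New("input too large")

// checkInput rejects oversized input up front, the same guard Tokenize
// applies to prevent memory exhaustion from huge queries.
func checkInput(input []byte) error {
	if len(input) > maxInputSize {
		return errInputTooLarge
	}
	return nil
}

func main() {
	fmt.Println(checkInput([]byte("SELECT 1")))           // <nil>
	fmt.Println(checkInput(make([]byte, maxInputSize+1))) // input too large
}
```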
Object Pooling ¶
Use GetTokenizer/PutTokenizer for optimal performance:
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz) // MANDATORY - returns to pool
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return err
}
// Use tokens...
Benefits:
- 60-80% reduction in allocations
- 95%+ pool hit rate in production
- Automatic state reset on return to pool
Basic Usage ¶
Simple tokenization without pooling:
tkz, err := tokenizer.New()
if err != nil {
return err
}
sql := "SELECT id, name FROM users WHERE active = true"
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return err
}
for _, tok := range tokens {
fmt.Printf("Token: %s (type: %v) at line %d, col %d\n",
tok.Token.Value, tok.Token.Type, tok.Start.Line, tok.Start.Column)
}
Advanced Usage with Context ¶
Tokenization with timeout and cancellation:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.TokenizeContext(ctx, []byte(sql))
if err == context.DeadlineExceeded {
log.Printf("Tokenization timed out")
return err
}
The context is checked every 100 tokens for responsive cancellation.
Error Handling ¶
The tokenizer returns structured errors with position information:
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
// Errors include line/column information
// Common errors: UnterminatedStringError, UnexpectedCharError,
// InvalidNumberError, InputTooLargeError, TokenLimitReachedError
log.Printf("Tokenization error: %v", err)
return err
}
Position Tracking ¶
Every token includes precise start/end positions:
for _, tokWithSpan := range tokens {
fmt.Printf("Token '%s' at line %d, column %d-%d\n",
tokWithSpan.Token.Value,
tokWithSpan.Start.Line,
tokWithSpan.Start.Column,
tokWithSpan.End.Column)
}
Position information is 1-based (first line is 1, first column is 1).
String Literals ¶
The tokenizer handles various string literal formats:
- Single quotes: 'text', 'can''t' (doubled quotes for escaping)
- Double quotes: "identifier" (SQL identifiers, not strings)
- Backticks: `identifier` (MySQL-style identifiers)
- Triple quotes: '''multiline''' """multiline"""
- Unicode quotes: ‘text’ “text” «text» (normalized)
- Escape sequences: \n \r \t \\ \' \" \uXXXX
Number Formats ¶
Supported number formats:
- Integers: 123, 0, 999999
- Decimals: 3.14, 0.5, 999.999
- Scientific: 1e10, 2.5e-3, 1.23E+4
Comments ¶
Comments are automatically skipped during tokenization:
- Line comments: -- comment text (until newline)
- Block comments: /* comment text */ (can span multiple lines)
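The two comment forms can be skipped with a simple prefix scan: a -- comment runs to the next newline, a /* comment to the matching */. The sketch below illustrates those rules (it is an assumption-laden illustration, not the package's code; nested block comments, for instance, are not handled):

```go
package main

import (
	"fmt"
	"strings"
)

// skipComment returns the number of bytes of comment at the start of s,
// or 0 if s does not begin a comment.
func skipComment(s string) int {
	switch {
	case strings.HasPrefix(s, "--"):
		if i := strings.IndexByte(s, '\n'); i >= 0 {
			return i + 1 // consume through the newline
		}
		return len(s) // comment runs to end of input
	case strings.HasPrefix(s, "/*"):
		if i := strings.Index(s, "*/"); i >= 0 {
			return i + 2 // consume through the closing delimiter
		}
		return len(s) // unterminated; a real tokenizer would error here
	}
	return 0
}

func main() {
	fmt.Println(skipComment("-- note\nSELECT 1")) // 8
	fmt.Println(skipComment("/* x */ SELECT 1")) // 7
}
```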
Identifiers ¶
Identifiers follow SQL standards with extensions:
- Unquoted: letters, digits, underscore (cannot start with digit)
- Quoted: "Any Text" (allows spaces, special chars, keywords)
- Backticked: `Any Text` (MySQL compatibility)
- Unicode: Full Unicode letter and digit support
- Compound keywords: GROUP BY, ORDER BY, LEFT JOIN, etc.
Keyword Recognition ¶
Keywords are recognized case-insensitively and mapped to specific token types:
- DML: SELECT, INSERT, UPDATE, DELETE, MERGE
- DDL: CREATE, ALTER, DROP, TRUNCATE
- Joins: JOIN, LEFT JOIN, INNER JOIN, CROSS JOIN, etc.
- CTEs: WITH, RECURSIVE, UNION, EXCEPT, INTERSECT
- Grouping: GROUP BY, ROLLUP, CUBE, GROUPING SETS
- Window: OVER, PARTITION BY, ROWS, RANGE, etc.
- PostgreSQL: DISTINCT ON, FILTER, RETURNING, LATERAL
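Case-insensitive recognition typically folds the candidate word to a canonical case before a table lookup. The sketch below shows the idea with a small illustrative subset of the categories above (`keywordKind` and `classify` are assumptions, not the package's keyword tables):

```go
package main

import (
	"fmt"
	"strings"
)

// keywordKind maps canonical upper-case spellings to a category.
var keywordKind = map[string]string{
	"SELECT": "DML", "INSERT": "DML", "UPDATE": "DML", "DELETE": "DML",
	"CREATE": "DDL", "ALTER": "DDL", "DROP": "DDL",
	"JOIN": "join", "WITH": "cte", "OVER": "window",
}

// classify reports the keyword category, or "" for plain identifiers.
func classify(word string) string {
	return keywordKind[strings.ToUpper(word)] // fold case, then look up
}

func main() {
	fmt.Println(classify("select")) // DML
	fmt.Println(classify("Drop"))   // DDL
	fmt.Println(classify("users")) // "" (identifier, not a keyword)
}
```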
Memory Management ¶
The tokenizer uses several strategies for memory efficiency:
- Tokenizer pooling: Reuse instances with sync.Pool
- Buffer pooling: Reuse byte buffers for string operations
- Zero-copy: Operate on input slices without allocation
- Slice reuse: Preserve capacity when resetting state
- Metrics tracking: Monitor pool efficiency and memory usage
Integration with Parser ¶
Typical integration pattern with the parser:
// Get tokenizer from pool
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
// Tokenize SQL
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return nil, err
}
// Parse tokens to AST
ast, err := parser.Parse(tokens)
if err != nil {
return nil, err
}
Production Deployment ¶
Best practices for production use:
- Always use GetTokenizer/PutTokenizer for pooling efficiency
- Use defer to ensure PutTokenizer is called even on errors
- Monitor metrics for pool hit rates and performance
- Configure DoS limits (MaxInputSize, MaxTokens) for your use case
- Use TokenizeContext for long-running operations
- Test with your actual SQL workload for realistic validation
Metrics and Monitoring ¶
The tokenizer integrates with pkg/metrics for observability:
- Tokenization duration and throughput
- Pool get/put operations and hit rates
- Error counts by category
- Input size and token count distributions
Access metrics via the metrics package for production monitoring.
Validation Status ¶
Production-ready with comprehensive validation:
- Race detection: Zero race conditions (20,000+ concurrent operations tested)
- Performance: 8M+ tokens/sec sustained throughput
- Unicode: Full international support (8 languages validated)
- Reliability: 95%+ success rate on real-world SQL queries
- Memory: Zero leaks detected under extended load testing
Examples ¶
See the tokenizer_test.go file for comprehensive examples including:
- Basic tokenization
- Unicode handling
- PostgreSQL operators
- Error cases
- Performance benchmarks
- Pool usage patterns
Index ¶
- Constants
- func PutTokenizer(t *Tokenizer)
- type BufferPool
- func NewBufferPool() *BufferPool
- func (p *BufferPool) Get() []byte
- func (p *BufferPool) Grow(buf []byte, n int) []byte
- func (p *BufferPool) Put(buf []byte)
- type Error
- func ErrorInvalidIdentifier(value string, location models.Location) *Error
- func ErrorInvalidNumber(value string, location models.Location) *Error
- func ErrorInvalidOperator(value string, location models.Location) *Error
- func ErrorUnexpectedChar(ch byte, location models.Location) *Error
- func ErrorUnterminatedString(location models.Location) *Error
- func NewError(message string, location models.Location) *Error
- type Position
- func NewPosition
- func (*Position) AdvanceN
- func (*Position) AdvanceRune
- func (Position) Clone
- func (Position) Location
- type StringLiteralReader
- func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader
- func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)
- type Tokenizer
- func GetTokenizer() *Tokenizer
- func New
- func NewWithDialect(dialect keywords.SQLDialect) (*Tokenizer, error)
- func NewWithKeywords
- func (t *Tokenizer) Dialect() keywords.SQLDialect
- func (t *Tokenizer) Reset()
- func (t *Tokenizer) SetDialect(dialect keywords.SQLDialect)
- func (t *Tokenizer) SetLogger(logger *slog.Logger)
- func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)
- func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)
Constants ¶
const (
	// MaxInputSize is the maximum allowed input size in bytes (10MB default).
	//
	// This limit prevents denial-of-service (DoS) attacks via extremely large
	// SQL queries that could exhaust server memory. Queries exceeding this size
	// will return an InputTooLargeError.
	//
	// Rationale:
	//   - 10MB is sufficient for complex SQL queries with large IN clauses
	//   - Protects against malicious or accidental memory exhaustion
	//   - Can be increased if needed for legitimate large queries
	//
	// If your application requires larger queries, consider:
	//   - Breaking queries into smaller batches
	//   - Using prepared statements with parameter binding
	//   - Increasing the limit (but ensure adequate memory protection)
	MaxInputSize = 10 * 1024 * 1024 // 10MB

	// MaxTokens is the maximum number of tokens allowed in a single SQL query
	// (1M tokens default).
	//
	// This limit prevents denial-of-service (DoS) attacks via "token explosion"
	// where maliciously crafted or accidentally generated SQL creates an excessive
	// number of tokens, exhausting CPU and memory.
	//
	// Rationale:
	//   - 1M tokens is far beyond any reasonable SQL query size
	//   - Typical queries have 10-1000 tokens
	//   - Complex queries rarely exceed 10,000 tokens
	//   - Protects against pathological cases and attacks
	//
	// Example token counts:
	//   - Simple SELECT: ~10-50 tokens
	//   - Complex query with joins: ~100-500 tokens
	//   - Large IN clause with 1000 values: ~3000-4000 tokens
	//
	// If this limit is hit on a legitimate query, the query should likely
	// be redesigned for better performance and maintainability.
	MaxTokens = 1000000 // 1M tokens
)
Variables ¶
This section is empty.
Functions ¶
func PutTokenizer ¶
func PutTokenizer(t *Tokenizer)
PutTokenizer returns a Tokenizer instance to the pool for reuse.
This must be called after you're done with a Tokenizer obtained from GetTokenizer() to enable instance reuse and prevent memory leaks.
The tokenizer is automatically reset before being returned to the pool, clearing all state including input references, positions, and debug loggers.
Thread Safety: Safe for concurrent calls from multiple goroutines.
Best Practice: Always use with defer immediately after GetTokenizer():
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz) // MANDATORY
Nil Safety: Safely ignores nil tokenizers (no-op).
Metrics: Records pool put operations for monitoring pool efficiency.
State Reset:
- Input reference cleared (enables GC of SQL bytes)
- Position tracking reset to initial state
- Line tracking cleared but capacity preserved
- Debug logger cleared
- Keywords preserved (immutable configuration)
Types ¶
type BufferPool ¶
type BufferPool struct {
// contains filtered or unexported fields
}
BufferPool manages a pool of reusable byte buffers for token content.
This pool is used for temporary byte slice operations during tokenization, such as accumulating identifier characters or building string literal content. It complements the bytes.Buffer pool used elsewhere in the tokenizer.
The pool is designed for byte slices rather than bytes.Buffer for cases where direct slice manipulation is more efficient than buffer operations.
Thread Safety: Safe for concurrent use across multiple goroutines.
Initial Capacity: Buffers are pre-allocated with 128 bytes capacity, suitable for most SQL tokens (identifiers, keywords, short string literals).
func NewBufferPool ¶
func NewBufferPool() *BufferPool
NewBufferPool creates a new buffer pool with optimized initial capacity.
The pool pre-allocates byte slices with 128-byte capacity, which is sufficient for most SQL tokens without excessive memory waste.
Returns a BufferPool ready for use with Get/Put operations.
Example:
pool := NewBufferPool()
buf := pool.Get()
defer pool.Put(buf)
// Use buf for byte operations...
func (*BufferPool) Get ¶
func (p *BufferPool) Get() []byte
Get retrieves a buffer from the pool.
The returned buffer has zero length but may have capacity >= 128 bytes from previous use. This allows efficient appending without reallocation for typical SQL tokens.
Thread Safety: Safe for concurrent calls.
The buffer must be returned to the pool via Put() when done to enable reuse.
Returns a byte slice ready for use (length 0, capacity >= 128).
func (*BufferPool) Grow ¶
func (p *BufferPool) Grow(buf []byte, n int) []byte
Grow ensures the buffer has enough capacity for n additional bytes.
If the buffer doesn't have enough spare capacity, a new larger buffer is allocated with doubled capacity plus n bytes. The old buffer is returned to the pool.
Growth Strategy: New capacity = 2 * old capacity + n This exponential growth with a minimum increment minimizes reallocations while preventing excessive memory waste.
Parameters:
- buf: The current buffer
- n: Number of additional bytes needed
Returns:
- The original buffer if it has sufficient capacity
- A new, larger buffer with contents copied if reallocation was needed
Example:
buf := pool.Get()
buf = pool.Grow(buf, 256)  // Ensure 256 bytes available
buf = append(buf, data...) // Append without reallocation
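The documented growth strategy (new capacity = 2 × old capacity + n) can be sketched as a standalone function. Assumptions: the `grow` helper below illustrates only the capacity arithmetic; the real Grow also returns the old buffer to the pool, which this sketch omits.

```go
package main

import "fmt"

// grow ensures buf can hold n more bytes, reallocating with capacity
// 2*cap(buf)+n and copying the contents when spare capacity is short.
func grow(buf []byte, n int) []byte {
	if cap(buf)-len(buf) >= n {
		return buf // enough spare capacity already
	}
	bigger := make([]byte, len(buf), 2*cap(buf)+n)
	copy(bigger, buf)
	return bigger
}

func main() {
	buf := make([]byte, 0, 8)
	buf = append(buf, "SELECT"...)
	buf = grow(buf, 100)            // spare capacity (2) < 100: reallocate
	fmt.Println(len(buf), cap(buf)) // 6 116
}
```

The exponential doubling keeps the amortized cost of repeated appends constant, while the +n term guarantees a single call always produces enough room.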
func (*BufferPool) Put ¶
func (p *BufferPool) Put(buf []byte)
Put returns a buffer to the pool for reuse.
The buffer's capacity is preserved, allowing it to be reused for similarly-sized operations without reallocation. Buffers with zero capacity are discarded.
Thread Safety: Safe for concurrent calls.
It's safe to call Put multiple times with the same buffer, though only the first call will be effective (subsequent calls operate on a reset buffer).
Parameters:
- buf: The byte slice to return to the pool
type Error ¶
type Error struct {
Message string // Human-readable error message
Location models.Location // Position where the error occurred (1-based)
}
Error represents a tokenization error with precise location information.
This type provides structured error reporting with line and column positions, making it easy for users to identify and fix SQL syntax issues.
Note: Modern code should use the errors from pkg/errors package instead, which provide more comprehensive error categorization and context. This type is maintained for backward compatibility.
Example:
if err != nil {
if tokErr, ok := err.(*tokenizer.Error); ok {
fmt.Printf("Tokenization failed at line %d, column %d: %s\n",
tokErr.Location.Line, tokErr.Location.Column, tokErr.Message)
}
}
func ErrorInvalidIdentifier ¶
ErrorInvalidIdentifier creates an error for an invalid identifier.
This is used when an identifier has invalid syntax, such as:
- Starting with a digit (when not quoted)
- Containing invalid characters
- Unterminated quoted identifier
Parameters:
- value: The invalid identifier string
- location: Position where the identifier started
Returns an Error describing the invalid identifier.
Example: "invalid identifier: 123abc at line 2, column 8"
func ErrorInvalidNumber ¶
ErrorInvalidNumber creates an error for an invalid number format.
This is used when a number token has invalid syntax, such as:
- Decimal point without digits: "123."
- Exponent without digits: "123e"
- Multiple decimal points: "12.34.56"
Parameters:
- value: The invalid number string
- location: Position where the number started
Returns an Error describing the invalid number format.
Example: "invalid number format: 123.e at line 1, column 10"
func ErrorInvalidOperator ¶
ErrorInvalidOperator creates an error for an invalid operator.
This is used when an operator token has invalid syntax, such as:
- Incomplete multi-character operators
- Invalid operator combinations
Parameters:
- value: The invalid operator string
- location: Position where the operator started
Returns an Error describing the invalid operator.
Example: "invalid operator: <=> at line 1, column 20"
func ErrorUnexpectedChar ¶
ErrorUnexpectedChar creates an error for an unexpected character.
This is used when the tokenizer encounters a character that cannot start any valid token in the current context.
Parameters:
- ch: The unexpected character (byte)
- location: Position where the character was found
Returns an Error describing the unexpected character.
Example: "unexpected character: @ at line 2, column 5"
func ErrorUnterminatedString ¶
ErrorUnterminatedString creates an error for an unterminated string literal.
This occurs when a string literal (single or double quoted) is not properly closed before the end of the line or input.
Parameters:
- location: Position where the string started
Returns an Error indicating the string was not terminated.
Example: "unterminated string literal at line 3, column 15"
type Position ¶
type Position struct {
Line int // Current line number (1-based)
Index int // Current byte offset into input (0-based)
Column int // Current column number (1-based)
LastNL int // Byte offset of last newline (for efficient column calculation)
}
Position tracks the scanning cursor position during tokenization. It maintains both absolute byte offset and human-readable line/column coordinates for precise error reporting and token span tracking.
Coordinate System:
- Line: 1-based (first line is line 1)
- Column: 1-based (first column is column 1)
- Index: 0-based byte offset into input (first byte is index 0)
- LastNL: Byte offset of most recent newline (for column calculation)
Zero-Copy Design: Position operates on byte indices rather than rune indices for performance. UTF-8 decoding happens only when needed during character scanning.
Thread Safety: Position is not thread-safe. Each Tokenizer instance should have its own Position that is not shared across goroutines.
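The line-start bookkeeping described above enables cheap offset-to-coordinate conversion: binary-search for the last line start at or before the byte index, and the column is the distance from that start plus one. The `locate` helper below is an illustrative byte-based sketch of that idea (real column calculation must also account for multi-byte runes, as noted above):

```go
package main

import (
	"fmt"
	"sort"
)

// locate converts a 0-based byte index into 1-based line/column using the
// byte offsets at which each line begins.
func locate(lineStarts []int, index int) (line, col int) {
	// Find the first line start after index; the line containing
	// index is the one just before it.
	i := sort.SearchInts(lineStarts, index+1) - 1
	return i + 1, index - lineStarts[i] + 1
}

func main() {
	// "SELECT *\nFROM users": line 2 begins at byte offset 9.
	lineStarts := []int{0, 9}
	line, col := locate(lineStarts, 9) // the 'F' of FROM
	fmt.Println(line, col)             // 2 1
}
```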
func NewPosition ¶
NewPosition creates a new Position with the specified line and byte index. The column is initialized to 1 (first column).
Parameters:
- line: Line number (1-based, typically starts at 1)
- index: Byte offset into input (0-based, typically starts at 0)
Returns a Position ready for use in tokenization.
func (*Position) AdvanceN ¶
AdvanceN moves the position forward by n bytes and recalculates the line and column numbers using the provided line start indices.
This is used when jumping forward in the input (e.g., after skipping a comment block) where individual rune tracking would be inefficient.
Parameters:
- n: Number of bytes to advance
- lineStarts: Slice of byte offsets where each line starts (from tokenizer)
Performance: O(L) where L is the number of lines in lineStarts. For typical SQL queries with few lines, this is effectively O(1).
If n <= 0, this is a no-op.
func (*Position) AdvanceRune ¶
AdvanceRune moves the position forward by one UTF-8 rune. This updates the byte index, line number, and column number appropriately.
Newline Handling: When r is '\n', the line number increments and the column resets to 1.
Parameters:
- r: The rune being consumed (used to detect newlines)
- size: The byte size of the rune in UTF-8 encoding
Performance: O(1) operation, no string allocations.
Example:
r, size := utf8.DecodeRune(input[pos.Index:])
pos.AdvanceRune(r, size) // Move past this rune
func (Position) Clone ¶
Clone creates a copy of this Position. The returned Position is independent and can be modified without affecting the original.
This is useful when you need to save a position (e.g., for backtracking during compound keyword parsing) and then potentially restore it.
Returns a new Position with identical values.
func (Position) Location ¶
Location converts this Position to a models.Location using the tokenizer's line tracking information for accurate column calculation.
This method uses the tokenizer's lineStarts slice to calculate the exact column position, accounting for variable-width UTF-8 characters and tabs.
Returns a models.Location with 1-based line and column numbers.
type StringLiteralReader ¶
type StringLiteralReader struct {
// contains filtered or unexported fields
}
StringLiteralReader handles reading of string literals with proper escape sequence handling.
func NewStringLiteralReader ¶
func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader
NewStringLiteralReader creates a new StringLiteralReader.
func (*StringLiteralReader) ReadStringLiteral ¶
func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)
ReadStringLiteral reads a string literal with proper escape sequence handling.
type Tokenizer ¶
type Tokenizer struct {
Comments []models.Comment // Comments captured during tokenization
// contains filtered or unexported fields
}
Tokenizer provides high-performance SQL tokenization with zero-copy operations. It converts raw SQL bytes into a stream of tokens with precise position tracking.
Features:
- Zero-copy operations on input byte slices (no string allocations)
- Precise line/column tracking for error reporting (1-based indexing)
- Unicode support for international SQL queries
- PostgreSQL operator support (JSON, array, text search operators)
- DoS protection with input size and token count limits
Thread Safety:
- Individual instances are NOT safe for concurrent use
- Use GetTokenizer/PutTokenizer for safe pooling across goroutines
- Each goroutine should use its own Tokenizer instance
Memory Management:
- Reuses internal buffers to minimize allocations
- Preserves slice capacity across Reset() calls
- Integrates with sync.Pool for instance reuse
Usage:
// With pooling (recommended for production)
tkz := GetTokenizer()
defer PutTokenizer(tkz)
tokens, err := tkz.Tokenize([]byte(sql))

// Without pooling (simple usage)
tkz, _ := New()
tokens, err := tkz.Tokenize([]byte(sql))
func GetTokenizer ¶
func GetTokenizer() *Tokenizer
GetTokenizer retrieves a Tokenizer instance from the pool.
This is the recommended way to obtain a Tokenizer for production use. The returned tokenizer is reset and ready for use.
Thread Safety: Safe for concurrent calls from multiple goroutines. Each call returns a separate instance.
Memory Management: Always pair with PutTokenizer() using defer to ensure the instance is returned to the pool, even if errors occur.
Metrics: Records pool get operations for monitoring pool efficiency.
Example:
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz) // MANDATORY - ensures pool return
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return err // defer ensures PutTokenizer is called
}
// Process tokens...
Performance: 95%+ hit rate means most calls reuse existing instances rather than allocating new ones, providing significant performance benefits.
func New ¶
New creates a new Tokenizer with default configuration and keyword support. The returned tokenizer is ready to use for tokenizing SQL statements.
For production use, prefer GetTokenizer() which uses object pooling for better performance and reduced allocations.
Returns an error only if keyword initialization fails (extremely rare).
Example:
tkz, err := tokenizer.New()
if err != nil {
return err
}
tokens, err := tkz.Tokenize([]byte("SELECT * FROM users"))
func NewWithDialect ¶ added in v1.9.3
func NewWithDialect(dialect keywords.SQLDialect) (*Tokenizer, error)
NewWithDialect creates a new Tokenizer configured for the given SQL dialect. Dialect-specific keywords are recognized based on the dialect parameter. If dialect is empty or unknown, defaults to DialectPostgreSQL.
func NewWithKeywords ¶
NewWithKeywords initializes a Tokenizer with a custom keyword classifier. This allows you to customize keyword recognition for specific SQL dialects or to add custom keywords.
The keywords parameter must not be nil.
Returns an error if keywords is nil.
Example:
kw := keywords.NewKeywords()
// Customize keywords as needed...
tkz, err := tokenizer.NewWithKeywords(kw)
if err != nil {
return err
}
func (*Tokenizer) Dialect ¶ added in v1.9.3
func (t *Tokenizer) Dialect() keywords.SQLDialect
Dialect returns the SQL dialect configured for this tokenizer.
func (*Tokenizer) Reset ¶
func (t *Tokenizer) Reset()
Reset clears a Tokenizer's state for reuse while preserving allocated memory.
This method is called automatically by PutTokenizer() and generally should not be called directly by users. It's exposed for advanced use cases where you want to reuse a tokenizer instance without going through the pool.
Memory Optimization:
- Clears input reference (allows GC of SQL bytes)
- Resets position tracking to initial values
- Preserves lineStarts slice capacity (avoids reallocation)
- Clears debug logger reference
State After Reset:
- pos: Line 1, Column 0, Index 0
- lineStarts: Reset to the initial [0] entry with capacity preserved
- input: nil (ready for new input)
- keywords: Preserved (immutable, no need to reset)
- logger: nil (must be set again if needed)
Performance: By preserving slice capacity, subsequent Tokenize() calls avoid reallocation of lineStarts for similarly-sized inputs.
func (*Tokenizer) SetDialect ¶ added in v1.9.3
func (t *Tokenizer) SetDialect(dialect keywords.SQLDialect)
SetDialect reconfigures the tokenizer for a different SQL dialect. This rebuilds the keyword set to include dialect-specific keywords.
func (*Tokenizer) SetLogger ¶ added in v1.9.3
SetLogger configures a structured logger for verbose tracing during tokenization. The logger receives slog.Debug messages for each token produced, which is useful for diagnosing tokenization issues or understanding token stream structure.
Pass nil to disable debug logging (the default).
Logging is guarded by slog.LevelDebug checks so there is no performance cost when the handler's minimum level is above Debug.
Example:
tkz := tokenizer.GetTokenizer()
tkz.SetLogger(slog.Default())
tokens, _ := tkz.Tokenize([]byte(sql))
To disable:
tkz.SetLogger(nil)
Thread Safety: The logger may be called from multiple goroutines if tokenizers are used concurrently. *slog.Logger is safe for concurrent use.
func (*Tokenizer) Tokenize ¶
func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)
Tokenize converts raw SQL bytes into a slice of tokens with position information.
This is the main entry point for tokenization. It performs zero-copy tokenization directly on the input byte slice and returns tokens with precise start/end positions.
Performance: 8M+ tokens/sec sustained throughput with zero-copy operations.
DoS Protection:
- Input size limited to MaxInputSize (10MB default)
- Token count limited to MaxTokens (1M default)
- Returns errors if limits exceeded
Position Tracking:
- All positions are 1-based (first line is 1, first column is 1)
- Start position is inclusive, end position is exclusive
- Position information preserved for all tokens including EOF
Error Handling:
- Returns structured errors with precise position information
- Common errors: UnterminatedStringError, UnexpectedCharError, InvalidNumberError
- Errors include line/column location and context
Parameters:
- input: Raw SQL bytes to tokenize (not modified, zero-copy reference)
Returns:
- []models.TokenWithSpan: Slice of tokens with position spans (includes EOF token)
- error: Tokenization error with position information, or nil on success
Example:
tkz := GetTokenizer()
defer PutTokenizer(tkz)
sql := "SELECT id, name FROM users WHERE active = true"
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
	// Guard the assertion: err may not be an errors.TokenizerError
	if tokErr, ok := err.(errors.TokenizerError); ok {
		log.Printf("Tokenization error at line %d: %v",
			tokErr.Location.Line, err)
	}
	return err
}
for _, tok := range tokens {
fmt.Printf("Token: %s (type: %v) at %d:%d\n",
tok.Token.Value, tok.Token.Type,
tok.Start.Line, tok.Start.Column)
}
PostgreSQL Operators (v1.6.0):
sql := "SELECT data->'field' FROM table WHERE config @> '{\"key\":\"value\"}'"
tokens, _ := tkz.Tokenize([]byte(sql))
// Produces tokens for: -> (JSON field access), @> (JSONB contains)
Unicode Support:
sql := "SELECT 名前 FROM ユーザー WHERE 'こんにちは'"
tokens, _ := tkz.Tokenize([]byte(sql))
// Correctly tokenizes Unicode identifiers and string literals
func (*Tokenizer) TokenizeContext ¶ added in v1.5.0
func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)
TokenizeContext processes the input and returns tokens with context support for cancellation. It checks the context at regular intervals (every 100 tokens) to enable fast cancellation. Returns context.Canceled or context.DeadlineExceeded when the context is cancelled.
This method is useful for:
- Long-running tokenization operations that need to be cancellable
- Implementing timeouts for tokenization
- Graceful shutdown scenarios
Example:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.TokenizeContext(ctx, []byte(sql))
if err == context.DeadlineExceeded {
// Handle timeout
}