Documentation ¶
Overview ¶
Package tokenizer provides high-performance SQL tokenization with zero-copy operations and comprehensive Unicode support.
The primary entry points are Tokenize (convert raw SQL bytes to []models.TokenWithSpan), GetTokenizer and PutTokenizer (pool-based instance management for optimal memory efficiency), and TokenizeContext (tokenization with context cancellation support). The tokenizer operates directly on input byte slices without allocating intermediate strings, achieving 8M+ tokens/sec throughput with full UTF-8 support.
The tokenizer package converts raw SQL text into a stream of tokens (lexical analysis) with precise position tracking for error reporting. It is designed for production use with enterprise-grade performance, thread safety, and memory efficiency.
Architecture ¶
The tokenizer uses a zero-copy design that operates directly on input byte slices without string allocations, achieving 8M+ tokens/sec throughput. It includes:
- Zero-copy byte slice operations for minimal memory allocations
- Object pooling (GetTokenizer/PutTokenizer) for instance reuse
- Buffer pooling for internal string operations
- Position tracking (line/column) for precise error reporting
- Unicode support for international SQL queries
- DoS protection with input size and token count limits
Performance Characteristics ¶
The tokenizer is production-validated with the following characteristics:
- Throughput: 8M+ tokens/sec sustained
- Memory: Zero-copy operations minimize allocations
- Thread Safety: Race-free with sync.Pool for object reuse
- Latency: Sub-microsecond per token on average
- Pool Hit Rate: 95%+ in production workloads
Thread Safety ¶
The tokenizer is thread-safe when using the pooling API:
- GetTokenizer() and PutTokenizer() are safe for concurrent use
- Individual Tokenizer instances are NOT safe for concurrent use
- Always use one Tokenizer instance per goroutine
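The per-goroutine rule can be sketched with a simplified stand-in. The `scanner` type, `pool` variable, and `countBytes` helper below are illustrative stand-ins for the package's Tokenizer and GetTokenizer/PutTokenizer, not its actual code: the pool itself is shared safely, while each goroutine works on its own instance.

```go
package main

import (
	"fmt"
	"sync"
)

// scanner is a simplified stand-in for a Tokenizer: it holds per-call
// state, so a single instance must never be shared across goroutines.
type scanner struct {
	input []byte
}

func (s *scanner) reset() { s.input = nil }

// pool mirrors the GetTokenizer/PutTokenizer pattern: the pool is safe
// for concurrent use; each Get hands a goroutine its own instance.
var pool = sync.Pool{New: func() any { return new(scanner) }}

func countBytes(sql string) int {
	s := pool.Get().(*scanner) // one instance per goroutine
	defer func() {
		s.reset() // clear state before returning to the pool
		pool.Put(s)
	}()
	s.input = []byte(sql)
	return len(s.input)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { // each goroutine gets and returns its own scanner
			defer wg.Done()
			_ = countBytes("SELECT 1")
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```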
Token Types ¶
The tokenizer produces tokens for all SQL elements:
- Keywords: SELECT, FROM, WHERE, JOIN, etc.
- Identifiers: table names, column names, aliases
- Literals: strings ('text'), numbers (123, 45.67, 1e10)
- Operators: =, <>, +, -, *, /, ||, etc.
- Punctuation: ( ) [ ] , ; .
- PostgreSQL JSON operators: ->, ->>, #>, #>>, @>, <@, ?, ?|, ?&, #-
- Comments: -- line comments and /* block comments */
PostgreSQL Extensions (v1.6.0) ¶
The tokenizer supports PostgreSQL-specific operators:
- JSON/JSONB operators: -> ->> #> #>> @> <@ ? ?| ?& #-
- Array operators: && (overlap)
- Text search: @@ (full text search)
- Cast operator: :: (double colon)
- Parameters: @variable (SQL Server style)
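The operator list above implies a longest-match-first scan, so that ->> is never mis-read as -> followed by >. The sketch below illustrates that idea under the assumption of a simple prefix table ordered longest-first; `pgOps` and `matchOp` are illustrative, not the package's actual matching code.

```go
package main

import (
	"fmt"
	"strings"
)

// pgOps lists multi-character PostgreSQL operators longest-first so a
// greedy scan matches "->>" before "->" and "#>>" before "#>".
var pgOps = []string{"->>", "#>>", "->", "#>", "#-", "@>", "<@", "?|", "?&", "@@", "&&", "::", "?"}

// matchOp returns the operator at the start of s, or "" if none matches.
func matchOp(s string) string {
	for _, op := range pgOps { // longest candidates first
		if strings.HasPrefix(s, op) {
			return op
		}
	}
	return ""
}

func main() {
	fmt.Println(matchOp("->>'name'")) // ->>
	fmt.Println(matchOp("@> '{}'"))   // @>
}
```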
Unicode Support ¶
Full Unicode support for international SQL processing:
- UTF-8 decoding with proper rune handling
- Unicode quotes: “ ” ‘ ’ « » (normalized to ASCII)
- Unicode identifiers: letters, digits, combining marks
- Multi-byte character support in strings and identifiers
- Proper line/column tracking across Unicode boundaries
DoS Protection ¶
Built-in protection against denial-of-service attacks:
- MaxInputSize: 10MB input limit (configurable)
- MaxTokens: 1M token limit per query (configurable)
- Context support: TokenizeContext() for cancellation
- Panic recovery: Structured errors on unexpected panics
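The size guard works by rejecting oversized input before any tokenization begins. The following standalone sketch mirrors the MaxInputSize check (the `checkInput` helper and `errInputTooLarge` sentinel are assumptions for illustration, not the package's own error values):

```go
package main

import (
	"errors"
	"fmt"
)

// maxInputSize mirrors the package's MaxInputSize constant (10MB default).
const maxInputSize = 10 * 1024 * 1024

var errInputTooLarge = errors.New("input too large")

// checkInput rejects oversized input up front, the same guard Tokenize
// applies to prevent memory exhaustion from huge queries.
func checkInput(input []byte) error {
	if len(input) > maxInputSize {
		return errInputTooLarge
	}
	return nil
}

func main() {
	fmt.Println(checkInput([]byte("SELECT 1")))           // <nil>
	fmt.Println(checkInput(make([]byte, maxInputSize+1))) // input too large
}
```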
Object Pooling ¶
Use GetTokenizer/PutTokenizer for optimal performance:
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz) // MANDATORY - returns to pool
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return err
}
// Use tokens...
Benefits:
- 60-80% reduction in allocations
- 95%+ pool hit rate in production
- Automatic state reset on return to pool
Basic Usage ¶
Simple tokenization without pooling:
tkz, err := tokenizer.New()
if err != nil {
return err
}
sql := "SELECT id, name FROM users WHERE active = true"
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return err
}
for _, tok := range tokens {
fmt.Printf("Token: %s (type: %v) at line %d, col %d\n",
tok.Token.Value, tok.Token.Type, tok.Start.Line, tok.Start.Column)
}
Advanced Usage with Context ¶
Tokenization with timeout and cancellation:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.TokenizeContext(ctx, []byte(sql))
if err == context.DeadlineExceeded {
log.Printf("Tokenization timed out")
return err
}
The context is checked every 100 tokens for responsive cancellation.
Error Handling ¶
The tokenizer returns structured errors with position information:
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
// Errors include line/column information
// Common errors: UnterminatedStringError, UnexpectedCharError,
// InvalidNumberError, InputTooLargeError, TokenLimitReachedError
log.Printf("Tokenization error: %v", err)
return err
}
Position Tracking ¶
Every token includes precise start/end positions:
for _, tokWithSpan := range tokens {
fmt.Printf("Token '%s' at line %d, column %d-%d\n",
tokWithSpan.Token.Value,
tokWithSpan.Start.Line,
tokWithSpan.Start.Column,
tokWithSpan.End.Column)
}
Position information is 1-based (first line is 1, first column is 1).
String Literals ¶
The tokenizer handles various string literal formats:
- Single quotes: 'text', 'can''t' (doubled quotes for escaping)
- Double quotes: "identifier" (SQL identifiers, not strings)
- Backticks: `identifier` (MySQL-style identifiers)
- Triple quotes: '''multiline''' """multiline"""
- Unicode quotes: ‘text’ “text” «text» (normalized)
- Escape sequences: \n \r \t \\ \' \" \uXXXX
Number Formats ¶
Supported number formats:
- Integers: 123, 0, 999999
- Decimals: 3.14, 0.5, 999.999
- Scientific: 1e10, 2.5e-3, 1.23E+4
Comments ¶
Comments are automatically skipped during tokenization:
- Line comments: -- comment text (until newline)
- Block comments: /* comment text */ (can span multiple lines)
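The two comment forms can be skipped with a simple prefix scan: a -- comment runs to the next newline, a /* comment to the matching */. The sketch below illustrates those rules (it is an assumption-laden illustration, not the package's code; nested block comments, for instance, are not handled):

```go
package main

import (
	"fmt"
	"strings"
)

// skipComment returns the number of bytes of comment at the start of s,
// or 0 if s does not begin a comment.
func skipComment(s string) int {
	switch {
	case strings.HasPrefix(s, "--"):
		if i := strings.IndexByte(s, '\n'); i >= 0 {
			return i + 1 // consume through the newline
		}
		return len(s) // comment runs to end of input
	case strings.HasPrefix(s, "/*"):
		if i := strings.Index(s, "*/"); i >= 0 {
			return i + 2 // consume through the closing delimiter
		}
		return len(s) // unterminated; a real tokenizer would error here
	}
	return 0
}

func main() {
	fmt.Println(skipComment("-- note\nSELECT 1")) // 8
	fmt.Println(skipComment("/* x */ SELECT 1")) // 7
}
```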
Identifiers ¶
Identifiers follow SQL standards with extensions:
- Unquoted: letters, digits, underscore (cannot start with digit)
- Quoted: "Any Text" (allows spaces, special chars, keywords)
- Backticked: `Any Text` (MySQL compatibility)
- Unicode: Full Unicode letter and digit support
- Compound keywords: GROUP BY, ORDER BY, LEFT JOIN, etc.
Keyword Recognition ¶
Keywords are recognized case-insensitively and mapped to specific token types:
- DML: SELECT, INSERT, UPDATE, DELETE, MERGE
- DDL: CREATE, ALTER, DROP, TRUNCATE
- Joins: JOIN, LEFT JOIN, INNER JOIN, CROSS JOIN, etc.
- CTEs: WITH, RECURSIVE, UNION, EXCEPT, INTERSECT
- Grouping: GROUP BY, ROLLUP, CUBE, GROUPING SETS
- Window: OVER, PARTITION BY, ROWS, RANGE, etc.
- PostgreSQL: DISTINCT ON, FILTER, RETURNING, LATERAL
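Case-insensitive recognition typically folds the candidate word to a canonical case before a table lookup. The sketch below shows the idea with a small illustrative subset of the categories above (`keywordKind` and `classify` are assumptions, not the package's keyword tables):

```go
package main

import (
	"fmt"
	"strings"
)

// keywordKind maps canonical upper-case spellings to a category.
var keywordKind = map[string]string{
	"SELECT": "DML", "INSERT": "DML", "UPDATE": "DML", "DELETE": "DML",
	"CREATE": "DDL", "ALTER": "DDL", "DROP": "DDL",
	"JOIN": "join", "WITH": "cte", "OVER": "window",
}

// classify reports the keyword category, or "" for plain identifiers.
func classify(word string) string {
	return keywordKind[strings.ToUpper(word)] // fold case, then look up
}

func main() {
	fmt.Println(classify("select")) // DML
	fmt.Println(classify("Drop"))   // DDL
	fmt.Println(classify("users")) // "" (identifier, not a keyword)
}
```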
Memory Management ¶
The tokenizer uses several strategies for memory efficiency:
- Tokenizer pooling: Reuse instances with sync.Pool
- Buffer pooling: Reuse byte buffers for string operations
- Zero-copy: Operate on input slices without allocation
- Slice reuse: Preserve capacity when resetting state
- Metrics tracking: Monitor pool efficiency and memory usage
Integration with Parser ¶
Typical integration pattern with the parser:
// Get tokenizer from pool
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
// Tokenize SQL
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return nil, err
}
// Parse tokens to AST
ast, err := parser.Parse(tokens)
if err != nil {
return nil, err
}
Production Deployment ¶
Best practices for production use:
- Always use GetTokenizer/PutTokenizer for pooling efficiency
- Use defer to ensure PutTokenizer is called even on errors
- Monitor metrics for pool hit rates and performance
- Configure DoS limits (MaxInputSize, MaxTokens) for your use case
- Use TokenizeContext for long-running operations
- Test with your actual SQL workload for realistic validation
Metrics and Monitoring ¶
The tokenizer integrates with pkg/metrics for observability:
- Tokenization duration and throughput
- Pool get/put operations and hit rates
- Error counts by category
- Input size and token count distributions
Access metrics via the metrics package for production monitoring.
Validation Status ¶
Production-ready with comprehensive validation:
- Race detection: Zero race conditions (20,000+ concurrent operations tested)
- Performance: 8M+ tokens/sec sustained throughput
- Unicode: Full international support (8 languages validated)
- Reliability: 95%+ success rate on real-world SQL queries
- Memory: Zero leaks detected under extended load testing
Examples ¶
See the tokenizer_test.go file for comprehensive examples including:
- Basic tokenization
- Unicode handling
- PostgreSQL operators
- Error cases
- Performance benchmarks
- Pool usage patterns
Index ¶
- Constants
- func PutTokenizer(t *Tokenizer)
- type BufferPool
- func NewBufferPool() *BufferPool
- func (p *BufferPool) Get() []byte
- func (p *BufferPool) Grow(buf []byte, n int) []byte
- func (p *BufferPool) Put(buf []byte)
- type Error
- func ErrorInvalidIdentifier(value string, location models.Location) *Error
- func ErrorInvalidNumber(value string, location models.Location) *Error
- func ErrorInvalidOperator(value string, location models.Location) *Error
- func ErrorUnexpectedChar(ch byte, location models.Location) *Error
- func ErrorUnterminatedString(location models.Location) *Error
- func NewError(message string, location models.Location) *Error
- type Position
- func NewPosition
- func (*Position) AdvanceN
- func (*Position) AdvanceRune
- func (Position) Clone
- func (Position) Location
- type StringLiteralReader
- func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader
- func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)
- type Tokenizer
- func GetTokenizer() *Tokenizer
- func New
- func NewWithDialect(dialect keywords.SQLDialect) (*Tokenizer, error)
- func NewWithKeywords
- func (t *Tokenizer) Dialect() keywords.SQLDialect
- func (t *Tokenizer) Reset()
- func (t *Tokenizer) SetDialect(dialect keywords.SQLDialect)
- func (t *Tokenizer) SetLogger(logger *slog.Logger)
- func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)
- func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)
Constants ¶
const (
	// MaxInputSize is the maximum allowed input size in bytes (10MB default).
	//
	// This limit prevents denial-of-service (DoS) attacks via extremely large
	// SQL queries that could exhaust server memory. Queries exceeding this size
	// will return an InputTooLargeError.
	//
	// Rationale:
	//   - 10MB is sufficient for complex SQL queries with large IN clauses
	//   - Protects against malicious or accidental memory exhaustion
	//   - Can be increased if needed for legitimate large queries
	//
	// If your application requires larger queries, consider:
	//   - Breaking queries into smaller batches
	//   - Using prepared statements with parameter binding
	//   - Increasing the limit (but ensure adequate memory protection)
	MaxInputSize = 10 * 1024 * 1024 // 10MB

	// MaxTokens is the maximum number of tokens allowed in a single SQL query
	// (1M tokens default).
	//
	// This limit prevents denial-of-service (DoS) attacks via "token explosion"
	// where maliciously crafted or accidentally generated SQL creates an excessive
	// number of tokens, exhausting CPU and memory.
	//
	// Rationale:
	//   - 1M tokens is far beyond any reasonable SQL query size
	//   - Typical queries have 10-1000 tokens
	//   - Complex queries rarely exceed 10,000 tokens
	//   - Protects against pathological cases and attacks
	//
	// Example token counts:
	//   - Simple SELECT: ~10-50 tokens
	//   - Complex query with joins: ~100-500 tokens
	//   - Large IN clause with 1000 values: ~3000-4000 tokens
	//
	// If this limit is hit on a legitimate query, the query should likely
	// be redesigned for better performance and maintainability.
	MaxTokens = 1000000 // 1M tokens
)
Variables ¶
This section is empty.
Functions ¶
func PutTokenizer ¶
func PutTokenizer(t *Tokenizer)
PutTokenizer returns a Tokenizer instance to the pool for reuse.
This must be called after you're done with a Tokenizer obtained from GetTokenizer() to enable instance reuse and prevent memory leaks.
The tokenizer is automatically reset before being returned to the pool, clearing all state including input references, positions, and debug loggers.
Thread Safety: Safe for concurrent calls from multiple goroutines.
Best Practice: Always use with defer immediately after GetTokenizer():
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz) // MANDATORY
Nil Safety: Safely ignores nil tokenizers (no-op).
Metrics: Records pool put operations for monitoring pool efficiency.
State Reset:
- Input reference cleared (enables GC of SQL bytes)
- Position tracking reset to initial state
- Line tracking cleared but capacity preserved
- Debug logger cleared
- Keywords preserved (immutable configuration)
Types ¶
type BufferPool ¶
type BufferPool struct {
// contains filtered or unexported fields
}
BufferPool manages a pool of reusable byte buffers for token content.
This pool is used for temporary byte slice operations during tokenization, such as accumulating identifier characters or building string literal content. It complements the bytes.Buffer pool used elsewhere in the tokenizer.
The pool is designed for byte slices rather than bytes.Buffer for cases where direct slice manipulation is more efficient than buffer operations.
Thread Safety: Safe for concurrent use across multiple goroutines.
Initial Capacity: Buffers are pre-allocated with 128 bytes capacity, suitable for most SQL tokens (identifiers, keywords, short string literals).
func NewBufferPool ¶
func NewBufferPool() *BufferPool
NewBufferPool creates a new buffer pool with optimized initial capacity.
The pool pre-allocates byte slices with 128-byte capacity, which is sufficient for most SQL tokens without excessive memory waste.
Returns a BufferPool ready for use with Get/Put operations.
Example:
pool := NewBufferPool()
buf := pool.Get()
defer pool.Put(buf)
// Use buf for byte operations...
func (*BufferPool) Get ¶
func (p *BufferPool) Get() []byte
Get retrieves a buffer from the pool.
The returned buffer has zero length but may have capacity >= 128 bytes from previous use. This allows efficient appending without reallocation for typical SQL tokens.
Thread Safety: Safe for concurrent calls.
The buffer must be returned to the pool via Put() when done to enable reuse.
Returns a byte slice ready for use (length 0, capacity >= 128).
func (*BufferPool) Grow ¶
func (p *BufferPool) Grow(buf []byte, n int) []byte
Grow ensures the buffer has enough capacity for n additional bytes.
If the buffer doesn't have enough spare capacity, a new larger buffer is allocated with doubled capacity plus n bytes. The old buffer is returned to the pool.
Growth Strategy: New capacity = 2 * old capacity + n This exponential growth with a minimum increment minimizes reallocations while preventing excessive memory waste.
Parameters:
- buf: The current buffer
- n: Number of additional bytes needed
Returns:
- The original buffer if it has sufficient capacity
- A new, larger buffer with contents copied if reallocation was needed
Example:
buf := pool.Get()
buf = pool.Grow(buf, 256)  // Ensure 256 bytes available
buf = append(buf, data...) // Append without reallocation
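The documented growth strategy (new capacity = 2 × old capacity + n) can be sketched as a standalone function. Assumptions: the `grow` helper below illustrates only the capacity arithmetic; the real Grow also returns the old buffer to the pool, which this sketch omits.

```go
package main

import "fmt"

// grow ensures buf can hold n more bytes, reallocating with capacity
// 2*cap(buf)+n and copying the contents when spare capacity is short.
func grow(buf []byte, n int) []byte {
	if cap(buf)-len(buf) >= n {
		return buf // enough spare capacity already
	}
	bigger := make([]byte, len(buf), 2*cap(buf)+n)
	copy(bigger, buf)
	return bigger
}

func main() {
	buf := make([]byte, 0, 8)
	buf = append(buf, "SELECT"...)
	buf = grow(buf, 100)            // spare capacity (2) < 100: reallocate
	fmt.Println(len(buf), cap(buf)) // 6 116
}
```

The exponential doubling keeps the amortized cost of repeated appends constant, while the +n term guarantees a single call always produces enough room.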
func (*BufferPool) Put ¶
func (p *BufferPool) Put(buf []byte)
Put returns a buffer to the pool for reuse.
The buffer's capacity is preserved, allowing it to be reused for similarly-sized operations without reallocation. Buffers with zero capacity are discarded.
Thread Safety: Safe for concurrent calls.
It's safe to call Put multiple times with the same buffer, though only the first call will be effective (subsequent calls operate on a reset buffer).
Parameters:
- buf: The byte slice to return to the pool
type Error ¶
type Error struct {
Message string // Human-readable error message
Location models.Location // Position where the error occurred (1-based)
}
Error represents a tokenization error with precise location information.
This type provides structured error reporting with line and column positions, making it easy for users to identify and fix SQL syntax issues.
Note: Modern code should use the errors from pkg/errors package instead, which provide more comprehensive error categorization and context. This type is maintained for backward compatibility.
Example:
if err != nil {
if tokErr, ok := err.(*tokenizer.Error); ok {
fmt.Printf("Tokenization failed at line %d, column %d: %s\n",
tokErr.Location.Line, tokErr.Location.Column, tokErr.Message)
}
}
func ErrorInvalidIdentifier ¶
ErrorInvalidIdentifier creates an error for an invalid identifier.
This is used when an identifier has invalid syntax, such as:
- Starting with a digit (when not quoted)
- Containing invalid characters
- Unterminated quoted identifier
Parameters:
- value: The invalid identifier string
- location: Position where the identifier started
Returns an Error describing the invalid identifier.
Example: "invalid identifier: 123abc at line 2, column 8"
func ErrorInvalidNumber ¶
ErrorInvalidNumber creates an error for an invalid number format.
This is used when a number token has invalid syntax, such as:
- Decimal point without digits: "123."
- Exponent without digits: "123e"
- Multiple decimal points: "12.34.56"
Parameters:
- value: The invalid number string
- location: Position where the number started
Returns an Error describing the invalid number format.
Example: "invalid number format: 123.e at line 1, column 10"
func ErrorInvalidOperator ¶
ErrorInvalidOperator creates an error for an invalid operator.
This is used when an operator token has invalid syntax, such as:
- Incomplete multi-character operators
- Invalid operator combinations
Parameters:
- value: The invalid operator string
- location: Position where the operator started
Returns an Error describing the invalid operator.
Example: "invalid operator: <=> at line 1, column 20"
func ErrorUnexpectedChar ¶
ErrorUnexpectedChar creates an error for an unexpected character.
This is used when the tokenizer encounters a character that cannot start any valid token in the current context.
Parameters:
- ch: The unexpected character (byte)
- location: Position where the character was found
Returns an Error describing the unexpected character.
Example: "unexpected character: @ at line 2, column 5"
func ErrorUnterminatedString ¶
ErrorUnterminatedString creates an error for an unterminated string literal.
This occurs when a string literal (single or double quoted) is not properly closed before the end of the line or input.
Parameters:
- location: Position where the string started
Returns an Error indicating the string was not terminated.
Example: "unterminated string literal at line 3, column 15"
type Position ¶
type Position struct {
Line int // Current line number (1-based)
Index int // Current byte offset into input (0-based)
Column int // Current column number (1-based)
LastNL int // Byte offset of last newline (for efficient column calculation)
}
Position tracks the scanning cursor position during tokenization. It maintains both absolute byte offset and human-readable line/column coordinates for precise error reporting and token span tracking.
Coordinate System:
- Line: 1-based (first line is line 1)
- Column: 1-based (first column is column 1)
- Index: 0-based byte offset into input (first byte is index 0)
- LastNL: Byte offset of most recent newline (for column calculation)
Zero-Copy Design: Position operates on byte indices rather than rune indices for performance. UTF-8 decoding happens only when needed during character scanning.
Thread Safety: Position is not thread-safe. Each Tokenizer instance should have its own Position that is not shared across goroutines.
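The line-start bookkeeping described above enables cheap offset-to-coordinate conversion: binary-search for the last line start at or before the byte index, and the column is the distance from that start plus one. The `locate` helper below is an illustrative byte-based sketch of that idea (real column calculation must also account for multi-byte runes, as noted above):

```go
package main

import (
	"fmt"
	"sort"
)

// locate converts a 0-based byte index into 1-based line/column using the
// byte offsets at which each line begins.
func locate(lineStarts []int, index int) (line, col int) {
	// Find the first line start after index; the line containing
	// index is the one just before it.
	i := sort.SearchInts(lineStarts, index+1) - 1
	return i + 1, index - lineStarts[i] + 1
}

func main() {
	// "SELECT *\nFROM users": line 2 begins at byte offset 9.
	lineStarts := []int{0, 9}
	line, col := locate(lineStarts, 9) // the 'F' of FROM
	fmt.Println(line, col)             // 2 1
}
```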
func NewPosition ¶
NewPosition creates a new Position with the specified line and byte index. The column is initialized to 1 (first column).
Parameters:
- line: Line number (1-based, typically starts at 1)
- index: Byte offset into input (0-based, typically starts at 0)
Returns a Position ready for use in tokenization.
func (*Position) AdvanceN ¶
AdvanceN moves the position forward by n bytes and recalculates the line and column numbers using the provided line start indices.
This is used when jumping forward in the input (e.g., after skipping a comment block) where individual rune tracking would be inefficient.
Parameters:
- n: Number of bytes to advance
- lineStarts: Slice of byte offsets where each line starts (from tokenizer)
Performance: O(L) where L is the number of lines in lineStarts. For typical SQL queries with few lines, this is effectively O(1).
If n <= 0, this is a no-op.
func (*Position) AdvanceRune ¶
AdvanceRune moves the position forward by one UTF-8 rune. This updates the byte index, line number, and column number appropriately.
Newline Handling: When r is '\n', the line number increments and the column resets to 1.
Parameters:
- r: The rune being consumed (used to detect newlines)
- size: The byte size of the rune in UTF-8 encoding
Performance: O(1) operation, no string allocations.
Example:
r, size := utf8.DecodeRune(input[pos.Index:])
pos.AdvanceRune(r, size) // Move past this rune
func (Position) Clone ¶
Clone creates a copy of this Position. The returned Position is independent and can be modified without affecting the original.
This is useful when you need to save a position (e.g., for backtracking during compound keyword parsing) and then potentially restore it.
Returns a new Position with identical values.
func (Position) Location ¶
Location converts this Position to a models.Location using the tokenizer's line tracking information for accurate column calculation.
This method uses the tokenizer's lineStarts slice to calculate the exact column position, accounting for variable-width UTF-8 characters and tabs.
Returns a models.Location with 1-based line and column numbers.
type StringLiteralReader ¶
type StringLiteralReader struct {
// contains filtered or unexported fields
}
StringLiteralReader handles reading of string literals with proper escape sequence handling.
func NewStringLiteralReader ¶
func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader
NewStringLiteralReader creates a new StringLiteralReader.
func (*StringLiteralReader) ReadStringLiteral ¶
func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)
ReadStringLiteral reads a string literal with proper escape sequence handling.
type Tokenizer ¶
type Tokenizer struct {
Comments []models.Comment // Comments captured during tokenization
// contains filtered or unexported fields
}
Tokenizer provides high-performance SQL tokenization with zero-copy operations. It converts raw SQL bytes into a stream of tokens with precise position tracking.
Features:
- Zero-copy operations on input byte slices (no string allocations)
- Precise line/column tracking for error reporting (1-based indexing)
- Unicode support for international SQL queries
- PostgreSQL operator support (JSON, array, text search operators)
- DoS protection with input size and token count limits
Thread Safety:
- Individual instances are NOT safe for concurrent use
- Use GetTokenizer/PutTokenizer for safe pooling across goroutines
- Each goroutine should use its own Tokenizer instance
Memory Management:
- Reuses internal buffers to minimize allocations
- Preserves slice capacity across Reset() calls
- Integrates with sync.Pool for instance reuse
Usage:
// With pooling (recommended for production)
tkz := GetTokenizer()
defer PutTokenizer(tkz)
tokens, err := tkz.Tokenize([]byte(sql))

// Without pooling (simple usage)
tkz, _ := New()
tokens, err := tkz.Tokenize([]byte(sql))
func GetTokenizer ¶
func GetTokenizer() *Tokenizer
GetTokenizer retrieves a Tokenizer instance from the pool.
This is the recommended way to obtain a Tokenizer for production use. The returned tokenizer is reset and ready for use.
Thread Safety: Safe for concurrent calls from multiple goroutines. Each call returns a separate instance.
Memory Management: Always pair with PutTokenizer() using defer to ensure the instance is returned to the pool, even if errors occur.
Metrics: Records pool get operations for monitoring pool efficiency.
Example:
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz) // MANDATORY - ensures pool return
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
return err // defer ensures PutTokenizer is called
}
// Process tokens...
Performance: 95%+ hit rate means most calls reuse existing instances rather than allocating new ones, providing significant performance benefits.
func New ¶
New creates a new Tokenizer with default configuration and keyword support. The returned tokenizer is ready to use for tokenizing SQL statements.
For production use, prefer GetTokenizer() which uses object pooling for better performance and reduced allocations.
Returns an error only if keyword initialization fails (extremely rare).
Example:
tkz, err := tokenizer.New()
if err != nil {
return err
}
tokens, err := tkz.Tokenize([]byte("SELECT * FROM users"))
func NewWithDialect ¶ added in v1.9.3
func NewWithDialect(dialect keywords.SQLDialect) (*Tokenizer, error)
NewWithDialect creates a new Tokenizer configured for the given SQL dialect. Dialect-specific keywords are recognized based on the dialect parameter. If dialect is empty or unknown, defaults to DialectPostgreSQL.
func NewWithKeywords ¶
NewWithKeywords initializes a Tokenizer with a custom keyword classifier. This allows you to customize keyword recognition for specific SQL dialects or to add custom keywords.
The keywords parameter must not be nil.
Returns an error if keywords is nil.
Example:
kw := keywords.NewKeywords()
// Customize keywords as needed...
tkz, err := tokenizer.NewWithKeywords(kw)
if err != nil {
return err
}
func (*Tokenizer) Dialect ¶ added in v1.9.3
func (t *Tokenizer) Dialect() keywords.SQLDialect
Dialect returns the SQL dialect configured for this tokenizer.
func (*Tokenizer) Reset ¶
func (t *Tokenizer) Reset()
Reset clears a Tokenizer's state for reuse while preserving allocated memory.
This method is called automatically by PutTokenizer() and generally should not be called directly by users. It's exposed for advanced use cases where you want to reuse a tokenizer instance without going through the pool.
Memory Optimization:
- Clears input reference (allows GC of SQL bytes)
- Resets position tracking to initial values
- Preserves lineStarts slice capacity (avoids reallocation)
- Clears debug logger reference
State After Reset:
- pos: Line 1, Column 0, Index 0
- lineStarts: Reset to the initial [0] entry with capacity preserved
- input: nil (ready for new input)
- keywords: Preserved (immutable, no need to reset)
- logger: nil (must be set again if needed)
Performance: By preserving slice capacity, subsequent Tokenize() calls avoid reallocation of lineStarts for similarly-sized inputs.
func (*Tokenizer) SetDialect ¶ added in v1.9.3
func (t *Tokenizer) SetDialect(dialect keywords.SQLDialect)
SetDialect reconfigures the tokenizer for a different SQL dialect. This rebuilds the keyword set to include dialect-specific keywords.
func (*Tokenizer) SetLogger ¶ added in v1.9.3
SetLogger configures a structured logger for verbose tracing during tokenization. The logger receives slog.Debug messages for each token produced, which is useful for diagnosing tokenization issues or understanding token stream structure.
Pass nil to disable debug logging (the default).
Logging is guarded by slog.LevelDebug checks so there is no performance cost when the handler's minimum level is above Debug.
Example:
tkz := tokenizer.GetTokenizer()
tkz.SetLogger(slog.Default())
tokens, _ := tkz.Tokenize([]byte(sql))
To disable:
tkz.SetLogger(nil)
Thread Safety: The logger may be called from multiple goroutines if tokenizers are used concurrently. *slog.Logger is safe for concurrent use.
func (*Tokenizer) Tokenize ¶
func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)
Tokenize converts raw SQL bytes into a slice of tokens with position information.
This is the main entry point for tokenization. It performs zero-copy tokenization directly on the input byte slice and returns tokens with precise start/end positions.
Performance: 8M+ tokens/sec sustained throughput with zero-copy operations.
DoS Protection:
- Input size limited to MaxInputSize (10MB default)
- Token count limited to MaxTokens (1M default)
- Returns errors if limits exceeded
Position Tracking:
- All positions are 1-based (first line is 1, first column is 1)
- Start position is inclusive, end position is exclusive
- Position information preserved for all tokens including EOF
Error Handling:
- Returns structured errors with precise position information
- Common errors: UnterminatedStringError, UnexpectedCharError, InvalidNumberError
- Errors include line/column location and context
Parameters:
- input: Raw SQL bytes to tokenize (not modified, zero-copy reference)
Returns:
- []models.TokenWithSpan: Slice of tokens with position spans (includes EOF token)
- error: Tokenization error with position information, or nil on success
Example:
tkz := GetTokenizer()
defer PutTokenizer(tkz)
sql := "SELECT id, name FROM users WHERE active = true"
tokens, err := tkz.Tokenize([]byte(sql))
if err != nil {
	// Guard the assertion: err may not be an errors.TokenizerError
	if tokErr, ok := err.(errors.TokenizerError); ok {
		log.Printf("Tokenization error at line %d: %v",
			tokErr.Location.Line, err)
	}
	return err
}
for _, tok := range tokens {
fmt.Printf("Token: %s (type: %v) at %d:%d\n",
tok.Token.Value, tok.Token.Type,
tok.Start.Line, tok.Start.Column)
}
PostgreSQL Operators (v1.6.0):
sql := "SELECT data->'field' FROM table WHERE config @> '{\"key\":\"value\"}'"
tokens, _ := tkz.Tokenize([]byte(sql))
// Produces tokens for: -> (JSON field access), @> (JSONB contains)
Unicode Support:
sql := "SELECT 名前 FROM ユーザー WHERE 'こんにちは'"
tokens, _ := tkz.Tokenize([]byte(sql))
// Correctly tokenizes Unicode identifiers and string literals
func (*Tokenizer) TokenizeContext ¶ added in v1.5.0
func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)
TokenizeContext processes the input and returns tokens with context support for cancellation. It checks the context at regular intervals (every 100 tokens) to enable fast cancellation. Returns context.Canceled or context.DeadlineExceeded when the context is cancelled.
This method is useful for:
- Long-running tokenization operations that need to be cancellable
- Implementing timeouts for tokenization
- Graceful shutdown scenarios
Example:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.TokenizeContext(ctx, []byte(sql))
if err == context.DeadlineExceeded {
// Handle timeout
}