Documentation ¶
Overview ¶
Package tokenizer provides a pure-Go DeepSeek V4 BPE tokenizer for exact token counting. It is for counting only: it is never used on the wire or for cache-fingerprint computation.
The embedded vocabulary is the official DeepSeek V4 tokenizer.json, gzipped. It is decompressed and parsed lazily on first use via sync.Once.
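The lazy-loading pattern described above can be sketched roughly as follows. This is an illustration only, not the package's actual code: the names vocabGz, loadVocab, and loads are hypothetical, the vocabulary is a two-entry stand-in rather than the real tokenizer.json, and the data is gzipped in memory instead of being embedded with go:embed.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"sync"
)

// vocabGz stands in for the embedded, gzipped tokenizer.json (hypothetical;
// a real package would more likely embed the file with go:embed).
var vocabGz = mustGzip([]byte(`{"hello":0,"world":1}`))

var (
	loadOnce sync.Once
	vocab    map[string]int
	loadErr  error
	loads    int // how many times the loader body ran; for illustration only
)

// mustGzip compresses data in memory so the example needs no external file.
func mustGzip(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// loadVocab decompresses and parses the vocabulary at most once, no matter
// how many callers or goroutines ask for it.
func loadVocab() (map[string]int, error) {
	loadOnce.Do(func() {
		loads++
		zr, err := gzip.NewReader(bytes.NewReader(vocabGz))
		if err != nil {
			loadErr = err
			return
		}
		defer zr.Close()
		raw, err := io.ReadAll(zr)
		if err != nil {
			loadErr = err
			return
		}
		loadErr = json.Unmarshal(raw, &vocab)
	})
	return vocab, loadErr
}

func main() {
	for i := 0; i < 3; i++ {
		v, err := loadVocab()
		fmt.Println(len(v), err, loads)
	}
}
```

Storing the error next to the parsed data means a failed load is reported consistently on every later call, which is one plausible way to back an availability check that never panics.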
Index ¶
- Constants
- func Available() bool
- func Count(text string) int
- func CountBounded(text string, maxChars int) int
- func CountExact(text string) int
- func CountMessages(messages []llm.Message) (n int, ok bool)
- func Encode(text string) []int
- func FormatPrompt(messages []llm.Message, dropThinking bool) string
Constants ¶
const DefaultBoundedTokenizeChars = 2 * 1024
DefaultBoundedTokenizeChars is the default max-chars cap for CountBounded.
Variables ¶
This section is empty.
Functions ¶
func Available ¶
func Available() bool
Available reports whether the embedded tokenizer loaded successfully. Never panics. False means callers should fall back to the heuristic estimator.
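A caller-side fallback along the lines suggested above might look like the sketch below. The counter struct, tokens, and estimate are hypothetical names introduced for illustration; the function fields stand in for tokenizer.Available and tokenizer.CountExact so the example compiles on its own, and the 3-tokens-per-10-bytes heuristic is an assumption, not the estimator any particular caller uses.

```go
package main

import "fmt"

// counter bundles the two calls the pattern needs. In real code these would
// be tokenizer.Available and tokenizer.CountExact; here they are stand-ins.
type counter struct {
	available  func() bool
	countExact func(string) int
}

// estimate is a hypothetical heuristic: roughly 3 tokens per 10 bytes.
func estimate(text string) int {
	return len(text) * 3 / 10
}

// tokens returns an exact count when the tokenizer is available and a
// heuristic estimate otherwise; exact reports which path was taken.
func tokens(c counter, text string) (n int, exact bool) {
	if c.available() {
		return c.countExact(text), true
	}
	return estimate(text), false
}

func main() {
	// Simulate a tokenizer whose embedded vocabulary failed to load.
	down := counter{
		available:  func() bool { return false },
		countExact: func(string) int { return 0 },
	}
	n, exact := tokens(down, "0123456789")
	fmt.Println(n, exact)
}
```

Returning the exact flag alongside the count lets downstream code decide whether to add a safety margin to an estimated value.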
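func CountBounded ¶
func CountBounded(text string, maxChars int) int
CountBounded counts exactly when len(text) <= maxChars; otherwise it estimates the count by tokenizing a head sample and a tail sample and extrapolating their token ratio to the full text. If maxChars <= 0, it falls back to a 0.3*len(text) estimate.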
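One way head+tail sampling with ratio extrapolation could work is sketched below. This is not the package's implementation: exactCount is a stand-in that counts whitespace-separated words instead of BPE tokens, the even split of the budget between head and tail is an assumption, and the sketch slices by bytes, so it may cut a multi-byte UTF-8 character at a sample boundary.

```go
package main

import (
	"fmt"
	"strings"
)

// exactCount is a stand-in for a real tokenizer: it counts
// whitespace-separated words rather than BPE tokens.
func exactCount(text string) int {
	return len(strings.Fields(text))
}

// countBounded counts exactly for short inputs. For longer inputs it counts
// only the first and last maxChars/2 bytes and scales that tokens-per-byte
// ratio up to the full length.
func countBounded(text string, maxChars int) int {
	if maxChars <= 0 {
		return len(text) * 3 / 10 // the documented 0.3*len fallback
	}
	if len(text) <= maxChars {
		return exactCount(text)
	}
	half := maxChars / 2
	if half == 0 {
		half = 1
	}
	// len(text) > maxChars >= 2*half here, so the samples never overlap.
	head, tail := text[:half], text[len(text)-half:]
	sampled := exactCount(head) + exactCount(tail)
	return sampled * len(text) / (2 * half)
}

func main() {
	text := strings.Repeat("ab ", 1000) // 3000 bytes, 1000 words
	fmt.Println(exactCount(text), countBounded(text, 600), countBounded(text, 0))
}
```

Sampling both ends rather than only the head helps when a text changes character partway through, for example prose followed by a long code block, though any sampled estimate can still be off for strongly non-uniform input.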
func CountExact ¶ added in v0.3.7
func CountExact(text string) int
CountExact returns the exact token count of text, always equal to len(Encode(text)). It returns 0 for "" or when the tokenizer is unavailable.
func CountMessages ¶
func CountMessages(messages []llm.Message) (n int, ok bool)
CountMessages returns the exact token count of the rendered conversation (no sampling) with ok=true when the tokenizer is available, and (0, false) otherwise. It uses CountExact internally, so the count is exact at any conversation size.
Types ¶
This section is empty.