tokenizer

package

v0.4.1 Published: Jun 8, 2026 License: MIT Imports: 13 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/amemiya02/deepseekcode

Documentation ¶

Overview ¶

Package tokenizer provides a pure-Go DeepSeek V4 BPE tokenizer for exact token counting. It is used only for counting: its output is never sent on the wire or used for cache-fingerprint computation.

The embedded vocabulary is the official DeepSeek V4 tokenizer.json, gzipped. It is decompressed and parsed lazily on first use via sync.Once.
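
The lazy-loading pattern described above can be sketched as follows. This is a minimal illustration and not the package's actual source: the names compressedVocab, load, available, and loads are hypothetical, and a tiny flat JSON map gzipped at start-up stands in for the embedded tokenizer.json (whose real schema is not shown here).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"sync"
)

// compressedVocab stands in for the embedded, gzipped vocabulary.
// Assumption: a toy flat map is used so the sketch is self-contained.
var compressedVocab = mustGzip([]byte(`{"hello":0,"world":1}`))

var (
	loadOnce sync.Once
	vocab    map[string]int // parsed vocabulary, set at most once
	loadErr  error          // error from the single load attempt, if any
	loads    int            // how many times the loader body actually ran
)

// mustGzip compresses raw and panics on failure (setup helper only).
func mustGzip(raw []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// load decompresses and parses the vocabulary on the first call.
// Later calls return the cached result without repeating the work.
func load() (map[string]int, error) {
	loadOnce.Do(func() {
		loads++
		zr, err := gzip.NewReader(bytes.NewReader(compressedVocab))
		if err != nil {
			loadErr = err
			return
		}
		defer zr.Close()
		loadErr = json.NewDecoder(zr).Decode(&vocab)
	})
	return vocab, loadErr
}

// available reports load success as a boolean instead of panicking.
func available() bool {
	_, err := load()
	return err == nil
}

func main() {
	fmt.Println("available:", available())
	v, _ := load()
	fmt.Println("entries:", len(v), "loader runs:", loads)
}
```

Because sync.Once also records a failed attempt as done, a load error in this sketch is sticky: every later call reports the same error rather than retrying.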

Index ¶

Constants
func Available() bool
func Count(text string) int
func CountBounded(text string, maxChars int) int
func CountExact(text string) int
func CountMessages(messages []llm.Message) (n int, ok bool)
func Encode(text string) []int
func FormatPrompt(messages []llm.Message, dropThinking bool) string

Constants ¶

const DefaultBoundedTokenizeChars = 2 * 1024

DefaultBoundedTokenizeChars is the default max-chars cap for CountBounded.

Variables ¶

This section is empty.

Functions ¶

func Available ¶

func Available() bool

Available reports whether the embedded tokenizer loaded successfully. Never panics. False means callers should fall back to the heuristic estimator.
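
A caller-side fallback might look like the sketch below. To keep it runnable without the module, a hypothetical tokenCounter struct stands in for the package-level Available and Count functions, and heuristicEstimate is an assumed estimator (about 0.3 tokens per byte); the heuristic a real caller uses may differ.

```go
package main

import "fmt"

// tokenCounter is a hypothetical seam standing in for the package-level
// Available and Count functions, so the sketch runs on its own.
type tokenCounter struct {
	Available func() bool
	Count     func(text string) int
}

// heuristicEstimate is an assumed fallback of roughly 0.3 tokens per byte.
// Integer arithmetic avoids floating-point rounding surprises.
func heuristicEstimate(text string) int {
	return len(text) * 3 / 10
}

// countOrEstimate prefers the exact count and reports whether it was exact.
func countOrEstimate(tc tokenCounter, text string) (n int, exact bool) {
	if tc.Available() {
		return tc.Count(text), true
	}
	return heuristicEstimate(text), false
}

// unavailable simulates a tokenizer whose embedded data failed to load.
var unavailable = tokenCounter{
	Available: func() bool { return false },
	Count:     func(string) int { return 0 },
}

func main() {
	n, exact := countOrEstimate(unavailable, "hello, world") // 12 bytes
	fmt.Println(n, exact)
}
```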

func Count ¶

func Count(text string) int

Count returns len(Encode(text)). Returns 0 for "".

func CountBounded ¶

func CountBounded(text string, maxChars int) int

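CountBounded returns the exact count when len(text) <= maxChars. Otherwise it estimates the count by tokenizing samples from the head and tail of text and extrapolating the token-to-length ratio to the full input. If maxChars <= 0, it falls back to the 0.3*len heuristic.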
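
One way to picture head+tail sampling with ratio extrapolation is the sketch below. It is not the package's implementation: countBounded, wordCount, and sample are hypothetical names, a whitespace word counter stands in for the real tokenizer, the even head/tail split is an assumption, and the 0.3*len fallback is read here as 0.3 times len(text).

```go
package main

import (
	"fmt"
	"strings"
)

// wordCount is a stand-in for an exact tokenizer: it counts
// whitespace-separated fields rather than BPE tokens.
func wordCount(s string) int { return len(strings.Fields(s)) }

// countBounded counts exactly for short inputs and extrapolates from
// head and tail samples for long ones. The split of maxChars into two
// equal halves is an assumption of this sketch.
func countBounded(text string, maxChars int, count func(string) int) int {
	if text == "" {
		return 0
	}
	if maxChars <= 0 {
		return len(text) * 3 / 10 // assumed reading of the 0.3*len fallback
	}
	if len(text) <= maxChars {
		return count(text)
	}
	half := maxChars / 2
	if half == 0 {
		half = 1
	}
	// Byte slicing may cut a multi-byte UTF-8 rune; a production version
	// would likely align the cut points to rune boundaries.
	head := text[:half]
	tail := text[len(text)-half:]
	sampled := count(head) + count(tail)
	// Scale the sampled tokens-per-byte ratio up to the full length.
	return sampled * len(text) / (2 * half)
}

// sample is 300 bytes containing exactly 100 words.
var sample = strings.Repeat("ab ", 100)

func main() {
	fmt.Println(countBounded(sample, 60, wordCount))   // estimated from 60 sampled bytes
	fmt.Println(countBounded(sample, 1000, wordCount)) // counted exactly
}
```

On uniform input such as sample, the estimate matches the exact count; on text whose token density varies between the middle and the ends, the extrapolation can drift, which is the trade-off for bounded work.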

func CountExact ¶ added in v0.3.7

func CountExact(text string) int

CountExact returns the exact token count of text, identical to len(Encode(text)). Returns 0 for "" or when the tokenizer is unavailable.

func CountMessages ¶

func CountMessages(messages []llm.Message) (n int, ok bool)

CountMessages returns the exact token count of the rendered conversation, with no sampling, and ok=true when the tokenizer is available; otherwise it returns (0, false). It uses CountExact internally, so the count stays exact at any conversation size.

func Encode ¶

func Encode(text string) []int

Encode is the exported wrapper around loaded.Encode for use by test helpers and generators. Returns nil if the tokenizer is unavailable.

func FormatPrompt ¶

func FormatPrompt(messages []llm.Message, dropThinking bool) string

FormatPrompt renders messages through the V4 chat template. dropThinking strips reasoning from assistant turns before the last user turn.
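
The dropThinking rule can be illustrated in isolation. The sketch below does not render the V4 chat template and does not use the real llm.Message type, whose fields are not shown on this page: message, its Role, Content, and Reasoning fields, and dropEarlierThinking are all hypothetical.

```go
package main

import "fmt"

// message is a hypothetical stand-in for llm.Message.
type message struct {
	Role      string // "user" or "assistant" in this sketch
	Content   string
	Reasoning string // assistant reasoning ("thinking") text, if any
}

// dropEarlierThinking returns a copy of msgs in which assistant turns
// that come before the last user turn have their Reasoning cleared.
// Turns after the last user turn keep their reasoning.
func dropEarlierThinking(msgs []message) []message {
	lastUser := -1
	for i, m := range msgs {
		if m.Role == "user" {
			lastUser = i
		}
	}
	out := make([]message, len(msgs))
	copy(out, msgs) // leave the caller's slice untouched
	for i := 0; i < lastUser; i++ {
		if out[i].Role == "assistant" {
			out[i].Reasoning = ""
		}
	}
	return out
}

// demo is a two-round conversation with reasoning on both assistant turns.
var demo = []message{
	{Role: "user", Content: "q1"},
	{Role: "assistant", Content: "a1", Reasoning: "r1"},
	{Role: "user", Content: "q2"},
	{Role: "assistant", Content: "a2", Reasoning: "r2"},
}

func main() {
	for _, m := range dropEarlierThinking(demo) {
		fmt.Printf("%s content=%q reasoning=%q\n", m.Role, m.Content, m.Reasoning)
	}
}
```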

Types ¶

This section is empty.
