These examples demonstrate Zerfoo's core value: embeddable ML inference in pure Go. Each example is a standalone program you can build and run with go build.
Load a GGUF model and generate text from a prompt. Demonstrates the core inference.LoadFile and model.Generate API with sampling options (temperature, top-K, top-P) and token streaming.
Embed inference inside a custom Go HTTP handler. Demonstrates the pattern of loading a model once at startup and serving many concurrent requests through your own routing and request/response types.
Start an OpenAI-compatible HTTP server backed by a GGUF model. Demonstrates serve.NewServer with graceful shutdown. Drop-in replacement for any OpenAI client.
Grammar-guided decoding that constrains model output to valid JSON matching a predefined schema. Useful for structured data extraction and tool-calling pipelines.
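Grammar-guided decoding enforces the schema token by token during generation; that machinery is in the example itself. As a rough illustration of the *contract* it guarantees, the sketch below checks a finished output against a simple field-name/type schema (this validation step is illustrative, not Zerfoo's grammar engine):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// matchesSchema reports whether raw is valid JSON containing every
// required field with the expected JSON type. Grammar-guided decoding
// makes this hold by construction, so no retry loop is needed.
func matchesSchema(raw string, required map[string]string) bool {
	var obj map[string]any
	if err := json.Unmarshal([]byte(raw), &obj); err != nil {
		return false // not valid JSON at all
	}
	for field, kind := range required {
		v, ok := obj[field]
		if !ok {
			return false // required field missing
		}
		switch kind {
		case "string":
			if _, ok := v.(string); !ok {
				return false
			}
		case "number":
			// encoding/json decodes all JSON numbers to float64.
			if _, ok := v.(float64); !ok {
				return false
			}
		}
	}
	return true
}

func main() {
	schema := map[string]string{"name": "string", "age": "number"}
	fmt.Println(matchesSchema(`{"name":"Ada","age":36}`, schema)) // true
	fmt.Println(matchesSchema(`{"name":"Ada"}`, schema))          // false
}
```

Without constrained decoding you would loop on validate-and-retry; with it, the model simply cannot emit a token that leads outside the grammar.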
Retrieval-augmented generation pattern: embed a document corpus, retrieve the most relevant documents via cosine similarity, and generate answers grounded in those facts using model.Embed and model.Chat.
Each example expects a GGUF model file; pass its path as the first argument.
Running an Example
```sh
# Build and run the inference example
go build -o inference ./examples/inference/
./inference path/to/model.gguf "What is the capital of France?"

# With GPU acceleration (CUDA is selected automatically when available;
# --device cuda forces it)
./inference --device cuda path/to/model.gguf "What is the capital of France?"
```
Further Reading
See docs/getting-started.md for a full tutorial covering CLI usage, library API, and the OpenAI-compatible server.