Recipe 12: Vision / Multimodal Inference
Analyze images using a vision-capable GGUF model. The image is passed alongside a text prompt using the inference.Message API, the same format used by the OpenAI-compatible /v1/chat/completions endpoint.
Requirements:
- A vision-capable GGUF model (e.g. LLaVA, Gemma 3 with vision encoder)
Usage:
go run ./docs/cookbook/12-vision-multimodal/ --model path/to/vision-model.gguf --image photo.jpg
go run ./docs/cookbook/12-vision-multimodal/ --model path/to/vision-model.gguf --image photo.jpg --prompt "Count the objects"