Documentation ¶
Overview ¶
Package serve provides an OpenAI-compatible HTTP API server for model inference.
The server exposes REST endpoints that follow the OpenAI API specification, enabling drop-in compatibility with existing OpenAI client libraries and tools.
Creating a Server ¶
Use NewServer to create a server for a loaded model:
	m, err := inference.LoadGGUF("model.gguf")
	if err != nil {
		log.Fatal(err)
	}
	srv := serve.NewServer(m,
		serve.WithLogger(logger),
		serve.WithMetrics(collector),
	)
	log.Fatal(http.ListenAndServe(":8080", srv.Handler()))
The returned Server is configured with functional options:
- WithLogger sets structured request logging.
- WithMetrics enables Prometheus-compatible metrics collection.
- WithDraftModel enables speculative decoding with a smaller draft model.
- WithBatchScheduler routes non-streaming requests through a BatchScheduler for higher throughput.
Endpoints ¶
The server registers the following HTTP routes:
	POST   /v1/chat/completions      Chat completion (streaming and non-streaming)
	POST   /v1/completions           Text completion (streaming and non-streaming)
	POST   /v1/embeddings            Text embeddings
	POST   /v1/audio/transcriptions  Audio transcription (requires WithTranscriber)
	GET    /v1/models                List loaded models
	GET    /v1/models/{id}           Get model info
	DELETE /v1/models/{id}           Unload a model
	GET    /openapi.yaml             OpenAPI specification
	GET    /metrics                  Prometheus metrics
SSE Streaming ¶
When a chat or text completion request sets "stream": true, the server responds with Server-Sent Events (SSE). Each event contains a JSON chunk with incremental tokens. The stream terminates with a "data: [DONE]" sentinel.
Tool Calling ¶
Chat completion requests may include OpenAI-compatible tool definitions. The server validates tool schemas, detects tool calls in model output via DetectToolCall, and returns structured tool_calls in the response. Tool choice can be set to "auto", "none", or forced to a specific function.
Structured Output ¶
Chat completions support response_format with "json_schema" type for grammar-constrained decoding, ensuring model output conforms to a provided JSON Schema.
Batch Scheduling ¶
A BatchScheduler groups incoming non-streaming requests into batches for efficient GPU utilization. Create one with NewBatchScheduler, configure it with a BatchConfig, and pass it to WithBatchScheduler:
	bs := serve.NewBatchScheduler(serve.BatchConfig{
		MaxBatchSize: 8,
		BatchTimeout: 10 * time.Millisecond,
	})
	bs.Start()
	defer bs.Stop()
	srv := serve.NewServer(m, serve.WithBatchScheduler(bs))
Metrics ¶
The GET /metrics endpoint exposes metrics in the Prometheus text exposition format:
- requests_total: total number of completed requests (counter)
- tokens_generated_total: total tokens generated (counter)
- tokens_per_second: rolling token generation rate (gauge)
- request_latency_ms: request latency histogram with configurable buckets
Metrics are collected through the runtime.Collector interface passed via WithMetrics.
Graceful Shutdown ¶
Call Server.Close to gracefully stop the server, which drains the batch scheduler if one is attached.
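The drain sequence can be sketched with the standard library alone. serveAndStop is a hypothetical helper; the comment marks where Server.Close would fit in a real deployment.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serveAndStop starts an HTTP server and then drains it: stop accepting
// connections, let in-flight requests finish within the deadline, and
// finally release backend resources (where srv.Close would be called).
func serveAndStop() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	hs := &http.Server{Handler: http.NewServeMux()}
	go hs.Serve(ln)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// After Shutdown returns, srv.Close() would drain the batch scheduler.
	return hs.Shutdown(ctx)
}

func main() { fmt.Println(serveAndStop()) }
```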
Index ¶
- type BatchConfig
- type BatchHandler
- type BatchRequest
- type BatchResult
- type BatchScheduler
- type ChatCompletionChoice
- type ChatCompletionRequest
- type ChatCompletionResponse
- type ChatMessage
- type CompletionChoice
- type CompletionRequest
- type CompletionResponse
- type ContentPart
- type EmbeddingObject
- type EmbeddingRequest
- type EmbeddingResponse
- type ImageURL
- type JSONSchemaFormat
- type ModelDeleteResponse
- type ModelListResponse
- type ModelObject
- type ResponseFormat
- type Server
- type ServerMetrics
- type ServerOption
- type Tool
- type ToolCall
- type ToolCallFunction
- type ToolCallResult
- type ToolChoice
- type ToolChoiceFunction
- type ToolFunction
- type Transcriber
- type TranscriptionResponse
- type UsageInfo
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BatchConfig ¶
type BatchConfig struct {
	MaxBatchSize int
	BatchTimeout time.Duration
	Handler      BatchHandler
}
BatchConfig configures the batch scheduler.
type BatchHandler ¶
type BatchHandler func(ctx context.Context, reqs []BatchRequest) []BatchResult
BatchHandler processes a batch of requests and returns results. The results slice must have the same length as the requests slice.
type BatchRequest ¶
BatchRequest represents a single inference request in a batch.
type BatchResult ¶
BatchResult holds the result for a single request in a batch.
type BatchScheduler ¶
type BatchScheduler struct {
	// contains filtered or unexported fields
}
BatchScheduler collects incoming requests into batches for efficient processing.
func NewBatchScheduler ¶
func NewBatchScheduler(config BatchConfig) *BatchScheduler
NewBatchScheduler creates a new batch scheduler.
func (*BatchScheduler) Start ¶
func (s *BatchScheduler) Start()
Start begins the batch collection loop.
func (*BatchScheduler) Stop ¶
func (s *BatchScheduler) Stop()
Stop gracefully shuts down the scheduler.
func (*BatchScheduler) Submit ¶
func (s *BatchScheduler) Submit(ctx context.Context, req BatchRequest) (BatchResult, error)
Submit adds a request to the next batch and waits for the result.
type ChatCompletionChoice ¶
type ChatCompletionChoice struct {
	Index        int         `json:"index"`
	Message      ChatMessage `json:"message"`
	FinishReason string      `json:"finish_reason"`
	ToolCalls    []ToolCall  `json:"tool_calls,omitempty"`
}
ChatCompletionChoice is a single choice in the response.
type ChatCompletionRequest ¶
type ChatCompletionRequest struct {
	Model          string          `json:"model"`
	Messages       []ChatMessage   `json:"messages"`
	Temperature    *float64        `json:"temperature,omitempty"`
	TopP           *float64        `json:"top_p,omitempty"`
	MaxTokens      *int            `json:"max_tokens,omitempty"`
	Stream         bool            `json:"stream"`
	Tools          []Tool          `json:"tools,omitempty"`
	ToolChoice     *ToolChoice     `json:"tool_choice,omitempty"`
	ResponseFormat *ResponseFormat `json:"response_format,omitempty"`
}
ChatCompletionRequest represents the OpenAI chat completion request.
type ChatCompletionResponse ¶
type ChatCompletionResponse struct {
	ID      string                 `json:"id"`
	Object  string                 `json:"object"`
	Created int64                  `json:"created"`
	Model   string                 `json:"model"`
	Choices []ChatCompletionChoice `json:"choices"`
	Usage   UsageInfo              `json:"usage"`
}
ChatCompletionResponse is the non-streaming response.
type ChatMessage ¶
type ChatMessage struct {
	Role      string     `json:"role"`
	Content   string     `json:"content"`
	ImageURLs []ImageURL `json:"-"`
}
ChatMessage is a single message in the chat. Content can be either a plain string or an array of content parts (for vision requests with type:"text" and type:"image_url"). Custom JSON unmarshaling is in vision.go.
func (*ChatMessage) UnmarshalJSON ¶ added in v1.5.0
func (m *ChatMessage) UnmarshalJSON(data []byte) error
UnmarshalJSON handles both string and array content formats for ChatMessage. The OpenAI API allows content to be either a plain string or an array of content parts (for vision requests with type:"text" and type:"image_url").
type CompletionChoice ¶
type CompletionChoice struct {
	Index        int    `json:"index"`
	Text         string `json:"text"`
	FinishReason string `json:"finish_reason"`
}
CompletionChoice is a single choice in the completion response.
type CompletionRequest ¶
type CompletionRequest struct {
	Model       string   `json:"model"`
	Prompt      string   `json:"prompt"`
	Temperature *float64 `json:"temperature,omitempty"`
	MaxTokens   *int     `json:"max_tokens,omitempty"`
	Stream      bool     `json:"stream"`
}
CompletionRequest represents the OpenAI completion request.
type CompletionResponse ¶
type CompletionResponse struct {
	ID      string             `json:"id"`
	Object  string             `json:"object"`
	Created int64              `json:"created"`
	Model   string             `json:"model"`
	Choices []CompletionChoice `json:"choices"`
	Usage   UsageInfo          `json:"usage"`
}
CompletionResponse is the non-streaming completion response.
type ContentPart ¶ added in v1.5.0
type ContentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}
ContentPart represents a single element in a multi-part content array.
type EmbeddingObject ¶
type EmbeddingObject struct {
	Object    string    `json:"object"`
	Embedding []float32 `json:"embedding"`
	Index     int       `json:"index"`
}
EmbeddingObject is a single embedding in the response.
type EmbeddingRequest ¶
type EmbeddingRequest struct {
	Model string      `json:"model"`
	Input interface{} `json:"input"` // string or []string
}
EmbeddingRequest represents the OpenAI embeddings request.
type EmbeddingResponse ¶
type EmbeddingResponse struct {
	Object string            `json:"object"`
	Data   []EmbeddingObject `json:"data"`
	Model  string            `json:"model"`
	Usage  UsageInfo         `json:"usage"`
}
EmbeddingResponse is the /v1/embeddings response.
type ImageURL ¶ added in v1.5.0
ImageURL holds the URL and optional detail level for an image content part.
type JSONSchemaFormat ¶
type JSONSchemaFormat struct {
	Name   string          `json:"name"`
	Strict bool            `json:"strict,omitempty"`
	Schema json.RawMessage `json:"schema"`
}
JSONSchemaFormat describes the json_schema object within a response_format request.
type ModelDeleteResponse ¶
type ModelDeleteResponse struct {
	ID      string `json:"id"`
	Object  string `json:"object"`
	Deleted bool   `json:"deleted"`
}
ModelDeleteResponse is the DELETE /v1/models/{id} response.
type ModelListResponse ¶
type ModelListResponse struct {
	Object string        `json:"object"`
	Data   []ModelObject `json:"data"`
}
ModelListResponse is the /v1/models response.
type ModelObject ¶
type ModelObject struct {
	ID           string `json:"id"`
	Object       string `json:"object"`
	Created      int64  `json:"created"`
	OwnedBy      string `json:"owned_by"`
	Architecture string `json:"architecture,omitempty"`
}
ModelObject represents a model in the /v1/models response.
type ResponseFormat ¶
type ResponseFormat struct {
	Type       string            `json:"type"` // "text" | "json_object" | "json_schema"
	JSONSchema *JSONSchemaFormat `json:"json_schema,omitempty"`
}
ResponseFormat controls the output structure of a chat completion.
type Server ¶
type Server struct {
	// contains filtered or unexported fields
}
Server wraps a loaded model and serves OpenAI-compatible HTTP endpoints.
func NewServer ¶
func NewServer(m *inference.Model, opts ...ServerOption) *Server
NewServer creates a Server for the given model.
type ServerMetrics ¶
type ServerMetrics struct {
	// contains filtered or unexported fields
}
ServerMetrics records serving metrics using a runtime.Collector.
func NewServerMetrics ¶
func NewServerMetrics(c runtime.Collector) *ServerMetrics
NewServerMetrics creates a ServerMetrics backed by the given collector.
func (*ServerMetrics) RecordRequest ¶
func (m *ServerMetrics) RecordRequest(tokens int, latency time.Duration)
RecordRequest records a completed request's metrics.
type ServerOption ¶
type ServerOption func(*Server)
ServerOption configures the server.
func WithBatchScheduler ¶
func WithBatchScheduler(bs *BatchScheduler) ServerOption
WithBatchScheduler attaches a batch scheduler for non-streaming requests. When set, incoming completion requests are routed through the scheduler to be grouped into batches for higher throughput.
func WithDraftModel ¶
func WithDraftModel(draft *inference.Model) ServerOption
WithDraftModel enables speculative decoding using the given draft model. When set, completion requests use speculative decode with the draft model proposing tokens and the target model verifying them.
func WithLogger ¶
func WithLogger(l log.Logger) ServerOption
WithLogger sets the logger for request logging.
func WithMetrics ¶
func WithMetrics(c runtime.Collector) ServerOption
WithMetrics sets the metrics collector for token rate and request tracking.
func WithTranscriber ¶ added in v1.5.0
func WithTranscriber(t Transcriber) ServerOption
WithTranscriber sets the audio transcription backend for the /v1/audio/transcriptions endpoint.
type Tool ¶
type Tool struct {
	Type     string       `json:"type"`
	Function ToolFunction `json:"function"`
}
Tool represents an OpenAI-compatible tool definition.
type ToolCall ¶
type ToolCall struct {
	ID       string           `json:"id"`
	Type     string           `json:"type"`
	Function ToolCallFunction `json:"function"`
}
ToolCall represents a tool invocation in the assistant's response.
type ToolCallFunction ¶
ToolCallFunction holds the function name and arguments in a tool call response.
type ToolCallResult ¶
type ToolCallResult struct {
	ID           string // e.g. "call_1234567890"
	FunctionName string
	Arguments    json.RawMessage // raw JSON of arguments
}
ToolCallResult holds a detected tool call.
func DetectToolCall ¶
func DetectToolCall(text string, tools []Tool, choice ToolChoice) (*ToolCallResult, bool)
DetectToolCall examines generated text to determine if it is a tool call. Returns (result, true) if a tool call is detected, (nil, false) otherwise.
Detection heuristic:
- If tool_choice is "none": never detect
- Trim whitespace from text
- If text starts with '{' and parses as valid JSON object: it is a tool call
- If tool_choice forces a specific function: use that name
- Otherwise: look for "name" field in the JSON to match a tool
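The heuristic can be sketched as follows. detect is a simplified, hypothetical re-implementation that only extracts the function name from candidate JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// detect applies the documented heuristic: trimmed text that starts
// with '{' and parses as a JSON object is treated as a tool call, with
// the "name" field selecting the function.
func detect(text string) (name string, ok bool) {
	s := strings.TrimSpace(text)
	if !strings.HasPrefix(s, "{") {
		return "", false // ordinary prose, not a tool call
	}
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(s), &obj); err != nil {
		return "", false // looked like JSON but did not parse
	}
	if raw, found := obj["name"]; found {
		json.Unmarshal(raw, &name)
	}
	return name, true
}

func main() {
	fmt.Println(detect(`{"name":"get_weather","arguments":{"city":"Oslo"}}`))
	fmt.Println(detect("The weather is sunny."))
}
```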
type ToolChoice ¶
type ToolChoice struct {
	Mode     string // "auto", "none", or "function"
	Function *ToolChoiceFunction
}
ToolChoice represents the tool_choice field. It can be the string "auto", "none", or an object {"type":"function","function":{"name":"..."}}.
func (ToolChoice) MarshalJSON ¶
func (tc ToolChoice) MarshalJSON() ([]byte, error)
MarshalJSON encodes ToolChoice back to its JSON representation.
func (*ToolChoice) UnmarshalJSON ¶
func (tc *ToolChoice) UnmarshalJSON(data []byte) error
UnmarshalJSON handles both string and object forms of tool_choice.
type ToolChoiceFunction ¶
type ToolChoiceFunction struct {
	Name string `json:"name"`
}
ToolChoiceFunction identifies a specific function in a tool_choice object.
type ToolFunction ¶
type ToolFunction struct {
	Name        string          `json:"name"`
	Description string          `json:"description"`
	Parameters  json.RawMessage `json:"parameters,omitempty"`
}
ToolFunction holds the function definition within a tool.
type Transcriber ¶ added in v1.5.0
type Transcriber interface {
	Transcribe(ctx context.Context, audio []byte, language string) (string, error)
}
Transcriber converts raw audio bytes into a text transcript.
type TranscriptionResponse ¶ added in v1.5.0
type TranscriptionResponse struct {
	Text string `json:"text"`
}
TranscriptionResponse is the /v1/audio/transcriptions JSON response.
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| agent | Package agent adapts the generate/agent agentic loop to the OpenAI-compatible chat completions API, translating between OpenAI tool definitions and the internal ToolRegistry/Supervisor types. |
| batcher | Package batcher implements a continuous batching scheduler for inference serving. |
| cloud | Package cloud provides multi-tenant namespace isolation for the serving layer. |
| disaggregated | Package disaggregated implements disaggregated prefill/decode serving. |
| disaggregated/proto | Package disaggpb defines the gRPC service contracts for disaggregated prefill/decode serving. |
| registry | Package registry provides a bbolt-backed model version registry for tracking, activating, and managing model versions used by the serving layer. |