serve

package
v1.38.1
Published: Mar 30, 2026 License: Apache-2.0 Imports: 33 Imported by: 0

Documentation

Overview

Package serve provides an OpenAI-compatible HTTP API server for model inference. (Stability: stable)

The server exposes REST endpoints that follow the OpenAI API specification, enabling drop-in compatibility with existing OpenAI client libraries and tools.

Creating a Server

Use NewServer to create a server for a loaded model:

m, err := inference.LoadGGUF("model.gguf")
if err != nil {
	log.Fatal(err)
}
srv := serve.NewServer(m,
	serve.WithLogger(logger),
	serve.WithMetrics(collector),
)
log.Fatal(http.ListenAndServe(":8080", srv.Handler()))

The returned Server is configured through functional options (ServerOption values such as WithLogger and WithMetrics, listed below).

Endpoints

The server registers the following HTTP routes:

POST /v1/chat/completions   Chat completion (streaming and non-streaming)
POST /v1/completions        Text completion (streaming and non-streaming)
POST /v1/embeddings         Text embeddings
GET  /v1/models             List loaded models
GET  /v1/models/{id}        Get model info
DELETE /v1/models/{id}      Unload a model
GET  /openapi.yaml          OpenAPI specification
GET  /metrics               Prometheus metrics

SSE Streaming

When a chat or text completion request sets "stream": true, the server responds with Server-Sent Events (SSE). Each event contains a JSON chunk with incremental tokens. The stream terminates with a "data: [DONE]" sentinel.

Tool Calling

Chat completion requests may include OpenAI-compatible tool definitions. The server validates tool schemas, detects tool calls in model output via DetectToolCall, and returns structured tool_calls in the response. Tool choice can be set to "auto", "none", or forced to a specific function.
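An illustrative request body (field names follow the request types documented below; the get_weather function is hypothetical):

```json
{
  "model": "my-model",
  "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get weather for a city",
      "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}
    }
  }],
  "tool_choice": "auto"
}
```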

Structured Output

Chat completions support response_format with "json_schema" type for grammar-constrained decoding, ensuring model output conforms to a provided JSON Schema.
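An illustrative request body using the json_schema form (field names follow ResponseFormat and JSONSchemaFormat below; the schema itself is hypothetical):

```json
{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Extract the city from: I live in Paris."}],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "city",
      "strict": true,
      "schema": {"type": "object", "properties": {"city": {"type": "string"}}}
    }
  }
}
```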

Batch Scheduling

A BatchScheduler groups incoming non-streaming requests into batches for efficient GPU utilization. Create one with NewBatchScheduler, configure it with a BatchConfig, and pass it to WithBatchScheduler:

bs := serve.NewBatchScheduler(serve.BatchConfig{
	MaxBatchSize: 8,
	BatchTimeout: 10 * time.Millisecond,
})
bs.Start()
defer bs.Stop()
srv := serve.NewServer(m, serve.WithBatchScheduler(bs))

Metrics

The GET /metrics endpoint exposes Prometheus text exposition format metrics:

  • requests_total: total number of completed requests (counter)
  • tokens_generated_total: total tokens generated (counter)
  • tokens_per_second: rolling token generation rate (gauge)
  • request_latency_ms: request latency histogram with configurable buckets

Metrics are collected through the runtime.Collector interface passed via WithMetrics.
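A sample of what the exposition might look like (metric names as listed above; values and omitted histogram buckets are illustrative):

```
# TYPE requests_total counter
requests_total 42
# TYPE tokens_generated_total counter
tokens_generated_total 1337
# TYPE tokens_per_second gauge
tokens_per_second 87.5
```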

Graceful Shutdown

Call Server.Close to gracefully stop the server; Close drains the batch scheduler if one is attached.

Package serve also contains the HTTP handler implementations for the OpenAI-compatible API endpoints, the SSE streaming implementations for chat and text completions, and the request/response types and validation helpers.

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func GuardianMiddleware added in v1.23.0

func GuardianMiddleware(evaluator GuardEvaluator, config GuardianMiddlewareConfig) func(http.Handler) http.Handler

GuardianMiddleware returns HTTP middleware that wraps chat completion requests with Guardian safety checks. If evaluator is nil the middleware is a no-op pass-through.

func ParseModelAdapter added in v1.29.0

func ParseModelAdapter(model string) (baseModel, adapterName string)

ParseModelAdapter splits a model field of the form "base_model:adapter_name" into the base model ID and the adapter name. If no colon is present, the adapter name is empty.
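A stdlib-only sketch of the documented split behavior (this mirrors the contract, splitting at the first colon; it is not the package's implementation, which may differ in detail):

```go
package main

import (
	"fmt"
	"strings"
)

// splitModelAdapter mirrors ParseModelAdapter's documented behavior:
// "base_model:adapter_name" splits into the two parts; with no colon
// the adapter name is empty.
func splitModelAdapter(model string) (base, adapter string) {
	if i := strings.IndexByte(model, ':'); i >= 0 {
		return model[:i], model[i+1:]
	}
	return model, ""
}

func main() {
	b, a := splitModelAdapter("llama-3:french-lora")
	fmt.Println(b, a) // llama-3 french-lora
	b, a = splitModelAdapter("llama-3")
	fmt.Printf("%q %q\n", b, a) // "llama-3" ""
}
```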

func RequestID added in v1.11.0

func RequestID(ctx context.Context) string

RequestID returns the request ID from the context, or an empty string if not set.

Types

type AdapterCacheHandle added in v1.29.0

type AdapterCacheHandle struct {
	// contains filtered or unexported fields
}

AdapterCacheHandle wraps a lora.AdapterCache with the directory where adapter GGUF files are stored.

type BatchConfig

type BatchConfig struct {
	MaxBatchSize int
	BatchTimeout time.Duration
	Handler      BatchHandler
}

BatchConfig configures the batch scheduler.

type BatchHandler

type BatchHandler func(ctx context.Context, reqs []BatchRequest) []BatchResult

BatchHandler processes a batch of requests and returns results. The results slice must have the same length as the requests slice.

type BatchRequest

type BatchRequest struct {
	Prompt string
	Phase  string // "prefill" or "decode"
}

BatchRequest represents a single inference request in a batch.

type BatchResult

type BatchResult struct {
	Value string
	Err   error
}

BatchResult holds the result for a single request in a batch.

type BatchScheduler

type BatchScheduler struct {
	// contains filtered or unexported fields
}

BatchScheduler collects incoming requests into batches for efficient processing.

func NewBatchScheduler

func NewBatchScheduler(config BatchConfig) *BatchScheduler

NewBatchScheduler creates a new batch scheduler.

func (*BatchScheduler) Start

func (s *BatchScheduler) Start()

Start begins the batch collection loop.

func (*BatchScheduler) Stop

func (s *BatchScheduler) Stop()

Stop gracefully shuts down the scheduler.

func (*BatchScheduler) Submit

Submit adds a request to the next batch and waits for the result.

type ChatCompletionChoice

type ChatCompletionChoice struct {
	Index        int         `json:"index"`
	Message      ChatMessage `json:"message"`
	FinishReason string      `json:"finish_reason"`
	ToolCalls    []ToolCall  `json:"tool_calls,omitempty"`
}

ChatCompletionChoice is a single choice in the response.

type ChatCompletionRequest

type ChatCompletionRequest struct {
	Model          string          `json:"model"`
	Messages       []ChatMessage   `json:"messages"`
	Temperature    *float64        `json:"temperature,omitempty"`
	TopP           *float64        `json:"top_p,omitempty"`
	TopK           *int            `json:"top_k,omitempty"`
	MaxTokens      *int            `json:"max_tokens,omitempty"`
	Stream         bool            `json:"stream"`
	Tools          []Tool          `json:"tools,omitempty"`
	ToolChoice     *ToolChoice     `json:"tool_choice,omitempty"`
	ResponseFormat *ResponseFormat `json:"response_format,omitempty"`
}

ChatCompletionRequest represents the OpenAI chat completion request.

type ChatCompletionResponse

type ChatCompletionResponse struct {
	ID      string                 `json:"id"`
	Object  string                 `json:"object"`
	Created int64                  `json:"created"`
	Model   string                 `json:"model"`
	Choices []ChatCompletionChoice `json:"choices"`
	Usage   UsageInfo              `json:"usage"`
}

ChatCompletionResponse is the non-streaming response.

type ChatMessage

type ChatMessage struct {
	Role      string     `json:"role"`
	Content   string     `json:"content"`
	ImageURLs []ImageURL `json:"-"`
}

ChatMessage is a single message in the chat. Content can be either a plain string or an array of content parts (for vision requests with type:"text" and type:"image_url"). Custom JSON unmarshaling is in vision.go.

func (*ChatMessage) UnmarshalJSON added in v1.5.0

func (m *ChatMessage) UnmarshalJSON(data []byte) error

UnmarshalJSON handles both string and array content formats for ChatMessage. The OpenAI API allows content to be either a plain string or an array of content parts (for vision requests with type:"text" and type:"image_url").
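Both accepted forms, illustrated (field names follow ContentPart and ImageURL below; the URL is hypothetical):

```json
{"role": "user", "content": "Describe this image."}
```

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png", "detail": "low"}}
  ]
}
```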

type Classifier added in v1.10.0

type Classifier interface {
	Classify(ctx context.Context, texts []string) ([]sentiment.SentimentResult, error)
}

Classifier abstracts a text classification pipeline for testability.

type ClassifyData added in v1.10.0

type ClassifyData struct {
	Label string  `json:"label"`
	Score float64 `json:"score"`
	Index int     `json:"index"`
}

ClassifyData holds classification output for a single input text.

type ClassifyMetrics added in v1.10.0

type ClassifyMetrics struct {
	// contains filtered or unexported fields
}

ClassifyMetrics records classification-specific Prometheus metrics.

func NewClassifyMetrics added in v1.10.0

func NewClassifyMetrics(c runtime.Collector) *ClassifyMetrics

NewClassifyMetrics creates classification metrics backed by the given collector.

type ClassifyRequest added in v1.10.0

type ClassifyRequest struct {
	Model string   `json:"model"`
	Input []string `json:"input"`
}

ClassifyRequest is the request body for POST /v1/classify.

type ClassifyResponse added in v1.10.0

type ClassifyResponse struct {
	Data  []ClassifyData `json:"data"`
	Model string         `json:"model"`
	Usage ClassifyUsage  `json:"usage"`
}

ClassifyResponse is the response body for POST /v1/classify.

type ClassifyUsage added in v1.10.0

type ClassifyUsage struct {
	TotalTokens int `json:"total_tokens"`
}

ClassifyUsage reports token counts for the classify request.

type CompletionChoice

type CompletionChoice struct {
	Index        int    `json:"index"`
	Text         string `json:"text"`
	FinishReason string `json:"finish_reason"`
}

CompletionChoice is a single choice in the completion response.

type CompletionRequest

type CompletionRequest struct {
	Model       string   `json:"model"`
	Prompt      string   `json:"prompt"`
	Temperature *float64 `json:"temperature,omitempty"`
	TopP        *float64 `json:"top_p,omitempty"`
	TopK        *int     `json:"top_k,omitempty"`
	MaxTokens   *int     `json:"max_tokens,omitempty"`
	Stream      bool     `json:"stream"`
}

CompletionRequest represents the OpenAI completion request.

type CompletionResponse

type CompletionResponse struct {
	ID      string             `json:"id"`
	Object  string             `json:"object"`
	Created int64              `json:"created"`
	Model   string             `json:"model"`
	Choices []CompletionChoice `json:"choices"`
	Usage   UsageInfo          `json:"usage"`
}

CompletionResponse is the non-streaming completion response.

type ContentPart added in v1.5.0

type ContentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *ImageURL `json:"image_url,omitempty"`
}

ContentPart represents a single element in a multi-part content array.

type EmbeddingObject

type EmbeddingObject struct {
	Object    string    `json:"object"`
	Embedding []float32 `json:"embedding"`
	Index     int       `json:"index"`
}

EmbeddingObject is a single embedding in the response.

type EmbeddingRequest

type EmbeddingRequest struct {
	Model string      `json:"model"`
	Input interface{} `json:"input"` // string or []string
}

EmbeddingRequest represents the OpenAI embeddings request.

type EmbeddingResponse

type EmbeddingResponse struct {
	Object string            `json:"object"`
	Data   []EmbeddingObject `json:"data"`
	Model  string            `json:"model"`
	Usage  UsageInfo         `json:"usage"`
}

EmbeddingResponse is the /v1/embeddings response.

type GuardBatchRequest added in v1.22.0

type GuardBatchRequest struct {
	Model  string                   `json:"model"`
	Inputs []guardian.GuardianInput `json:"inputs"`
	Risks  []string                 `json:"risks"`
}

GuardBatchRequest is the request body for POST /v1/guard/batch.

type GuardBatchResponse added in v1.22.0

type GuardBatchResponse struct {
	Model     string             `json:"model"`
	Results   []GuardBatchResult `json:"results"`
	LatencyMs int64              `json:"latency_ms"`
}

GuardBatchResponse is the response body for POST /v1/guard/batch.

type GuardBatchResult added in v1.22.0

type GuardBatchResult struct {
	Index    int           `json:"index"`
	Flagged  bool          `json:"flagged"`
	Verdicts []VerdictData `json:"verdicts"`
}

GuardBatchResult holds verdicts for a single input in a batch.

type GuardEvaluator added in v1.22.0

type GuardEvaluator interface {
	Evaluate(ctx context.Context, req guardian.GuardianRequest) ([]guardian.Verdict, error)
	EvaluateBatch(ctx context.Context, inputs []guardian.GuardianInput, risks []string) (*guardian.BatchResult, error)
	Scan(ctx context.Context, input guardian.GuardianInput) (*guardian.ScanResult, error)
}

GuardEvaluator abstracts a Guardian safety evaluator for testability.

type GuardMetrics added in v1.22.0

type GuardMetrics struct {
	// contains filtered or unexported fields
}

GuardMetrics records Guardian-specific Prometheus metrics.

func NewGuardMetrics added in v1.22.0

func NewGuardMetrics(c runtime.Collector) *GuardMetrics

NewGuardMetrics creates Guardian metrics backed by the given collector.

type GuardRequest added in v1.22.0

type GuardRequest struct {
	Model  string                 `json:"model"`
	Input  guardian.GuardianInput `json:"input"`
	Risks  []string               `json:"risks"`
	Format string                 `json:"format,omitempty"`
	Think  bool                   `json:"think,omitempty"`
}

GuardRequest is the request body for POST /v1/guard.

type GuardResponse added in v1.22.0

type GuardResponse struct {
	Model     string        `json:"model"`
	Flagged   bool          `json:"flagged"`
	Verdicts  []VerdictData `json:"verdicts"`
	LatencyMs int64         `json:"latency_ms"`
}

GuardResponse is the response body for POST /v1/guard.

type GuardScanRequest added in v1.22.0

type GuardScanRequest struct {
	Model string                 `json:"model"`
	Input guardian.GuardianInput `json:"input"`
}

GuardScanRequest is the request body for POST /v1/guard/scan.

type GuardScanResponse added in v1.22.0

type GuardScanResponse struct {
	Model       string        `json:"model"`
	Flagged     bool          `json:"flagged"`
	HighestRisk string        `json:"highest_risk,omitempty"`
	Verdicts    []VerdictData `json:"verdicts"`
	LatencyMs   int64         `json:"latency_ms"`
}

GuardScanResponse is the response body for POST /v1/guard/scan.

type GuardianMiddlewareConfig added in v1.23.0

type GuardianMiddlewareConfig struct {
	Model       string   // Guardian model path
	Risks       []string // risk categories (default: HarmRiskCategories)
	CheckInput  bool     // scan user prompts (default: true)
	CheckOutput bool     // scan assistant responses (default: false)
	BlockOnFlag bool     // return 400 if flagged (default: true)
}

GuardianMiddlewareConfig controls how the Guardian safety middleware intercepts chat completion requests.

type ImageURL added in v1.5.0

type ImageURL struct {
	URL    string `json:"url"`
	Detail string `json:"detail,omitempty"`
}

ImageURL holds the URL and optional detail level for an image content part.

type JSONSchemaFormat

type JSONSchemaFormat struct {
	Name   string          `json:"name"`
	Strict bool            `json:"strict,omitempty"`
	Schema json.RawMessage `json:"schema"`
}

JSONSchemaFormat describes the json_schema object within a response_format request.

type ModelDeleteResponse

type ModelDeleteResponse struct {
	ID      string `json:"id"`
	Object  string `json:"object"`
	Deleted bool   `json:"deleted"`
}

ModelDeleteResponse is the DELETE /v1/models/:id response.

type ModelListResponse

type ModelListResponse struct {
	Object string        `json:"object"`
	Data   []ModelObject `json:"data"`
}

ModelListResponse is the /v1/models response.

type ModelObject

type ModelObject struct {
	ID           string `json:"id"`
	Object       string `json:"object"`
	Created      int64  `json:"created"`
	OwnedBy      string `json:"owned_by"`
	Architecture string `json:"architecture,omitempty"`
}

ModelObject represents a model in the /v1/models response.

type ResponseFormat

type ResponseFormat struct {
	Type       string            `json:"type"` // "text" | "json_object" | "json_schema"
	JSONSchema *JSONSchemaFormat `json:"json_schema,omitempty"`
}

ResponseFormat controls the output structure of a chat completion.

type Server

type Server struct {
	// contains filtered or unexported fields
}

Server wraps a loaded model and serves OpenAI-compatible HTTP endpoints.

func NewServer

func NewServer(m *inference.Model, opts ...ServerOption) *Server

NewServer creates a Server for the given model.

func (*Server) Close

func (s *Server) Close(_ context.Context) error

Close implements shutdown.Closer for graceful shutdown integration.

func (*Server) GPUs added in v1.8.0

func (s *Server) GPUs() []int

GPUs returns the configured GPU IDs, or nil if not set.

func (*Server) Handler

func (s *Server) Handler() http.Handler

Handler returns the HTTP handler for this server.

type ServerMetrics

type ServerMetrics struct {
	// contains filtered or unexported fields
}

ServerMetrics records serving metrics using a runtime.Collector.

func NewServerMetrics

func NewServerMetrics(c runtime.Collector) *ServerMetrics

NewServerMetrics creates a ServerMetrics backed by the given collector.

func (*ServerMetrics) ActiveRequests added in v1.17.0

func (m *ServerMetrics) ActiveRequests() int64

ActiveRequests returns the current number of in-flight requests.

func (*ServerMetrics) DecActiveRequests added in v1.17.0

func (m *ServerMetrics) DecActiveRequests()

DecActiveRequests decrements the active request count and updates the gauge.

func (*ServerMetrics) IncActiveRequests added in v1.17.0

func (m *ServerMetrics) IncActiveRequests()

IncActiveRequests increments the active request count and updates the gauge.

func (*ServerMetrics) RecordError added in v1.17.0

func (m *ServerMetrics) RecordError(endpoint string, statusCode int)

RecordError increments the errors_total counter for the given endpoint and HTTP status code. Labels are encoded in the counter name so that the Prometheus exposition can emit them as {endpoint="...",status_code="..."}.

func (*ServerMetrics) RecordRequest

func (m *ServerMetrics) RecordRequest(tokens int, latency time.Duration)

RecordRequest records a completed request's metrics.

type ServerOption

type ServerOption func(*Server)

ServerOption configures the server.

func WithAPIKey added in v1.11.0

func WithAPIKey(key string) ServerOption

WithAPIKey enables Bearer token authentication on all endpoints except health checks (/healthz, /readyz), metrics (/metrics), and the OpenAPI spec (/openapi.yaml).

func WithAdapterCache added in v1.29.0

func WithAdapterCache(dir string, maxCached int) ServerOption

WithAdapterCache enables per-request LoRA adapter selection. dir is the directory containing adapter GGUF files named <adapter>.gguf. maxCached is the maximum number of adapters to keep in memory.

func WithBatchScheduler

func WithBatchScheduler(bs *BatchScheduler) ServerOption

WithBatchScheduler attaches a batch scheduler for non-streaming requests. When set, incoming completion requests are routed through the scheduler to be grouped into batches for higher throughput.

func WithClassifier added in v1.10.0

func WithClassifier(c Classifier) ServerOption

WithClassifier sets the text classification pipeline for the /v1/classify endpoint.

func WithDraftModel

func WithDraftModel(draft *inference.Model) ServerOption

WithDraftModel enables speculative decoding using the given draft model. When set, completion requests use speculative decode with the draft model proposing tokens and the target model verifying them.

func WithGPUs added in v1.8.0

func WithGPUs(ids []int) ServerOption

WithGPUs sets the GPU IDs to distribute the model across.

func WithGuardEvaluator added in v1.22.0

func WithGuardEvaluator(e GuardEvaluator) ServerOption

WithGuardEvaluator sets the Guardian evaluator for the /v1/guard endpoints.

func WithKeyStore added in v1.12.0

func WithKeyStore(ks *security.KeyStore) ServerOption

WithKeyStore enables scope-based authorization using the provided KeyStore. When set, after Bearer token validation the middleware looks up the key in the store and checks that it has a sufficient scope for the endpoint.

func WithLogger

func WithLogger(l log.Logger) ServerOption

WithLogger sets the logger for request logging.

func WithMaxTokens added in v1.11.0

func WithMaxTokens(n int) ServerOption

WithMaxTokens sets the server-side upper bound for max_tokens in completion requests. Any request asking for more tokens than this limit will be clamped. The default is 8192.

func WithMetrics

func WithMetrics(c runtime.Collector) ServerOption

WithMetrics sets the metrics collector for token rate and request tracking.

func WithRateLimiter added in v1.11.0

func WithRateLimiter(rl *security.RateLimiter) ServerOption

WithRateLimiter enables per-IP rate limiting using the provided RateLimiter. When set, requests that exceed the rate limit receive 429 Too Many Requests.

func WithTranscriber added in v1.5.0

func WithTranscriber(t Transcriber) ServerOption

WithTranscriber sets the audio transcription backend for the /v1/audio/transcriptions endpoint.

func WithTrustedProxies added in v1.12.0

func WithTrustedProxies(proxies []string) ServerOption

WithTrustedProxies configures the set of reverse-proxy IPs whose X-Forwarded-For and X-Real-IP headers are trusted for client-IP extraction. When the rate limiter is enabled, only requests arriving from these addresses will have their forwarding headers honoured; all other requests use RemoteAddr directly.

type Tool

type Tool struct {
	Type     string       `json:"type"`
	Function ToolFunction `json:"function"`
}

Tool represents an OpenAI-compatible tool definition.

type ToolCall

type ToolCall struct {
	ID       string           `json:"id"`
	Type     string           `json:"type"`
	Function ToolCallFunction `json:"function"`
}

ToolCall represents a tool invocation in the assistant's response.

type ToolCallFunction

type ToolCallFunction struct {
	Name      string `json:"name"`
	Arguments string `json:"arguments"`
}

ToolCallFunction holds the function name and arguments in a tool call response.

type ToolCallResult

type ToolCallResult struct {
	ID           string // e.g. "call_1234567890"
	FunctionName string
	Arguments    json.RawMessage // raw JSON of arguments
}

ToolCallResult holds a detected tool call.

func DetectToolCall

func DetectToolCall(text string, tools []Tool, choice ToolChoice) (*ToolCallResult, bool)

DetectToolCall examines generated text to determine if it is a tool call. Returns (result, true) if a tool call is detected, (nil, false) otherwise.

Detection heuristic:

  1. If tool_choice is "none": never detect
  2. Trim whitespace from text
  3. If text starts with '{' and parses as valid JSON object: it is a tool call
  4. If tool_choice forces a specific function: use that name
  5. Otherwise: look for "name" field in the JSON to match a tool
Example
package main

import (
	"encoding/json"
	"fmt"

	"github.com/zerfoo/zerfoo/serve"
)

func main() {
	tools := []serve.Tool{{
		Type: "function",
		Function: serve.ToolFunction{
			Name:        "get_weather",
			Description: "Get weather for a city",
			Parameters:  json.RawMessage(`{"type":"object","properties":{"city":{"type":"string"}}}`),
		},
	}}

	text := `{"name":"get_weather","arguments":{"city":"Paris"}}`
	result, ok := serve.DetectToolCall(text, tools, serve.ToolChoice{Mode: "auto"})
	if ok {
		fmt.Printf("%s %s\n", result.FunctionName, result.Arguments)
	}
}
Output:
get_weather {"city":"Paris"}

type ToolChoice

type ToolChoice struct {
	Mode     string // "auto", "none", or "function"
	Function *ToolChoiceFunction
}

ToolChoice represents the tool_choice field. It can be the string "auto", "none", or an object {"type":"function","function":{"name":"..."}}.

func (ToolChoice) MarshalJSON

func (tc ToolChoice) MarshalJSON() ([]byte, error)

MarshalJSON encodes ToolChoice back to its JSON representation.

func (*ToolChoice) UnmarshalJSON

func (tc *ToolChoice) UnmarshalJSON(data []byte) error

UnmarshalJSON handles both string and object forms of tool_choice.

type ToolChoiceFunction

type ToolChoiceFunction struct {
	Name string `json:"name"`
}

ToolChoiceFunction identifies a specific function in a tool_choice object.

type ToolFunction

type ToolFunction struct {
	Name        string          `json:"name"`
	Description string          `json:"description"`
	Parameters  json.RawMessage `json:"parameters,omitempty"`
}

ToolFunction holds the function definition within a tool.

type Transcriber added in v1.5.0

type Transcriber interface {
	Transcribe(ctx context.Context, audio []byte, language string) (string, error)
}

Transcriber converts raw audio bytes into a text transcript.

type TranscriptionResponse added in v1.5.0

type TranscriptionResponse struct {
	Text string `json:"text"`
}

TranscriptionResponse is the /v1/audio/transcriptions JSON response.

type UsageInfo

type UsageInfo struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

UsageInfo reports token counts.

type VerdictData added in v1.22.0

type VerdictData struct {
	Risk       string  `json:"risk"`
	Unsafe     bool    `json:"unsafe"`
	Confidence float64 `json:"confidence"`
	Reasoning  string  `json:"reasoning"`
}

VerdictData holds a single risk verdict in the API response.

Directories

Path                 Synopsis
adaptive             Package adaptive implements an adaptive batch scheduler that dynamically adjusts batch size based on queue depth and latency targets to maximize throughput while meeting latency SLOs.
agent                Package agent adapts the generate/agent agentic loop to the serving layer.
batcher              Package batcher implements a continuous batching scheduler for inference serving.
disaggregated        Package disaggregated implements disaggregated prefill/decode serving.
disaggregated/proto  Package disaggpb defines the gRPC service contracts for disaggregated prefill/decode serving.
multimodel           Package multimodel provides a ModelManager that loads and unloads models on demand with LRU eviction when GPU memory budget is exceeded.
operator             Package operator provides a Kubernetes operator for managing ZerfooInferenceService custom resources.
registry             Package registry provides a bbolt-backed model version registry for tracking and A/B testing.
repository           Package repository provides a model repository for storing and managing GGUF model files.
