Documentation ¶
Overview ¶
Package vllm wraps the OpenAI-compatible HTTP surface exposed by `vllm serve` (https://docs.vllm.ai/). vLLM is a high-throughput inference server with one behavior worth distinguishing in the catalog stack: by default it applies the target model's Hugging Face `generation_config.json` when the client omits sampler fields. Most other local servers we wrap (omlx, lmstudio, lucebox) cannot do this: MLX/GGUF repackaging typically drops `generation_config.json` from the bundle, and those servers ship their own presets instead.
The implication for ADR-007's catalog-stale nudge: when a vLLM-served request omits sampler fields, the user is not "decoding greedy" — the server is honoring the model creator's recommended bundle. The CLI reflects that with a softer message.
Capabilities mirror lmstudio's (Tools, Stream, and StructuredOutput all true) and add ImplicitGenerationConfig=true. Reasoning support is model-dependent and not declared at the provider level; per-model thinking-mode controls live in the catalog ModelEntry, matching the lmstudio precedent.
Default port 8000 follows the vLLM docs. Auth is optional: vLLM accepts unauthenticated requests by default and enforces a key only when the operator sets one via --api-key (or VLLM_API_KEY). The Config.APIKey field flows through unchanged.
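The optional-auth contract reduces to one rule: attach a Bearer token only when a key is configured. A minimal sketch, using a hypothetical `newVLLMRequest` helper (not part of this package's API):

```go
package main

import (
	"fmt"
	"net/http"
)

// newVLLMRequest builds a request against a vLLM OpenAI-compatible endpoint,
// attaching a Bearer token only when an API key is configured. This mirrors
// how Config.APIKey flows through: an empty key means unauthenticated,
// matching vLLM's default when the operator never passes --api-key.
func newVLLMRequest(baseURL, path, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+path, nil)
	if err != nil {
		return nil, err
	}
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	req, _ := newVLLMRequest("http://localhost:8000/v1", "/models", "secret")
	fmt.Println(req.Header.Get("Authorization"))
	// Bearer secret
}
```

Leaving the header off entirely (rather than sending an empty `Authorization`) keeps unauthenticated deployments working, since some servers reject malformed auth headers outright.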
Index ¶
Constants ¶
const DefaultBaseURL = "http://localhost:8000/v1"
Variables ¶
var ProtocolCapabilities = openai.ProtocolCapabilities{
	Tools:                    true,
	Stream:                   true,
	StructuredOutput:         true,
	ImplicitGenerationConfig: true,
}
ProtocolCapabilities mirrors lmstudio's openai-compat surface and adds ImplicitGenerationConfig=true so the catalog-stale nudge can soften.
Functions ¶
Types ¶
type UtilizationProbe ¶ added in v0.10.9
type UtilizationProbe struct {
// contains filtered or unexported fields
}
UtilizationProbe queries vLLM server-root observability endpoints and normalizes them into the shared endpoint utilization shape.
func NewUtilizationProbe ¶ added in v0.10.9
func NewUtilizationProbe(baseURL string, client *http.Client) *UtilizationProbe
NewUtilizationProbe creates a probe for an OpenAI-compatible vLLM base URL.
func (*UtilizationProbe) Probe ¶ added in v0.10.9
func (p *UtilizationProbe) Probe(ctx context.Context) utilization.EndpointUtilization
Probe fetches /metrics from the server root and returns a normalized sample. Failures yield a stale or unknown utilization value rather than an error, so endpoint unavailability is never surfaced to callers.
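The /metrics payload is Prometheus text exposition, so normalization amounts to picking gauge values out of it. An illustrative sketch of that step, not the package's actual parser, assuming typical vLLM metric names such as `vllm:num_requests_running` and `vllm:gpu_cache_usage_perc`:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseGaugeMetrics extracts simple gauge values from Prometheus text
// exposition, keyed by metric name with any label set stripped. This is a
// sketch of the kind of normalization Probe applies to the vLLM /metrics
// payload before shaping it into utilization.EndpointUtilization.
func parseGaugeMetrics(body string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blanks and # HELP / # TYPE comment lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		name := fields[0]
		// Drop the {label="..."} suffix so samples key by bare metric name.
		if i := strings.IndexByte(name, '{'); i >= 0 {
			name = name[:i]
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out[name] = v
	}
	return out
}

func main() {
	sample := `# HELP vllm:num_requests_running Number of requests currently running.
vllm:num_requests_running{model_name="qwen"} 2
vllm:gpu_cache_usage_perc{model_name="qwen"} 0.35`
	m := parseGaugeMetrics(sample)
	fmt.Println(m["vllm:num_requests_running"], m["vllm:gpu_cache_usage_perc"])
	// 2 0.35
}
```

Malformed lines are skipped rather than treated as errors, consistent with Probe's contract of degrading to stale or unknown utilization instead of failing.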