Documentation ¶
Overview ¶
Package cloud provides multi-tenant namespace isolation for the serving layer.
Stability: alpha
Index ¶
- func BillingMiddleware(recorder UsageRecorder) func(http.Handler) http.Handler
- type ModelInfo
- type NDJSONRecorder
- type ResourceManager
- func (rm *ResourceManager) Evict(modelID string) error
- func (rm *ResourceManager) Load(modelID string, vramBytes uint64) error
- func (rm *ResourceManager) LoadedModels() []ModelInfo
- func (rm *ResourceManager) SetEvictCallback(fn func(modelID string))
- func (rm *ResourceManager) Stats() (used, budget uint64, loaded int)
- func (rm *ResourceManager) Touch(modelID string) error
- type Tenant
- type TenantConfig
- type TenantRegistry
- type UsageEvent
- type UsageRecorder
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BillingMiddleware ¶
func BillingMiddleware(recorder UsageRecorder) func(http.Handler) http.Handler
BillingMiddleware returns an HTTP middleware that meters prompt and completion tokens per request and publishes usage events to the given recorder. It expects the TenantRegistry middleware to run first so that TenantFromContext returns a valid tenant. The tenant ID is taken from the apiKeyHeader value.
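The ordering requirement can be illustrated with a minimal, self-contained sketch. It does not use this package: tenantMiddleware and billingMiddleware are hypothetical stand-ins for (*TenantRegistry).Middleware and the handler returned by BillingMiddleware, and ctxKey is an assumed context key. It only shows that the outermost wrapper runs first.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ctxKey is a hypothetical context key; the real package's key is unexported.
type ctxKey struct{}

// tenantMiddleware stands in for (*TenantRegistry).Middleware: it stores a
// tenant ID in the request context before calling next.
func tenantMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := context.WithValue(r.Context(), ctxKey{}, "tenant-a")
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// billingMiddleware stands in for the result of BillingMiddleware: it records
// which tenant ID (if any) it finds in the context.
func billingMiddleware(seen *[]string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			id, _ := r.Context().Value(ctxKey{}).(string)
			*seen = append(*seen, id)
			next.ServeHTTP(w, r)
		})
	}
}

// serveOnce sends a single in-memory request through h.
func serveOnce(h http.Handler) {
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
}

// demo runs the correct and the reversed wrapping order and returns the
// tenant IDs the billing stand-in observed.
func demo() []string {
	var seen []string
	final := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})
	// Correct: the tenant middleware is outermost, so it runs first.
	serveOnce(tenantMiddleware(billingMiddleware(&seen)(final)))
	// Reversed: billing runs before any tenant is attached.
	serveOnce(billingMiddleware(&seen)(tenantMiddleware(final)))
	return seen
}

func main() {
	fmt.Printf("%q\n", demo())
}
```

In the reversed order the billing stand-in sees an empty tenant ID, which is the situation the documentation warns against.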
Types ¶
type NDJSONRecorder ¶
type NDJSONRecorder struct {
// contains filtered or unexported fields
}
NDJSONRecorder writes usage events as newline-delimited JSON to an io.Writer.
func NewNDJSONRecorder ¶
func NewNDJSONRecorder(w io.Writer) *NDJSONRecorder
NewNDJSONRecorder creates a recorder that writes NDJSON to w.
func (*NDJSONRecorder) Record ¶
func (r *NDJSONRecorder) Record(event UsageEvent) error
Record serializes the event as a single JSON line followed by a newline.
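As an illustration of the one-event-per-line format, here is a minimal sketch that reimplements the documented behavior with the standard library. lineRecorder and encodeLine are hypothetical names, not part of this package, and the sample values (including a Unix-seconds timestamp) are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// UsageEvent mirrors the struct documented in this package.
type UsageEvent struct {
	TenantID         string `json:"tenant_id"`
	Model            string `json:"model"`
	PromptTokens     int    `json:"prompt_tokens"`
	CompletionTokens int    `json:"completion_tokens"`
	Timestamp        int64  `json:"timestamp"`
}

// lineRecorder is a hypothetical stand-in for NDJSONRecorder.
type lineRecorder struct{ w io.Writer }

// Record writes the event as one JSON object followed by a newline.
func (r *lineRecorder) Record(event UsageEvent) error {
	b, err := json.Marshal(event)
	if err != nil {
		return err
	}
	_, err = r.w.Write(append(b, '\n'))
	return err
}

// encodeLine records a single event into a buffer and returns the bytes written.
func encodeLine(event UsageEvent) (string, error) {
	var buf bytes.Buffer
	rec := &lineRecorder{w: &buf}
	if err := rec.Record(event); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	line, err := encodeLine(UsageEvent{
		TenantID:         "tenant-a",
		Model:            "example-model",
		PromptTokens:     12,
		CompletionTokens: 34,
		Timestamp:        1700000000, // assumed to be Unix seconds
	})
	if err != nil {
		fmt.Println("record failed:", err)
		return
	}
	fmt.Print(line)
}
```

Because each event occupies exactly one line, consumers can process the stream with a line scanner and decode each line independently.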
type ResourceManager ¶
type ResourceManager struct {
// contains filtered or unexported fields
}
ResourceManager tracks loaded models and their VRAM usage, evicting least-recently-used models when a new load would exceed the memory budget.
func NewResourceManager ¶
func NewResourceManager(budgetBytes uint64) (*ResourceManager, error)
NewResourceManager creates a ResourceManager with the given VRAM budget in bytes.
func (*ResourceManager) Evict ¶
func (rm *ResourceManager) Evict(modelID string) error
Evict explicitly removes a model from the manager.
func (*ResourceManager) Load ¶
func (rm *ResourceManager) Load(modelID string, vramBytes uint64) error
Load registers a model with the given VRAM footprint. If loading would exceed the budget, least-recently-used models are evicted until there is enough space. Load returns an error if the model alone exceeds the entire budget.
func (*ResourceManager) LoadedModels ¶
func (rm *ResourceManager) LoadedModels() []ModelInfo
LoadedModels returns a snapshot of all currently loaded models.
func (*ResourceManager) SetEvictCallback ¶
func (rm *ResourceManager) SetEvictCallback(fn func(modelID string))
SetEvictCallback sets an optional function called when a model is evicted.
func (*ResourceManager) Stats ¶
func (rm *ResourceManager) Stats() (used, budget uint64, loaded int)
Stats returns the current memory usage statistics.
func (*ResourceManager) Touch ¶
func (rm *ResourceManager) Touch(modelID string) error
Touch updates the last-used time for a model, moving it to the front of the LRU list. Call this on each inference request.
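The interaction between Load, Touch, and the evict callback can be sketched as a budgeted LRU list. This is an illustrative reimplementation under stated assumptions, not the package's code: lruBudget, entry, load, and touch are hypothetical names, the sketch is not safe for concurrent use, and rejecting a duplicate load is an assumed behavior.

```go
package main

import (
	"container/list"
	"errors"
	"fmt"
)

// entry is one loaded model and its VRAM footprint in bytes.
type entry struct {
	id   string
	vram uint64
}

// lruBudget is a hypothetical stand-in for ResourceManager.
type lruBudget struct {
	budget  uint64
	used    uint64
	order   *list.List               // front = most recently used
	items   map[string]*list.Element // model ID -> node in order
	onEvict func(modelID string)     // optional, like SetEvictCallback
}

func newLRUBudget(budget uint64) *lruBudget {
	return &lruBudget{budget: budget, order: list.New(), items: map[string]*list.Element{}}
}

// load registers a model, evicting from the back of the list until it fits.
func (b *lruBudget) load(id string, vram uint64) error {
	if vram > b.budget {
		return errors.New("model exceeds the entire budget")
	}
	if _, ok := b.items[id]; ok {
		return errors.New("model already loaded") // assumed behavior
	}
	for b.used+vram > b.budget {
		e := b.order.Remove(b.order.Back()).(entry)
		delete(b.items, e.id)
		b.used -= e.vram
		if b.onEvict != nil {
			b.onEvict(e.id)
		}
	}
	b.items[id] = b.order.PushFront(entry{id: id, vram: vram})
	b.used += vram
	return nil
}

// touch marks a model as most recently used.
func (b *lruBudget) touch(id string) error {
	el, ok := b.items[id]
	if !ok {
		return errors.New("model not loaded")
	}
	b.order.MoveToFront(el)
	return nil
}

func main() {
	b := newLRUBudget(10)
	b.onEvict = func(id string) { fmt.Println("evicted", id) }
	_ = b.load("a", 4)
	_ = b.load("b", 4)
	_ = b.touch("a")   // "b" becomes the least recently used model
	_ = b.load("c", 4) // 12 > 10, so "b" is evicted to make room
	fmt.Println(b.load("huge", 11) != nil)
}
```

Touching "a" before loading "c" is what saves it from eviction, which is why the documentation recommends calling Touch on each inference request.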
type Tenant ¶
type Tenant struct {
Config TenantConfig
// contains filtered or unexported fields
}
Tenant represents a registered tenant with its quota state.
func TenantFromContext ¶
TenantFromContext extracts the Tenant from the request context.
func (*Tenant) ConsumeTokens ¶
ConsumeTokens attempts to consume n tokens from the tenant's per-minute budget. It returns true if the tokens were consumed, false if the budget is exhausted.
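One way to picture a per-minute budget is a fixed-window counter. The sketch below is an assumption about how such a budget could work, not a description of this package's algorithm (which may use a sliding window or token bucket instead). minuteBudget, consume, and fakeClock are hypothetical names, and the int64 parameter type is assumed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeClock is a manually advanced clock that keeps the example deterministic.
type fakeClock struct{ t time.Time }

func (c *fakeClock) now() time.Time          { return c.t }
func (c *fakeClock) advance(d time.Duration) { c.t = c.t.Add(d) }

// minuteBudget is a hypothetical fixed-window token budget.
type minuteBudget struct {
	mu          sync.Mutex
	limit       int64
	used        int64
	windowStart time.Time
	now         func() time.Time
}

func newMinuteBudget(limit int64, now func() time.Time) *minuteBudget {
	return &minuteBudget{limit: limit, windowStart: now(), now: now}
}

// consume takes n tokens from the current window. It returns false, and
// consumes nothing, if that would exceed the limit.
func (b *minuteBudget) consume(n int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	t := b.now()
	if t.Sub(b.windowStart) >= time.Minute {
		// A new window starts and the counter resets.
		b.windowStart = t
		b.used = 0
	}
	if n < 0 || b.used+n > b.limit {
		return false
	}
	b.used += n
	return true
}

func main() {
	clock := &fakeClock{}
	b := newMinuteBudget(100, clock.now)
	fmt.Println(b.consume(60)) // fits in the window
	fmt.Println(b.consume(50)) // would exceed 100, rejected
	clock.advance(61 * time.Second)
	fmt.Println(b.consume(50)) // new window, accepted
}
```

A rejected call leaves the counter unchanged, matching the documented all-or-nothing return value.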
func (*Tenant) ModelAllowed ¶
ModelAllowed returns true if the model is in the tenant's allow list. An empty allow list permits all models.
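The allow-list rule is small enough to show directly. modelAllowed below is a hypothetical free function written for illustration; exact, case-sensitive matching is an assumption, and the model names are made up.

```go
package main

import "fmt"

// modelAllowed reports whether model may be used given allowList.
// An empty (or nil) list permits every model.
func modelAllowed(allowList []string, model string) bool {
	if len(allowList) == 0 {
		return true
	}
	for _, m := range allowList {
		if m == model { // assumed: exact, case-sensitive match
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(modelAllowed(nil, "model-x"))
	fmt.Println(modelAllowed([]string{"model-x"}, "model-x"))
	fmt.Println(modelAllowed([]string{"model-x"}, "model-y"))
}
```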
type TenantConfig ¶
type TenantConfig struct {
MaxConcurrentRequests int `json:"max_concurrent_requests"`
MaxTokensPerMinute int64 `json:"max_tokens_per_minute"`
ModelAllowList []string `json:"model_allow_list,omitempty"`
}
TenantConfig holds per-tenant quota configuration.
func (TenantConfig) Validate ¶
func (c TenantConfig) Validate() error
Validate checks that the configuration has valid values.
type TenantRegistry ¶
type TenantRegistry struct {
// contains filtered or unexported fields
}
TenantRegistry manages per-API-key tenant registrations and quotas.
func NewTenantRegistry ¶
func NewTenantRegistry() *TenantRegistry
NewTenantRegistry creates a new empty registry.
func (*TenantRegistry) Get ¶
func (r *TenantRegistry) Get(apiKey string) (*Tenant, error)
Get retrieves the tenant for the given API key.
func (*TenantRegistry) Middleware ¶
func (r *TenantRegistry) Middleware(next http.Handler) http.Handler
Middleware returns an HTTP middleware that enforces tenant isolation. It extracts the API key from the Authorization header (Bearer <key>), enforces the tenant's concurrency and token rate limits, and injects the Tenant into the request context.
func (*TenantRegistry) Register ¶
func (r *TenantRegistry) Register(apiKey string, cfg TenantConfig) error
Register adds a tenant with the given API key and configuration.
func (*TenantRegistry) Remove ¶
func (r *TenantRegistry) Remove(apiKey string) error
Remove deletes a tenant registration.
type UsageEvent ¶
type UsageEvent struct {
TenantID string `json:"tenant_id"`
Model string `json:"model"`
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
Timestamp int64 `json:"timestamp"`
}
UsageEvent records token consumption for a single request.
type UsageRecorder ¶
type UsageRecorder interface {
Record(event UsageEvent) error
}
UsageRecorder defines the interface for recording usage events. The default implementation writes NDJSON; a Kafka adapter can implement this interface for production deployments.
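Because UsageRecorder has a single method, alternative backends are straightforward to write. The sketch below is a hypothetical in-memory recorder that sums tokens per tenant; totalsRecorder and total are invented names, and the interface and event types are copied from this page so the example compiles on its own. A Kafka adapter would follow the same shape but is not shown, since its client API is outside the scope of this documentation.

```go
package main

import (
	"fmt"
	"sync"
)

// UsageEvent mirrors the struct documented in this package.
type UsageEvent struct {
	TenantID         string `json:"tenant_id"`
	Model            string `json:"model"`
	PromptTokens     int    `json:"prompt_tokens"`
	CompletionTokens int    `json:"completion_tokens"`
	Timestamp        int64  `json:"timestamp"`
}

// UsageRecorder mirrors the interface documented in this package.
type UsageRecorder interface {
	Record(event UsageEvent) error
}

// totalsRecorder is a hypothetical recorder that keeps a running token
// total per tenant in memory.
type totalsRecorder struct {
	mu     sync.Mutex
	totals map[string]int
}

// Compile-time check that totalsRecorder satisfies the interface.
var _ UsageRecorder = (*totalsRecorder)(nil)

// Record adds the event's prompt and completion tokens to the tenant's total.
func (r *totalsRecorder) Record(event UsageEvent) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.totals == nil {
		r.totals = make(map[string]int)
	}
	r.totals[event.TenantID] += event.PromptTokens + event.CompletionTokens
	return nil
}

// total returns the tokens recorded so far for a tenant.
func (r *totalsRecorder) total(tenantID string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.totals[tenantID]
}

func main() {
	rec := &totalsRecorder{}
	_ = rec.Record(UsageEvent{TenantID: "tenant-a", PromptTokens: 10, CompletionTokens: 5})
	_ = rec.Record(UsageEvent{TenantID: "tenant-a", PromptTokens: 3, CompletionTokens: 2})
	fmt.Println(rec.total("tenant-a"))
}
```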