Overview ¶
Package capacity is modeld's hardware capacity planner: it resolves the EFFECTIVE context window a model can actually be served at on this device, from the model's KV-cache footprint and the device's free memory — not the model's trained ceiling alone. modeld owns this calculation because it owns the backend process and hardware telemetry; the runtime consumes the resolved value and does not inspect model files.
Constants ¶
const DefaultHeadroomFrac = 0.1
DefaultHeadroomFrac is the fraction of free memory reserved for activations, the compute graph, and fragmentation; the remainder is available for model weights and the KV cache.
const DefaultHostColdFrac = 0.25
DefaultHostColdFrac is the launch-time cap for the host-RAM KV cold store when the user did not set one explicitly.
const DefaultMaxResidentFrac = 0.8
DefaultMaxResidentFrac caps modeld's resident footprint at this fraction of the device's CURRENTLY free memory when the user did not set an explicit ceiling. It is evaluated fresh on every resolution, so the budget tracks the device live instead of freezing a launch-time view.
Functions ¶
func HeadroomFromEnv ¶
func HeadroomFromEnv() float64
HeadroomFromEnv reads CONTENOX_MODELD_MEM_HEADROOM (a fraction in (0,1)), falling back to DefaultHeadroomFrac.
func KVBytesPerToken ¶
KVBytesPerToken is the memory one token of context costs in the KV cache: K and V, across every layer and KV head, at the KV precision.
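As a rough illustration of that cost, the per-token footprint is usually the product of the factors named above. The sketch below is an assumption about the arithmetic, not the package's signature: `kvBytesPerToken` and its parameters (including the per-head dimension `headDim`, which the description does not name) are hypothetical.

```go
package main

import "fmt"

// kvBytesPerToken is a hypothetical sketch: 2 tensors (K and V) per layer,
// each holding kvHeads * headDim elements at bytesPerElem precision.
func kvBytesPerToken(layers, kvHeads, headDim, bytesPerElem int64) int64 {
	return 2 * layers * kvHeads * headDim * bytesPerElem
}

func main() {
	// Illustrative shape only: 32 layers, 8 KV heads, head dimension 128,
	// 2-byte (fp16) KV precision.
	fmt.Println(kvBytesPerToken(32, 8, 128, 2)) // prints 131072 (128 KiB per token)
}
```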
func ParseBytes ¶ added in v0.32.4
ParseBytes parses byte strings used by modeld memory settings.
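For orientation, a parser for strings such as "8GiB" or "512MiB" might look like the sketch below. The function name `parseBytes`, the set of accepted suffixes, and the error behavior are all assumptions; the real ParseBytes may accept other forms, and this sketch does not guard against overflow.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBytes is a hypothetical sketch: an integer optionally followed by a
// binary unit suffix. A bare integer is treated as a byte count.
func parseBytes(s string) (int64, error) {
	s = strings.TrimSpace(s)
	units := []struct {
		suffix string
		mult   int64
	}{
		{"KiB", 1 << 10}, {"MiB", 1 << 20}, {"GiB", 1 << 30}, {"TiB", 1 << 40},
	}
	mult := int64(1)
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			mult = u.mult
			s = strings.TrimSpace(strings.TrimSuffix(s, u.suffix))
			break
		}
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse bytes %q: %w", s, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("parse bytes: negative size %d", n)
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"8GiB", "512MiB", "1024"} {
		n, err := parseBytes(in)
		fmt.Println(in, n, err)
	}
}
```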
Types ¶
type DeviceSnapshot ¶ added in v0.32.4
type DeviceSnapshot struct {
Kind string `json:"kind,omitempty"`
DeviceID string `json:"device_id,omitempty"`
TotalBytes int64 `json:"total_bytes,omitempty"`
FreeBytes int64 `json:"free_bytes,omitempty"`
}
DeviceSnapshot describes the memory pool the backend will allocate from.
func Snapshot ¶ added in v0.32.4
func Snapshot(src MemorySource) (DeviceSnapshot, error)
Snapshot returns a DeviceSnapshot from either a richer source that implements Snapshot or a legacy source that reports only FreeBytes.
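One common Go way to support both kinds of source is an optional-interface type assertion. The sketch below assumes method shapes that this page does not show: `memorySource`, `snapshotter`, and both method signatures are hypothetical stand-ins for the package's real types.

```go
package main

import "fmt"

// deviceSnapshot is a trimmed local copy of the documented struct.
type deviceSnapshot struct {
	Kind       string
	TotalBytes int64
	FreeBytes  int64
}

// memorySource is the assumed legacy shape: free bytes only.
type memorySource interface {
	FreeBytes() (int64, error)
}

// snapshotter is the assumed richer shape.
type snapshotter interface {
	Snapshot() (deviceSnapshot, error)
}

// snapshot prefers the richer method and falls back to FreeBytes.
func snapshot(src memorySource) (deviceSnapshot, error) {
	if s, ok := src.(snapshotter); ok {
		return s.Snapshot()
	}
	free, err := src.FreeBytes()
	if err != nil {
		return deviceSnapshot{}, err
	}
	return deviceSnapshot{FreeBytes: free}, nil
}

type legacyRAM struct{ free int64 }

func (l legacyRAM) FreeBytes() (int64, error) { return l.free, nil }

type richRAM struct{ total, free int64 }

func (r richRAM) FreeBytes() (int64, error) { return r.free, nil }
func (r richRAM) Snapshot() (deviceSnapshot, error) {
	return deviceSnapshot{Kind: "cpu", TotalBytes: r.total, FreeBytes: r.free}, nil
}

func main() {
	fmt.Println(snapshot(legacyRAM{free: 42}))
	fmt.Println(snapshot(richRAM{total: 100, free: 42}))
}
```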
type MemorySource ¶
MemorySource reports the free memory of the device a backend serves on. modeld picks the source by device: system RAM for CPU; GPU VRAM (ov::Core / ggml) is a CGO seam filled per backend when a GPU device is selected.
type ModelCapacity ¶
type ModelCapacity struct {
ModelMaxContext int
EffectiveContext int
MemoryContextTokens int
HotContextTokens int
PlannerEffectiveContext int
KVBytesPerToken int64
FreeBytes int64
WeightsBytes int64
OverheadBytes int64
ReservedBytes int64
UserLimitBytes int64
MinFreeBytes int64
HostColdBudgetBytes int64
UsableBytes int64
RequiredBytes int64
Clamped bool
Reason string
}
ModelCapacity is the resolved result reported to the runtime. EffectiveContext remains the dense context window modeld will actually serve today and the value the cache identity must use. MemoryContextTokens is the raw KV-token budget from memory before model/request clamping. HotContextTokens is the physical hot KV budget. PlannerEffectiveContext is the logical planner window: it equals the dense window when no host cold budget exists, and can grow by the cold KV token budget once host offload is configured.
func Resolve ¶
func Resolve(p Params) ModelCapacity
Resolve computes the dense compatibility window, physical hot context budget, and logical planner window:
usable = min(free - minFree, userLimit - reserved) * (1 - headroom)
effective = clamp(request, 0, min(modelMax, (usable - weights - overhead) / kvBytesPerToken))
Unknown inputs degrade gracefully: with no KV cost it falls back to the model ceiling (clamped by request); with no ceiling it uses the memory budget.
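The dense-window formula and its fallbacks can be traced with a small standalone sketch. It is an approximation written from the description on this page, not the package's code: `params` and `resolveSketch` are hypothetical names, only the dense effective window is computed (no hot or planner windows), and returning 0 when both the model ceiling and the KV cost are unknown is an assumption.

```go
package main

import "fmt"

// params is a hypothetical, trimmed stand-in for the documented Params.
type params struct {
	modelMax, request          int
	kvBytesPerToken            int64
	weights, overhead          int64
	free, reserved             int64
	userLimit, minFree         int64
	headroom                   float64
}

// resolveSketch follows the documented formula for the dense window.
func resolveSketch(p params) int {
	h := p.headroom
	if h <= 0 || h >= 1 {
		h = 0.1 // documented DefaultHeadroomFrac
	}
	avail := p.free - p.minFree
	if p.userLimit > 0 { // 0 means no user cap
		if capLeft := p.userLimit - p.reserved; capLeft < avail {
			avail = capLeft
		}
	}
	usable := int64(float64(avail) * (1 - h))

	limit := p.modelMax // 0 means the ceiling is unknown
	if p.kvBytesPerToken > 0 {
		memTokens := 0
		if kvBudget := usable - p.weights - p.overhead; kvBudget > 0 {
			memTokens = int(kvBudget / p.kvBytesPerToken)
		}
		if limit == 0 || memTokens < limit {
			limit = memTokens
		}
	}
	if p.request > 0 && p.request < limit {
		return p.request
	}
	return limit
}

func main() {
	const gib = int64(1) << 30
	p := params{
		modelMax: 131072, kvBytesPerToken: 131072,
		weights: 8 * gib, overhead: gib / 2,
		free: 16 * gib, minFree: 2 * gib, headroom: 0.1,
	}
	fmt.Println("effective context:", resolveSketch(p))
}
```

With these illustrative numbers the memory budget, not the model ceiling, is the binding limit, which is the case the planner exists to catch.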
type Params ¶
type Params struct {
ModelMaxCtx int // model's trained context ceiling (0 = unknown)
KVBytesPerToken int64 // 0 = unknown (cannot budget by memory)
WeightsBytes int64 // resident model weight footprint
OverheadBytes int64 // fixed runtime buffers (compute graph, staging)
FreeBytes int64 // device free memory
ReservedBytes int64 // memory already reserved by resident sessions
UserLimitBytes int64 // user cap for modeld resident memory (0 = no cap)
MinFreeBytes int64 // memory to leave free for the desktop/other workloads
HostColdBudgetBytes int64 // host-RAM budget for cold KV blocks (0 = none)
Request int // requested window (0 = use the resolved max)
HeadroomFrac float64 // <=0 or >=1 falls back to DefaultHeadroomFrac
}
Params are the inputs to a capacity resolution. Zero values mean "unknown": an unknown ModelMaxCtx or KVBytesPerToken disables that side of the clamp rather than producing a bogus window.
type Policy ¶ added in v0.32.4
type Policy struct {
MaxResidentBytes int64 `json:"max_resident_bytes,omitempty"`
MinFreeBytes int64 `json:"min_free_bytes,omitempty"`
HostColdBudgetBytes int64 `json:"host_cold_budget_bytes,omitempty"`
HeadroomFrac float64 `json:"headroom_frac,omitempty"`
}
Policy is the user/operator memory policy modeld applies before opening a resident session. MaxResidentBytes is a hard ceiling on modeld's resident footprint for the served device; MinFreeBytes preserves memory for the desktop or other local workloads that may share the same device.
func LoadPolicy ¶ added in v0.32.4
LoadPolicy reads <dataRoot>/modeld.json and then applies env overrides. The JSON accepts either numeric byte fields or string fields ("8GiB", "512MiB"):
{"memory":{"max_resident":"8GiB","reserve_free":"2GiB","headroom_frac":0.15}}
func WithHostColdDefaults ¶ added in v0.32.6
func WithHostColdDefaults(p Policy, host DeviceSnapshot) Policy
WithHostColdDefaults fills the host-RAM cold-store budget from a host memory snapshot. It is separate from WithResidentDefault because the hot model budget may come from VRAM while the cold store always lives in host RAM.
func WithResidentDefault ¶ added in v0.32.6
func WithResidentDefault(p Policy, dev DeviceSnapshot) Policy
WithResidentDefault fills a missing resident-memory cap from the device's CURRENT free memory. Services call it with a fresh snapshot on every resolution, so the default tracks the device live — it rises when memory frees up and falls when other workloads claim it. An explicit MaxResidentBytes (the user's hard cap) always wins and is left untouched.
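The "explicit cap wins, otherwise derive from current free memory" rule can be sketched as below. The types are trimmed local copies and `withResidentDefault` is a hypothetical reimplementation; that the default equals DefaultMaxResidentFrac times the snapshot's FreeBytes is inferred from the constant's description rather than confirmed by the source.

```go
package main

import "fmt"

// Trimmed local copies of the documented types.
type deviceSnapshot struct{ TotalBytes, FreeBytes int64 }
type policy struct{ MaxResidentBytes int64 }

// defaultMaxResidentFrac mirrors the documented DefaultMaxResidentFrac value.
const defaultMaxResidentFrac = 0.8

// withResidentDefault leaves an explicit cap untouched and otherwise derives
// one from the snapshot's currently free memory.
func withResidentDefault(p policy, dev deviceSnapshot) policy {
	if p.MaxResidentBytes > 0 {
		return p
	}
	p.MaxResidentBytes = int64(float64(dev.FreeBytes) * defaultMaxResidentFrac)
	return p
}

func main() {
	// Two snapshots taken at different times give different defaults.
	fmt.Println(withResidentDefault(policy{}, deviceSnapshot{FreeBytes: 1000}))
	fmt.Println(withResidentDefault(policy{}, deviceSnapshot{FreeBytes: 4000}))
	// An explicit cap is never overridden.
	fmt.Println(withResidentDefault(policy{MaxResidentBytes: 500}, deviceSnapshot{FreeBytes: 4000}))
}
```

Because the policy is passed and returned by value, the caller's stored policy keeps its zero cap, so each resolution can recompute the default from a fresh snapshot.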