Documentation
Overview ¶
Package gpu provides detection, allocation, and health monitoring of GPU devices for the Helix cluster. Detection reads real, OS-native sources through a build-tag-selected detectGPUsPlatform: NVIDIA /proc on Linux and system_profiler on macOS; unsupported operating systems return an explicit error. Production never substitutes a mock inventory; tests that need a controlled inventory use Manager.InjectGPUsForTest. The Manager type owns the device inventory and is safe for concurrent use; the Monitor type periodically refreshes per-GPU health metrics.
The package also defines the unified GPUBackend interface and the associated value types used by the backend registry and all concrete backend implementations.
The NVIDIA /proc information parser is pure: it carries no build tag, runs on all platforms (M3, Linux, etc.), and does no I/O. All I/O lives in nvidia_reader_linux.go (linux-only).
Index ¶
- Variables
- func ConfigureSharing(ctx context.Context, b GPUBackend, deviceID string, req SharingRequest) error
- type AESGCMEncryptor
- type AppleBackend
- func (a *AppleBackend) AllocateMemory(_ context.Context, _ string, _ int) (string, error)
- func (a *AppleBackend) DetectDevices(ctx context.Context) ([]DeviceInfo, error)
- func (a *AppleBackend) DisableMPS(_ context.Context, _ string) error
- func (a *AppleBackend) EnableMPS(_ context.Context, _ string, _ int) error
- func (a *AppleBackend) Execute(_ context.Context, _, _ string, _ time.Duration) (string, error)
- func (a *AppleBackend) ExecuteDistributed(_ context.Context, _ string, _ []string, _ time.Duration) (map[string]string, error)
- func (a *AppleBackend) FreeMemory(_ context.Context, _ string, _ string) error
- func (a *AppleBackend) GetDeviceStatus(ctx context.Context, id string) (DeviceStatus, error)
- func (a *AppleBackend) GetMetrics(ctx context.Context, id string) (GPUMetrics, error)
- func (a *AppleBackend) Vendor() string
- type Attestor
- type BackendRegistry
- type ChutesAttestor
- type Decision
- type DeviceInfo
- type DeviceSharingState
- type DeviceStatus
- type Encryptor
- type GPU
- type GPUBackend
- type GPUMetrics
- type GPUStatus
- type HMACAttestor
- type MIGDevice
- type MIGInstance
- type MIGProfile
- type Manager
- func (m *Manager) AllocateForChutes(jobID string, attestor ChutesAttestor) (*GPU, error)
- func (m *Manager) AllocateGPU(jobID string) (*GPU, error)
- func (m *Manager) AllocateGPUByMemory(jobID string, minMemoryMB int) (*GPU, error)
- func (m *Manager) AttestGPU(id string, attestor ChutesAttestor) (bool, error)
- func (m *Manager) AttestationResult(id string) (result bool, ok bool)
- func (m *Manager) ChutesCapacity() Stats
- func (m *Manager) DetectGPUs() error
- func (m *Manager) GetGPUByID(id string) (*GPU, error)
- func (m *Manager) InjectGPUsForTest(gpus []*GPU)
- func (m *Manager) IsChutesEligible(id string) bool
- func (m *Manager) ListGPUs() []*GPU
- func (m *Manager) ReleaseGPU(jobID string)
- func (m *Manager) ReserveForHelixPoW(fraction float64) error
- func (m *Manager) SetHelixLoad(fraction float64)
- func (m *Manager) SetOffline(id string) error
- func (m *Manager) SetOnline(id string) error
- func (m *Manager) StartMonitoring(ctx context.Context, opts ...MonitorOption) (*Monitor, error)
- func (m *Manager) Stats() Stats
- func (m *Manager) StopMonitoring()
- type Monitor
- type MonitorOption
- type NvidiaGPUInfo
- type Reservation
- func (r *Reservation) Admit(class WorkloadClass, amount int) Decision
- func (r *Reservation) AvailableFor(class WorkloadClass) int
- func (r *Reservation) Capacity() int
- func (r *Reservation) Release(class WorkloadClass, amount int) error
- func (r *Reservation) Reserve(class WorkloadClass, amount int) (Decision, error)
- func (r *Reservation) Reserved(class WorkloadClass) int
- func (r *Reservation) Used(class WorkloadClass) int
- type ReservationConfig
- type SharingMode
- type SharingRequest
- type Stats
- type WorkloadClass
Constants ¶
This section is empty.
Variables ¶
var (
	// ProfileSmall is "vGPU-Small": 1g.10gb.
	ProfileSmall = MIGProfile{Name: "1g.10gb", Slices: 1, MemoryGB: 10}
	// ProfileMedium is "vGPU-Medium": 2g.20gb.
	ProfileMedium = MIGProfile{Name: "2g.20gb", Slices: 2, MemoryGB: 20}
	// ProfileLarge is "vGPU-Large": 3g.40gb.
	ProfileLarge = MIGProfile{Name: "3g.40gb", Slices: 3, MemoryGB: 40}
	// ProfileFull is "full": 7g.80gb.
	ProfileFull = MIGProfile{Name: "7g.80gb", Slices: 7, MemoryGB: 80}
)
Standard MIG profile catalog for H100 / H200 / B200 class devices (80 GB HBM, 7 compute slices). These are the four vGPU tiers supported by the Helix scheduler. The values mirror the NVIDIA-defined partition plan for an 80 GB SXM device:
- 1g.10gb → 1 slice, 10 GB → up to 7 instances
- 2g.20gb → 2 slices, 20 GB → up to 3 instances (6 slices, leaving 1 unused)
- 3g.40gb → 3 slices, 40 GB → up to 2 instances (6 slices, leaving 1 unused)
- 7g.80gb → 7 slices, 80 GB → 1 instance (the whole device)
WHY these four: they cover the four standard Helix vGPU tiers (Small/Medium/Large/Full). Any other partition shape requires a custom MIGProfile and is NOT in the catalog; unknown profiles are rejected by the partition function.
var ErrAttestationRequired = errors.New("gpu: GPU has not passed attestation; ineligible for Chutes inference")
ErrAttestationRequired is returned by AllocateForChutes when the candidate GPU did not PASS attestation. It is a sentinel so callers (and tests) can match the "not attested" refusal with errors.Is rather than string-sniffing, and so the refusal is unambiguously distinct from a plain "no available GPUs" condition.
WHY a typed sentinel: the Chutes inference allocation gate is security-relevant (only GraVal-attested GPUs may serve external inference). A caller that wants to surface "attestation pending/failed" to an operator must be able to detect this precise condition; a bare fmt error would force brittle string matching.
var ErrMIGAlreadyPartitioned = errors.New("mig: device is already partitioned; reconfiguration requires a device reset")
ErrMIGAlreadyPartitioned is returned by PartitionGPU when a MIGDevice is already partitioned and cannot be repartitioned without reset.
var ErrNoBackendDetected = errors.New("gpu: no backend detected any device")
ErrNoBackendDetected is the sentinel returned by AutoDetect when none of the registered backends reports at least one device. Callers compare via errors.Is(err, ErrNoBackendDetected).
var ErrOversubscribed = errors.New("mig: partition would oversubscribe device capacity")
ErrOversubscribed is returned by PartitionGPU when the requested profile would exceed the device's slice or memory capacity. It is a sentinel so callers can distinguish oversubscription from "no such GPU" or "profile unknown".
var ErrProfileNotFound = errors.New("mig: profile not found in standard catalog")
ErrProfileNotFound is returned by LookupProfile when the requested name does not correspond to a catalog entry. It is a sentinel so callers can distinguish "unknown profile" from other errors without string-matching.
var ErrUnsupported = errors.New("gpu: operation unsupported on this backend/host")
ErrUnsupported is returned by GPUBackend operations that are not available on the current backend or host toolchain (e.g. Metal-compute execution on Apple Silicon where no compute toolchain is wired). Callers MUST test via errors.Is(err, ErrUnsupported) rather than string comparison.
var StandardProfiles map[string]MIGProfile
StandardProfiles is the catalog of Helix-supported MIG profiles, keyed by their canonical SKU name. Because it is a Go map, iteration order is not guaranteed. It is populated at package init so it is always consistent with the profile variables above.
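A minimal sketch of how such a catalog could be built from the profile variables at init time and queried by name. The lower-case lookupProfile and isNotFound helpers are hypothetical stand-ins written for this example; the package's own LookupProfile may be implemented differently.

```go
package main

import (
	"errors"
	"fmt"
)

// MIGProfile mirrors the documented struct so the sketch compiles on its own.
type MIGProfile struct {
	Name     string
	Slices   int
	MemoryGB int
}

var (
	ProfileSmall  = MIGProfile{Name: "1g.10gb", Slices: 1, MemoryGB: 10}
	ProfileMedium = MIGProfile{Name: "2g.20gb", Slices: 2, MemoryGB: 20}
	ProfileLarge  = MIGProfile{Name: "3g.40gb", Slices: 3, MemoryGB: 40}
	ProfileFull   = MIGProfile{Name: "7g.80gb", Slices: 7, MemoryGB: 80}
)

var ErrProfileNotFound = errors.New("mig: profile not found in standard catalog")

// StandardProfiles is filled in init from the variables above, so the map
// cannot drift from the profile definitions.
var StandardProfiles map[string]MIGProfile

func init() {
	StandardProfiles = make(map[string]MIGProfile)
	for _, p := range []MIGProfile{ProfileSmall, ProfileMedium, ProfileLarge, ProfileFull} {
		StandardProfiles[p.Name] = p
	}
}

// lookupProfile (hypothetical) wraps the sentinel so callers can still match
// it with errors.Is while seeing which name was requested.
func lookupProfile(name string) (MIGProfile, error) {
	p, ok := StandardProfiles[name]
	if !ok {
		return MIGProfile{}, fmt.Errorf("%q: %w", name, ErrProfileNotFound)
	}
	return p, nil
}

// isNotFound (hypothetical) reports whether err is the unknown-profile case.
func isNotFound(err error) bool { return errors.Is(err, ErrProfileNotFound) }

func main() {
	p, err := lookupProfile("2g.20gb")
	fmt.Println(p, err)
	_, err = lookupProfile("4g.40gb")
	fmt.Println(isNotFound(err))
}
```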
Functions ¶
func ConfigureSharing ¶
func ConfigureSharing(ctx context.Context, b GPUBackend, deviceID string, req SharingRequest) error
ConfigureSharing attempts to apply the hardware-level sharing configuration described by req to the device identified by deviceID on backend b.
For AppleBackend (and any backend that does not expose MPS/MIG hardware partitioning), this returns ErrUnsupported. The caller MUST test via errors.Is(err, ErrUnsupported). The admission state machine (DeviceSharingState) is always host-provable and independent of hardware support; this function is ONLY the hardware-enable seam and must NOT fake success.
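The caller-side handling described above might look roughly like the following sketch. The configureSharing stub and the hardwareSharingEnabled helper are hypothetical names invented for this example; the stub's behaviour (success for any vendor except "apple") is an assumption made only so the program runs without real hardware.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnsupported is redeclared here so the sketch is self-contained.
var ErrUnsupported = errors.New("gpu: operation unsupported on this backend/host")

// configureSharing is a hypothetical stand-in for gpu.ConfigureSharing. It
// wraps the sentinel the way a backend without MPS/MIG support might.
func configureSharing(vendor string) error {
	if vendor == "apple" {
		return fmt.Errorf("configure sharing on %s: %w", vendor, ErrUnsupported)
	}
	return nil // assumption: every other vendor succeeds in this stub
}

// hardwareSharingEnabled treats ErrUnsupported as "not enabled, but not a
// failure" and returns any other error to the caller unchanged.
func hardwareSharingEnabled(vendor string) (bool, error) {
	err := configureSharing(vendor)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, ErrUnsupported):
		return false, nil
	default:
		return false, err
	}
}

func main() {
	for _, v := range []string{"nvidia", "apple"} {
		ok, err := hardwareSharingEnabled(v)
		fmt.Println(v, ok, err)
	}
}
```

Matching with errors.Is keeps working when the sentinel is wrapped with extra context, which a plain == comparison would not.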
Types ¶
type AESGCMEncryptor ¶
type AESGCMEncryptor struct {
// contains filtered or unexported fields
}
AESGCMEncryptor is the stdlib reference Encryptor using AES-256-GCM. Each Seal uses a fresh random nonce prepended to the ciphertext.
func NewAESGCMEncryptor ¶
func NewAESGCMEncryptor(key []byte) (*AESGCMEncryptor, error)
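NewAESGCMEncryptor builds an AESGCMEncryptor from a 16-, 24-, or 32-byte key. Only a 32-byte key selects AES-256; 16- and 24-byte keys select AES-128 and AES-192 respectively.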
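As an illustration of the "fresh random nonce prepended to the ciphertext" layout, here is a minimal sketch using only crypto/aes and crypto/cipher. The free functions seal, open, and roundTrips are hypothetical; the real AESGCMEncryptor keeps its key in unexported fields and may differ in detail.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal returns nonce || ciphertext || tag, using a fresh random nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends to its first argument, so passing nonce as dst prepends it.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open splits off the nonce, then authenticates and decrypts the rest.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed input shorter than nonce")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

// roundTrips reports whether msg survives seal followed by open.
func roundTrips(key, msg []byte) bool {
	sealed, err := seal(key, msg)
	if err != nil {
		return false
	}
	got, err := open(key, sealed)
	return err == nil && bytes.Equal(got, msg)
}

func main() {
	key := make([]byte, 32) // 32 bytes selects AES-256; all-zero key is for demo only
	fmt.Println(roundTrips(key, []byte("reservation ledger")))
}
```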
type AppleBackend ¶
type AppleBackend struct{}
AppleBackend is the GPUBackend for Apple Silicon (and Intel Mac) GPUs. DetectDevices, GetDeviceStatus, and GetMetrics use real OS-native data via the existing detectGPUsPlatform() / Manager pipeline (system_profiler + sysctl on darwin; an explicit error on other platforms).
AllocateMemory, FreeMemory, Execute, ExecuteDistributed, EnableMPS, and DisableMPS all return ErrUnsupported: this host has no Metal-compute execution toolchain wired. They MUST NOT fake success (CLAUDE-1/CLAUDE-2).
func (*AppleBackend) AllocateMemory ¶
AllocateMemory returns ErrUnsupported. Apple Silicon GPU memory is managed by the Metal/unified-memory model; there is no wired execution toolchain.
func (*AppleBackend) DetectDevices ¶
func (a *AppleBackend) DetectDevices(ctx context.Context) ([]DeviceInfo, error)
DetectDevices probes the host for Apple GPU devices. On darwin it delegates to detectGPUsPlatform (system_profiler + sysctl) and converts the results to []DeviceInfo. On other platforms the platform detector returns an error and DetectDevices returns that error with no fabricated inventory.
func (*AppleBackend) DisableMPS ¶
func (a *AppleBackend) DisableMPS(_ context.Context, _ string) error
DisableMPS returns ErrUnsupported. Apple GPUs do not expose an MPS-equivalent interface accessible without proprietary tooling.
func (*AppleBackend) EnableMPS ¶
EnableMPS returns ErrUnsupported. Apple GPUs do not expose an MPS-equivalent interface accessible without proprietary tooling.
func (*AppleBackend) Execute ¶
Execute returns ErrUnsupported. No Metal-compute execution toolchain is wired; returning fake success would be a CLAUDE-1 PASS-bluff.
func (*AppleBackend) ExecuteDistributed ¶
func (a *AppleBackend) ExecuteDistributed(_ context.Context, _ string, _ []string, _ time.Duration) (map[string]string, error)
ExecuteDistributed returns ErrUnsupported. No Metal-compute execution toolchain is wired; returning fake success would be a CLAUDE-1 PASS-bluff.
func (*AppleBackend) FreeMemory ¶
FreeMemory returns ErrUnsupported. Apple Silicon GPU memory is managed by the Metal/unified-memory model; there is no wired execution toolchain.
func (*AppleBackend) GetDeviceStatus ¶
func (a *AppleBackend) GetDeviceStatus(ctx context.Context, id string) (DeviceStatus, error)
GetDeviceStatus returns a health snapshot for the device identified by id. On darwin, temperature and utilization are not exposed by system_profiler (Metal Performance HUD requires a running app). We return the device as Healthy with zero metrics, which is honest for an idle GPU and does not fabricate sensor readings.
func (*AppleBackend) GetMetrics ¶
func (a *AppleBackend) GetMetrics(ctx context.Context, id string) (GPUMetrics, error)
GetMetrics returns a lightweight runtime-metric snapshot for the device id. Apple's GPU metrics are not accessible without IOKit/Metal-Performance-HUD; we return a zero-valued GPUMetrics (honest: the GPU is idle / not measured).
func (*AppleBackend) Vendor ¶
func (a *AppleBackend) Vendor() string
Vendor implements GPUBackend. The registry key is "apple".
type Attestor ¶
type Attestor interface {
// Attest returns a proof over snapshot bound to the configured identity.
Attest(snapshot []byte) ([]byte, error)
// Verify reports whether proof is a valid attestation over snapshot.
Verify(snapshot, proof []byte) bool
}
Attestor produces and verifies a proof binding a reservation snapshot to a node/GPU identity. The stdlib reference implementation (HMACAttestor) uses HMAC-SHA256 and genuinely verifies in tests. The submodule's proof-of-GPU attestation implements this SAME interface, so a real hardware-rooted Attestor plugs in without changing the reservation seam.
type BackendRegistry ¶
type BackendRegistry struct {
// contains filtered or unexported fields
}
BackendRegistry holds the set of registered GPUBackend implementations and provides AutoDetect to choose a usable one at runtime. It is safe for concurrent use.
func NewBackendRegistry ¶
func NewBackendRegistry() *BackendRegistry
NewBackendRegistry returns an empty, ready-to-use BackendRegistry.
func (*BackendRegistry) AutoDetect ¶
func (r *BackendRegistry) AutoDetect(ctx context.Context) (GPUBackend, error)
AutoDetect iterates the registered backends in a deterministic probe order — nvidia → amd → intel → apple → (everything else in registration order) — performing REAL toolchain probes via exec.LookPath and actually calling each backend's DetectDevices. The first backend that reports >=1 device is returned. If no backend finds any device, ErrNoBackendDetected is returned.
The probe order for the preferred vendors is hard-coded so that CUDA takes precedence over ROCm / oneAPI / Apple on hosts with multiple toolchains.
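The ordering rule can be sketched in isolation as below. probeOrder is a hypothetical helper written for this example; it only computes the order in which vendors would be tried and does not perform the exec.LookPath or DetectDevices probes themselves.

```go
package main

import "fmt"

// preferred is the fixed vendor order described for AutoDetect.
var preferred = []string{"nvidia", "amd", "intel", "apple"}

// probeOrder returns the registered vendors with the preferred ones first
// (in the fixed order above), followed by all others in registration order.
func probeOrder(registered []string) []string {
	present := make(map[string]bool, len(registered))
	for _, v := range registered {
		present[v] = true
	}
	order := make([]string, 0, len(registered))
	seen := make(map[string]bool, len(registered))
	for _, v := range preferred {
		if present[v] {
			order = append(order, v)
			seen[v] = true
		}
	}
	for _, v := range registered {
		if !seen[v] {
			order = append(order, v)
			seen[v] = true
		}
	}
	return order
}

func main() {
	// Registration order differs from probe order on purpose.
	fmt.Println(probeOrder([]string{"custom", "apple", "nvidia"}))
	// prints: [nvidia apple custom]
}
```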
func (*BackendRegistry) Get ¶
func (r *BackendRegistry) Get(vendor string) (GPUBackend, bool)
Get returns the registered backend for vendor, or (nil, false) if none.
func (*BackendRegistry) List ¶
func (r *BackendRegistry) List() []string
List returns all registered vendor strings in registration order.
func (*BackendRegistry) Register ¶
func (r *BackendRegistry) Register(b GPUBackend) error
Register adds b to the registry, keyed by b.Vendor(). An error is returned if a backend for the same vendor is already registered — duplicate registration is almost always a bug in the caller.
type ChutesAttestor ¶
type ChutesAttestor interface {
// Attest returns whether gpu has passed attestation. gpu is a defensive
// clone owned by the caller; implementations must treat it as read-only.
Attest(gpu *GPU) (bool, error)
}
ChutesAttestor is the injectable attestation seam. A ChutesAttestor reports whether a given GPU has passed hardware/GraVal attestation and is therefore trustworthy to serve Chutes (external) inference workloads.
- (true, nil) -> attestation PASSED; the GPU is eligible.
- (false, nil) -> attestation FAILED cleanly; the GPU is NOT eligible.
- (_, err) -> attestation could not be determined (e.g. the GraVal controller was unreachable); the result is unknown and the GPU is NOT eligible. The error is surfaced honestly (never coerced to a fake PASS), satisfying CLAUDE-1 / CLAUDE-2 degrade-honestly rules.
Production wires this to the GraVal attest controller (pkg/chutes attest) or pkg/gpuattest; tests inject a controllable fake. This package intentionally does NOT import those packages so the gate stays unit-testable without real hardware and without cross-package coupling — the adapter lives at the call site that already depends on both.
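The three outcomes listed above could be mapped to an allocation decision along these lines. This is a sketch under assumptions: attestFunc is a hypothetical function-typed stand-in for ChutesAttestor (it takes a GPU ID rather than a *GPU), and gate, isRefusal, isTransport, and errUnreachable are names invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrAttestationRequired = errors.New("gpu: GPU has not passed attestation; ineligible for Chutes inference")

// errUnreachable is a hypothetical transport failure used by the fake below.
var errUnreachable = errors.New("attest controller unreachable")

// attestFunc is a hypothetical, simplified stand-in for ChutesAttestor.
type attestFunc func(gpuID string) (bool, error)

// gate returns nil only for (true, nil). A clean failure becomes
// ErrAttestationRequired; an indeterminate result is wrapped and returned,
// never coerced into a pass.
func gate(gpuID string, attest attestFunc) error {
	passed, err := attest(gpuID)
	if err != nil {
		return fmt.Errorf("attest %s: %w", gpuID, err)
	}
	if !passed {
		return ErrAttestationRequired
	}
	return nil
}

// isRefusal reports whether err is the "not attested" refusal.
func isRefusal(err error) bool { return errors.Is(err, ErrAttestationRequired) }

// isTransport reports whether err stems from the unreachable controller.
func isTransport(err error) bool { return errors.Is(err, errUnreachable) }

func main() {
	pass := func(string) (bool, error) { return true, nil }
	fail := func(string) (bool, error) { return false, nil }
	down := func(string) (bool, error) { return false, errUnreachable }
	fmt.Println(gate("gpu-0", pass))
	fmt.Println(gate("gpu-0", fail))
	fmt.Println(gate("gpu-0", down))
}
```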
type Decision ¶
type Decision struct {
// Admitted reports whether the allocation is allowed.
Admitted bool
// Reason is a human-readable explanation, always set (even when admitted)
// so audit logs and tests can assert on the exact branch taken.
Reason string
}
Decision is the outcome of an admission check for a prospective allocation.
type DeviceInfo ¶
type DeviceInfo struct {
// ID is the backend-assigned device identifier (e.g. "gpu-0").
ID string
// UUID is a stable, world-unique identifier for the device.
UUID string
// Vendor is the device vendor name (e.g. "nvidia", "amd", "apple", "intel").
Vendor string
// Model is the human-readable model name (e.g. "Apple M3 GPU").
Model string
// MemoryMB is total device memory in mebibytes. May be 0 if the backend
// cannot determine it without fabricating a value.
MemoryMB int
// ComputeUnits is the number of compute units / shader processors reported
// by the device. May be 0 when the backend cannot read this value.
ComputeUnits int
}
DeviceInfo describes static, read-once properties of a GPU device.
type DeviceSharingState ¶
type DeviceSharingState struct {
// DeviceID is the backend-assigned identifier for the device.
DeviceID string
// Mode is the sharing mode currently in effect for the device.
// Unset (zero value = ShareExclusive) until the first job is admitted.
Mode SharingMode
// contains filtered or unexported fields
}
DeviceSharingState tracks the live sharing state for a single GPU device. It is safe for concurrent use.
func NewDeviceSharingState ¶
func NewDeviceSharingState(deviceID string) *DeviceSharingState
NewDeviceSharingState constructs an empty DeviceSharingState for deviceID.
func (*DeviceSharingState) Admit ¶
func (s *DeviceSharingState) Admit(req SharingRequest, deviceTotalQuantumMS, deviceMIGmax int) error
Admit attempts to admit req onto the device. deviceTotalQuantumMS is the maximum milliseconds budget for the device (used by ShareTimeSlice). deviceMIGmax is the maximum number of MIG partitions supported (used by ShareMIG). On success the job is tracked and nil is returned. On failure a descriptive error is returned and state is unchanged.
func (*DeviceSharingState) Release ¶
func (s *DeviceSharingState) Release(jobID string) error
Release removes the job identified by jobID from the device, freeing its associated quantum or partition slot. Returns an error if the job is not currently admitted.
type DeviceStatus ¶
type DeviceStatus struct {
// ID is the device identifier from DeviceInfo.
ID string
// TemperatureC is the current die temperature in Celsius.
TemperatureC float64
// UtilizationPercent is the current shader/compute utilization in [0,100].
UtilizationPercent float64
// MemoryUsedMB is the number of mebibytes currently occupied on the device.
MemoryUsedMB int
// Healthy is false when the device is in an error or degraded state.
Healthy bool
}
DeviceStatus captures the dynamic health state of a GPU device.
type Encryptor ¶
type Encryptor interface {
// Seal encrypts plaintext, returning an authenticated ciphertext.
Seal(plaintext []byte) ([]byte, error)
// Open authenticates and decrypts a ciphertext produced by Seal.
Open(ciphertext []byte) ([]byte, error)
}
Encryptor seals and opens reservation-ledger bytes. The stdlib reference implementation (AESGCMEncryptor) provides authenticated AES-256-GCM and actually round-trips in tests. The digital.vasic.security submodule's post-quantum end-to-end encryption implements this SAME interface, so a PQ-E2EE Encryptor can be swapped in without touching reservation logic.
type GPU ¶
type GPU struct {
ID string `json:"id"`
UUID string `json:"uuid"`
Model string `json:"model"`
MemoryMB int `json:"memory_mb"`
UtilizationPercent float64 `json:"utilization_percent"`
TemperatureC float64 `json:"temperature_c"`
AllocatedTo string `json:"allocated_to"`
Status GPUStatus `json:"status"`
}
GPU represents a single GPU device.
type GPUBackend ¶
type GPUBackend interface {
// Vendor returns the vendor string this backend handles (e.g. "nvidia",
// "amd", "apple", "intel"). It is used as the registry key and is always
// lower-case.
Vendor() string
// DetectDevices probes the host for devices managed by this backend and
// returns their static descriptions. A nil error with an empty slice means
// the backend is functional but finds no devices. An error means the probe
// could not be completed (toolchain missing, permission denied, etc.).
DetectDevices(ctx context.Context) ([]DeviceInfo, error)
// GetDeviceStatus returns the current health snapshot for the device
// identified by id (as returned by DetectDevices).
GetDeviceStatus(ctx context.Context, id string) (DeviceStatus, error)
// GetMetrics returns a lightweight runtime-metric snapshot for the device
// identified by id.
GetMetrics(ctx context.Context, id string) (GPUMetrics, error)
// AllocateMemory requests that mb mebibytes be reserved on the device
// identified by id. On success it returns an opaque buffer handle that
// must be passed to FreeMemory. Returns ErrUnsupported when the backend
// has no memory-management facility wired.
AllocateMemory(ctx context.Context, id string, mb int) (string, error)
// FreeMemory releases the buffer identified by buf on the device id, as
// previously allocated by AllocateMemory. Returns ErrUnsupported when the
// backend has no memory-management facility wired.
FreeMemory(ctx context.Context, id string, buf string) error
// Execute dispatches jobID to the single device id, waiting up to timeout.
// On success it returns an opaque execution receipt. Returns ErrUnsupported
// when the backend has no compute-execution toolchain wired.
Execute(ctx context.Context, jobID string, id string, timeout time.Duration) (string, error)
// ExecuteDistributed dispatches jobID across the set of device ids, waiting
// up to timeout. Returns a map from device id to execution receipt.
// Returns ErrUnsupported when the backend has no distributed-compute
// toolchain wired.
ExecuteDistributed(ctx context.Context, jobID string, ids []string, timeout time.Duration) (map[string]string, error)
// EnableMPS enables Multi-Process Service (or an equivalent time-slicing
// mechanism) on device id, configuring a time slice of sliceMS milliseconds.
// Returns ErrUnsupported when MPS is not available for this backend.
EnableMPS(ctx context.Context, id string, sliceMS int) error
// DisableMPS tears down Multi-Process Service on device id.
// Returns ErrUnsupported when MPS is not available for this backend.
DisableMPS(ctx context.Context, id string) error
}
GPUBackend is the unified interface that every vendor-specific GPU backend must satisfy. Implementations are registered in a BackendRegistry and probed by AutoDetect to find a usable backend at runtime.
Implementations MUST be safe for concurrent use.
Methods that cannot be supported on the current host or toolchain MUST return ErrUnsupported (not nil, and not a fabricated success).
type GPUMetrics ¶
type GPUMetrics struct {
// TemperatureC is the current die temperature in Celsius.
TemperatureC float64
// UtilizationPercent is the current compute utilization in [0,100].
UtilizationPercent float64
// MemoryUsedMB is the number of mebibytes currently in use.
MemoryUsedMB int
}
GPUMetrics is a lightweight snapshot of runtime metrics for a single device. It is the per-sample type used by the monitoring loop and scrape endpoints.
type GPUStatus ¶
type GPUStatus int
GPUStatus represents the current status of a GPU.
func (GPUStatus) MarshalJSON ¶
MarshalJSON implements json.Marshaler.
func (*GPUStatus) UnmarshalJSON ¶
UnmarshalJSON implements json.Unmarshaler.
type HMACAttestor ¶
type HMACAttestor struct {
// contains filtered or unexported fields
}
HMACAttestor is the stdlib reference Attestor using HMAC-SHA256 over the identity and snapshot. Verify is constant-time.
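A minimal sketch of an HMAC-SHA256 proof with constant-time verification follows. The way identity and snapshot are framed inside the MAC (identity, a zero separator byte, then the snapshot) is an assumption of this example, as are the free functions attest and verify; the source does not state how HMACAttestor combines the two inputs.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// attest computes HMAC-SHA256 over identity, a zero separator byte, and the
// snapshot. The framing is an assumption made for this sketch.
func attest(key []byte, identity string, snapshot []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(identity))
	mac.Write([]byte{0})
	mac.Write(snapshot)
	return mac.Sum(nil)
}

// verify recomputes the proof and compares it with hmac.Equal, which runs in
// constant time with respect to the contents of its arguments.
func verify(key []byte, identity string, snapshot, proof []byte) bool {
	return hmac.Equal(attest(key, identity, snapshot), proof)
}

func main() {
	key := []byte("demo-key") // demo key only
	proof := attest(key, "node-1/gpu-0", []byte("snapshot"))
	fmt.Println(verify(key, "node-1/gpu-0", []byte("snapshot"), proof))
	fmt.Println(verify(key, "node-2/gpu-0", []byte("snapshot"), proof))
}
```

Binding the identity into the MAC means a proof issued for one node or GPU does not verify for another, even over the same snapshot.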
func NewHMACAttestor ¶
func NewHMACAttestor(identity string, key []byte) *HMACAttestor
NewHMACAttestor builds an HMACAttestor for the given identity and key.
func (*HMACAttestor) Attest ¶
func (a *HMACAttestor) Attest(snapshot []byte) ([]byte, error)
Attest implements Attestor.
func (*HMACAttestor) Verify ¶
func (a *HMACAttestor) Verify(snapshot, proof []byte) bool
Verify implements Attestor.
type MIGDevice ¶
type MIGDevice struct {
// GPUID is the Manager inventory ID of the physical GPU backing this
// MIGDevice.
GPUID string
// TotalSlices is the total number of compute slices on this physical GPU.
// Typically 7 for H100/H200/B200.
TotalSlices int
// TotalMemoryGB is the total GPU memory in GB (e.g. 80 for an 80 GB device).
TotalMemoryGB int
// Partitioned records whether PartitionGPU has been called. Once true the
// partition plan cannot change — the MIG hardware constraint: a re-partition
// requires a GPU reset which is an operator action, not a runtime call.
Partitioned bool
// Profile is the MIGProfile used to partition this device. Valid only when
// Partitioned is true.
Profile MIGProfile
// contains filtered or unexported fields
}
MIGDevice is the in-memory representation of a physical GPU partitioned into MIG instances. It is safe for concurrent use.
WHY separate from Manager's GPU slice: MIG operation replaces whole-device allocation with fine-grained instance allocation. A MIGDevice tracks both the partition plan (profile, instance count) and per-instance allocation state so the scheduler can decide which instance to hand to a job without re-reading hardware on every decision.
Immutability contract: once Partitioned is true, no field may change except per-instance allocation state. Enforced by returning ErrMIGAlreadyPartitioned from PartitionGPU.
func NewMIGDevice ¶
NewMIGDevice creates an unpartitioned MIGDevice for the given GPU.
totalSlices and totalMemoryGB must match the physical device. For H100/H200/B200-class devices these are 7 and 80 respectively.
func (*MIGDevice) Instances ¶
func (d *MIGDevice) Instances() []*MIGInstance
Instances returns a snapshot of the MIG instances created by PartitionGPU. Returns nil if the device has not been partitioned yet.
func (*MIGDevice) IsPartitioned ¶
IsPartitioned reports whether the device has been partitioned.
func (*MIGDevice) PartitionGPU ¶
func (d *MIGDevice) PartitionGPU(profile MIGProfile) ([]*MIGInstance, error)
PartitionGPU partitions the device into as many MIGInstance objects as the profile allows within the device's slice and memory capacity.
Rules enforced (per CLAUDE-1 / spec):
- The profile must fit the device: both floor(TotalSlices/profile.Slices) and floor(TotalMemoryGB/profile.MemoryGB) must be >= 1, i.e. the profile's slice and memory requirements must individually not exceed the device. If either bound is exceeded, ErrOversubscribed is returned and the device is NOT modified.
- The device may be partitioned at most once. A second call returns ErrMIGAlreadyPartitioned (the hardware constraint: live reconfiguration is impossible without a reset, which is an operator action).
- The instance count is floor(TotalSlices / profile.Slices) limited by floor(TotalMemoryGB / profile.MemoryGB) — the binding resource is the more constrained of the two.
On success the device transitions to Partitioned=true and the instances slice is populated.
Mutation guard: the load-bearing oversubscription check is the "count < 1" comparison at the end of the capacity computation (below). With integer floor division, any profile whose Slices or MemoryGB exceeds the device total will produce a per-resource ratio of 0, making count = min(0, …) = 0 and triggering ErrOversubscribed. Removing that guard would allow count=0 to reach the instance-creation loop, returning an empty (non-nil) slice with a nil error — TestMIGPartition_OversubscriptionRejected and TestMIGPartition_OversubscriptionRejected_MemoryBound both assert that the error IS NOT nil and IS ErrOversubscribed, so they fail when the guard is removed. The TestMIGPartition_OversubscriptionRejected_GuardIsolation test additionally proves that the sole source of the rejection is the count<1 branch, not any other pre-filter.
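The capacity computation and its count < 1 guard can be sketched as a standalone function. instanceCount is a hypothetical helper written for this example; the extra check for non-positive profile values is an addition of the sketch to avoid division by zero and is not described in the source.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrOversubscribed = errors.New("mig: partition would oversubscribe device capacity")

// instanceCount returns min(floor(slices ratio), floor(memory ratio)) and
// rejects the profile when that count is below 1.
func instanceCount(totalSlices, totalMemoryGB, profSlices, profMemoryGB int) (int, error) {
	if profSlices <= 0 || profMemoryGB <= 0 {
		// Sketch-only guard against division by zero.
		return 0, errors.New("profile must have positive slices and memory")
	}
	bySlices := totalSlices / profSlices     // integer division floors
	byMemory := totalMemoryGB / profMemoryGB // integer division floors
	count := bySlices
	if byMemory < count {
		count = byMemory
	}
	if count < 1 {
		// The load-bearing guard: a ratio of 0 means the profile does not fit.
		return 0, ErrOversubscribed
	}
	return count, nil
}

func main() {
	profiles := []struct {
		name          string
		slices, memGB int
	}{
		{"1g.10gb", 1, 10}, {"2g.20gb", 2, 20}, {"3g.40gb", 3, 40}, {"7g.80gb", 7, 80},
	}
	for _, p := range profiles {
		n, err := instanceCount(7, 80, p.slices, p.memGB)
		fmt.Println(p.name, n, err)
	}
}
```

On a 7-slice, 80 GB device this yields 7, 3, 2, and 1 instances for the four catalog profiles, matching the catalog table.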
type MIGInstance ¶
type MIGInstance struct {
// ID uniquely identifies the instance within its parent device.
// Format: "<gpuID>-mig-<index>".
ID string
// Profile is the partition plan that defines this instance's resources.
Profile MIGProfile
// ParentGPUID is the ID of the physical GPU from which this instance was
// created. It is the link back to GPU.ID in the inventory.
ParentGPUID string
}
MIGInstance represents a single MIG instance carved from a physical GPU. It is read-only after creation — per NVIDIA MIG semantics, instances are fixed until the device is reset. The only mutable state is its allocation (which job, if any, holds it).
type MIGProfile ¶
type MIGProfile struct {
// Name is the canonical SKU name, e.g. "1g.10gb". It is the lookup key in
// the standard catalog.
Name string
// Slices is the number of GPU compute slices consumed by this profile. A
// full H100/H200/B200 class GPU exposes 7 compute slices.
Slices int
// MemoryGB is the GPU memory consumed by one instance of this profile.
MemoryGB int
}
MIGProfile describes a MIG (Multi-Instance GPU) slice configuration.
WHY a named profile rather than raw integers: NVIDIA MIG profiles have well-known SKU names (e.g. "1g.10gb") that hardware operators recognise; the name is the stable API handle while slice count and memory are the derived implementation values. Keeping the three together prevents callers from accidentally inventing profiles that do not correspond to real partition plans.
WHY immutable (read-only exported fields): MIG hardware partitions are FIXED at creation time — the GPU must be reset to change them. Representing that constraint in the type (no setters, no exported mutating fields behind a pointer) prevents any in-process code from attempting a live reconfiguration, which would be both incorrect and dangerously misleading to an operator.
func LookupProfile ¶
func LookupProfile(name string) (MIGProfile, error)
LookupProfile retrieves a MIGProfile from the standard catalog by its SKU name. It returns ErrProfileNotFound if the name is not known.
type Manager ¶
type Manager struct {
// contains filtered or unexported fields
}
Manager manages GPU resources for the cluster.
func (*Manager) AllocateForChutes ¶
func (m *Manager) AllocateForChutes(jobID string, attestor ChutesAttestor) (*GPU, error)
AllocateForChutes allocates an Available GPU to jobID for Chutes (external) inference — but ONLY a GPU that PASSES attestation via the injected attestor. This is the attestation-gated allocation path: an unattested or failed GPU is REFUSED with ErrAttestationRequired and is left untouched in the pool.
Behaviour:
- Idempotent per job: if jobID already holds a GPU it is returned unchanged (no second device consumed), mirroring AllocateGPU.
- Each candidate GPU is attested at allocation time (the gate re-evaluates; it is not a permanent denylist), and the result is recorded so IsChutesEligible / AttestationResult reflect it.
- On attestor transport error the wrapped error is returned and nothing is allocated.
WHY the gate is load-bearing: per CLAUDE-1, allowing a non-attested GPU to serve external inference would be a security PASS-bluff. Removing the "must pass attestation" check would let TestAllocateForChutes_RefusesUnattestedGPU allocate a failed GPU — that test is the mutation guard for this gate.
func (*Manager) AllocateGPU ¶
AllocateGPU allocates an available GPU to the given jobID. Allocation is idempotent per job: if jobID already holds a GPU, that same GPU is returned rather than reserving a second device. An empty jobID is rejected so that a caller cannot accidentally mass-bind unallocated GPUs (which all carry an empty AllocatedTo).
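The idempotence and empty-jobID rules might be sketched as follows. The gpu struct and allocate function are simplified, hypothetical stand-ins: the sketch treats an empty AllocatedTo as "available" and ignores GPUStatus and locking, which the real Manager handles.

```go
package main

import (
	"errors"
	"fmt"
)

// gpu is a simplified stand-in for the package's GPU type.
type gpu struct {
	ID          string
	AllocatedTo string
}

// allocate binds the first free GPU to jobID. If jobID already holds a GPU,
// that GPU is returned and no second device is consumed.
func allocate(gpus []gpu, jobID string) (string, error) {
	if jobID == "" {
		// An empty jobID would match every unallocated GPU below.
		return "", errors.New("jobID must not be empty")
	}
	for _, g := range gpus {
		if g.AllocatedTo == jobID {
			return g.ID, nil
		}
	}
	for i := range gpus {
		if gpus[i].AllocatedTo == "" {
			gpus[i].AllocatedTo = jobID
			return gpus[i].ID, nil
		}
	}
	return "", errors.New("no available GPUs")
}

func main() {
	inv := []gpu{{ID: "gpu-0"}, {ID: "gpu-1"}}
	first, _ := allocate(inv, "job-a")
	again, _ := allocate(inv, "job-a")
	fmt.Println(first, again) // same device both times
}
```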
func (*Manager) AllocateGPUByMemory ¶
AllocateGPUByMemory allocates an Available GPU with at least minMemoryMB of total memory to jobID, preferring the smallest-fitting device so larger GPUs remain free for heavier jobs (best-fit). Like AllocateGPU it is idempotent per job, but a job that already holds a GPU smaller than minMemoryMB is reported as an error rather than silently returning an undersized device.
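Best-fit selection, as described above, can be illustrated with a small standalone function. bestFit and the simplified gpu struct are hypothetical; an empty AllocatedTo stands in for the Available status, and the idempotence and undersized-holder checks are left out to keep the sketch focused on the selection rule.

```go
package main

import "fmt"

// gpu is a simplified stand-in for the package's GPU type.
type gpu struct {
	ID          string
	MemoryMB    int
	AllocatedTo string
}

// bestFit returns the index of the smallest free GPU whose total memory is at
// least minMemoryMB, or -1 if none fits.
func bestFit(gpus []gpu, minMemoryMB int) int {
	best := -1
	for i, g := range gpus {
		if g.AllocatedTo != "" || g.MemoryMB < minMemoryMB {
			continue // busy or too small
		}
		if best == -1 || g.MemoryMB < gpus[best].MemoryMB {
			best = i
		}
	}
	return best
}

func main() {
	inv := []gpu{
		{ID: "gpu-0", MemoryMB: 81920},
		{ID: "gpu-1", MemoryMB: 24576},
		{ID: "gpu-2", MemoryMB: 16384, AllocatedTo: "job-x"},
	}
	// gpu-2 would be the tightest fit but is busy, so gpu-1 is chosen and the
	// 80 GB device stays free for heavier jobs.
	fmt.Println(inv[bestFit(inv, 16000)].ID)
}
```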
func (*Manager) AttestGPU ¶
func (m *Manager) AttestGPU(id string, attestor ChutesAttestor) (bool, error)
AttestGPU runs the given ChutesAttestor against the GPU identified by id, records the boolean result in the per-GPU attestation map (in-memory, mu-guarded), and returns it. A non-nil attestor error is wrapped and returned, and the result is NOT recorded as a pass — an indeterminate attestation must never be cached as eligible.
AttestGPU is the single source of truth for "did this GPU pass": both AllocateForChutes and IsChutesEligible build on the result it records.
func (*Manager) AttestationResult ¶
AttestationResult returns the last recorded attestation result for the GPU id and whether any result has been recorded at all. The ok=false return distinguishes "never attested" from "attested and failed" (result=false, ok=true).
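The three observable states (never attested, attested and failed, attested and passed) can be modelled with a mutex-guarded map, as in the sketch below. All names here (`attestations`, `attest`, `result`, `eligible`, `errTransport`) are hypothetical. The sketch also assumes that an attestor error leaves any earlier recorded result untouched; the text above only states that an error is never recorded as a pass.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTransport is a hypothetical error standing in for an attestor failure.
var errTransport = errors.New("attestor unreachable")

// attestor is a hypothetical function-shaped stand-in for ChutesAttestor.
type attestor func(id string) (bool, error)

// attestations holds per-GPU results behind a mutex.
type attestations struct {
	mu      sync.Mutex
	results map[string]bool
}

func newAttestations() *attestations {
	return &attestations{results: make(map[string]bool)}
}

// attest runs fn and records its boolean result. On error nothing is
// recorded, so an indeterminate attestation is never cached as a pass.
func (a *attestations) attest(id string, fn attestor) (bool, error) {
	passed, err := fn(id)
	if err != nil {
		return false, fmt.Errorf("attest %s: %w", id, err)
	}
	a.mu.Lock()
	defer a.mu.Unlock()
	a.results[id] = passed
	return passed, nil
}

// result distinguishes "never attested" (ok=false) from
// "attested and failed" (passed=false, ok=true).
func (a *attestations) result(id string) (passed, ok bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	passed, ok = a.results[id]
	return passed, ok
}

// eligible is true only for a recorded, passed result.
func (a *attestations) eligible(id string) bool {
	passed, ok := a.result(id)
	return ok && passed
}

func main() {
	a := newAttestations()
	fail := func(string) (bool, error) { return false, nil }
	broken := func(string) (bool, error) { return false, errTransport }

	if _, err := a.attest("gpu-0", fail); err != nil {
		fmt.Println("unexpected error:", err)
	}
	if _, err := a.attest("gpu-1", broken); err != nil {
		fmt.Println("gpu-1:", err)
	}
	for _, id := range []string{"gpu-0", "gpu-1"} {
		passed, ok := a.result(id)
		fmt.Println(id, "passed:", passed, "recorded:", ok, "eligible:", a.eligible(id))
	}
}
```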
func (*Manager) ChutesCapacity ¶
ChutesCapacity returns a Stats snapshot describing the GPU capacity available to external schedulers ("Chutes") after accounting for:
- The Helix PoW reservation fraction set by ReserveForHelixPoW.
- The Gepetto starvation guard: if the current Helix load (set by SetHelixLoad) is strictly above helixPoWStarvationThreshold (0.80), Available and FreeMemoryMB are forced to zero so Chutes receives no resources at all, regardless of reservation arithmetic.
The remaining fields (Total, Allocated, Unhealthy, Offline, TotalMemoryMB) reflect the full physical inventory; only Available and FreeMemoryMB are adjusted to show the Chutes-visible slice.
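A rough sketch of how the two adjustments could combine is given below. The documentation does not state the exact arithmetic, so scaling Available and FreeMemoryMB by (1 - reserved fraction) with truncation toward zero is purely an assumption, and `stats`, `chutesView`, and `starvationThreshold` are hypothetical names. Only the strict "above 0.80" comparison and the zeroing of the two fields come from the text.

```go
package main

import "fmt"

// stats is a hypothetical copy of the fields listed for Stats.
type stats struct {
	Total, Available, Allocated, Unhealthy, Offline int
	TotalMemoryMB, FreeMemoryMB                     int
}

// starvationThreshold mirrors the documented 0.80 value.
const starvationThreshold = 0.80

// chutesView derives the Chutes-visible slice from the full inventory.
// Assumption: the non-reserved remainder is computed by scaling and truncating.
func chutesView(full stats, reserved, helixLoad float64) stats {
	out := full // inventory-wide fields are passed through unchanged
	if helixLoad > starvationThreshold {
		// Starvation guard: strictly above the threshold, Chutes gets nothing.
		out.Available = 0
		out.FreeMemoryMB = 0
		return out
	}
	out.Available = int(float64(full.Available) * (1 - reserved))
	out.FreeMemoryMB = int(float64(full.FreeMemoryMB) * (1 - reserved))
	return out
}

func main() {
	full := stats{Total: 8, Available: 4, Allocated: 4, TotalMemoryMB: 8000, FreeMemoryMB: 4000}
	fmt.Printf("%+v\n", chutesView(full, 0.5, 0.50)) // half of the free pool is visible
	fmt.Printf("%+v\n", chutesView(full, 0.5, 0.95)) // guard engaged
}
```

Because the comparison is strict, a load of exactly 0.80 does not engage the guard in this sketch.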
func (*Manager) DetectGPUs ¶
DetectGPUs detects available GPUs using real, OS-native sources via the build-tag-selected detectGPUsPlatform (NVIDIA /proc on Linux, system_profiler on macOS, an explicit "unsupported" error elsewhere).
Production NEVER falls back to a mock inventory: if detection fails OR finds zero devices, DetectGPUs returns that error honestly so callers observe "no GPUs detected" rather than fabricated hardware (CLAUDE-1/CLAUDE-2). Tests that need a controlled inventory must use InjectGPUsForTest instead.
func (*Manager) GetGPUByID ¶
GetGPUByID returns a GPU by its ID.
func (*Manager) InjectGPUsForTest ¶
InjectGPUsForTest installs a controlled GPU inventory and marks the Manager as started, bypassing real hardware detection. It exists SOLELY as a test seam so unit tests can exercise allocation/stats/offline logic against a deterministic multi-GPU inventory on any host. Production code paths use DetectGPUs, which reads real hardware and never invokes this.
func (*Manager) IsChutesEligible ¶
IsChutesEligible reports whether the GPU identified by id has a RECORDED, PASSED attestation result. A GPU that was never attested, or whose last attestation failed, is not eligible. This is the predicate other components use to decide if a GPU may serve Chutes inference.
func (*Manager) ReleaseGPU ¶
ReleaseGPU releases the GPU allocated to the given jobID.
func (*Manager) ReserveForHelixPoW ¶
ReserveForHelixPoW records fraction as the portion [0, 1] of the total GPU pool reserved for Helix Proof-of-Work. It does NOT allocate or mark any GPU; it stores an accounting fraction that ChutesCapacity uses when computing the non-reserved remainder available to external schedulers (e.g. Chutes).
Validation:
- fraction must be in [0, 1] (inclusive on both ends).
- Storing 1.0 means the entire pool is claimed for Helix PoW; 0.0 means no reservation (ChutesCapacity equals the full Stats).
func (*Manager) SetHelixLoad ¶
SetHelixLoad records fraction as the current Helix PoW load [0, 1]. Callers (e.g. the PoW scheduler) report this continuously; ChutesCapacity reads it to decide whether the Gepetto starvation guard must engage.
A value > 0.80 triggers the hard-cap: ChutesCapacity returns zero Available and zero FreeMemoryMB regardless of the reservation fraction, preventing external schedulers from being assigned any GPU while Helix is under heavy load.
func (*Manager) SetOffline ¶
SetOffline marks a GPU as Offline for maintenance/draining. An Allocated GPU cannot be taken offline without first releasing its job, so SetOffline refuses to silently evict a running job and returns an error instead.
func (*Manager) SetOnline ¶
SetOnline returns an Offline GPU to the Available pool. GPUs that are not Offline are left unchanged (an Allocated or Unhealthy GPU must not be forced back to Available by this call).
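The SetOffline and SetOnline rules amount to a small state machine, sketched below with hypothetical names (`state`, `setOffline`, `setOnline`). The sketch assumes that any state other than Allocated may be taken offline; the text only states that Allocated may not.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a hypothetical enumeration of the device states named in the text.
type state int

const (
	stateAvailable state = iota
	stateAllocated
	stateUnhealthy
	stateOffline
)

// setOffline refuses to evict a running job: an Allocated device must be
// released first. Other states move to Offline (an assumption of this sketch).
func setOffline(s state) (state, error) {
	if s == stateAllocated {
		return s, errors.New("GPU is allocated; release the job before taking it offline")
	}
	return stateOffline, nil
}

// setOnline only moves Offline devices back to Available; any other state
// is returned unchanged.
func setOnline(s state) state {
	if s == stateOffline {
		return stateAvailable
	}
	return s
}

func main() {
	if _, err := setOffline(stateAllocated); err != nil {
		fmt.Println("refused:", err)
	}
	s, _ := setOffline(stateAvailable)
	fmt.Println("offline:", s == stateOffline)
	fmt.Println("back online:", setOnline(s) == stateAvailable)
	fmt.Println("unhealthy stays unhealthy:", setOnline(stateUnhealthy) == stateUnhealthy)
}
```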
func (*Manager) StartMonitoring ¶
StartMonitoring creates and starts a background health Monitor bound to this Manager. It is a no-op (returning the already-running Monitor) if monitoring is already active. The returned Monitor can be inspected; StopMonitoring tears it down.
func (*Manager) StopMonitoring ¶
func (m *Manager) StopMonitoring()
StopMonitoring stops the background Monitor if one is running.
type Monitor ¶
type Monitor struct {
// contains filtered or unexported fields
}
Monitor periodically checks GPU health and updates metrics.
type MonitorOption ¶
type MonitorOption func(*Monitor)
MonitorOption configures the Monitor.
func WithInterval ¶
func WithInterval(d time.Duration) MonitorOption
WithInterval sets the monitoring interval.
func WithTemperatureThreshold ¶
func WithTemperatureThreshold(t float64) MonitorOption
WithTemperatureThreshold sets the temperature threshold for marking GPUs unhealthy.
func WithUtilizationThreshold ¶
func WithUtilizationThreshold(u float64) MonitorOption
WithUtilizationThreshold sets the utilization threshold for marking GPUs unhealthy.
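MonitorOption follows the functional-options pattern. The sketch below shows how such options are typically applied over defaults; the lowercase types, the `newMonitor` constructor, the field names, and every default value are hypothetical, since the page does not document the Monitor's constructor or its defaults.

```go
package main

import (
	"fmt"
	"time"
)

// monitor is a hypothetical stand-in for Monitor's unexported fields.
type monitor struct {
	interval      time.Duration
	tempThreshold float64
	utilThreshold float64
}

// monitorOption mirrors the shape of MonitorOption: a function that
// mutates the monitor being built.
type monitorOption func(*monitor)

func withInterval(d time.Duration) monitorOption {
	return func(m *monitor) { m.interval = d }
}

func withTemperatureThreshold(t float64) monitorOption {
	return func(m *monitor) { m.tempThreshold = t }
}

func withUtilizationThreshold(u float64) monitorOption {
	return func(m *monitor) { m.utilThreshold = u }
}

// newMonitor starts from defaults (invented for this sketch) and then
// applies each option in order, so later options override earlier ones.
func newMonitor(opts ...monitorOption) *monitor {
	m := &monitor{interval: 30 * time.Second, tempThreshold: 85, utilThreshold: 95}
	for _, opt := range opts {
		opt(m)
	}
	return m
}

func main() {
	m := newMonitor(withInterval(5*time.Second), withTemperatureThreshold(90))
	fmt.Println(m.interval, m.tempThreshold, m.utilThreshold)
}
```

Options that are not supplied leave the corresponding default in place, which keeps call sites short when only one threshold needs tuning.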
type NvidiaGPUInfo ¶
NvidiaGPUInfo holds the fields extracted from a single /proc/driver/nvidia/gpus/<pci>/information file.
func DetectNvidiaViaProc ¶
func DetectNvidiaViaProc() ([]NvidiaGPUInfo, error)
DetectNvidiaViaProc detects NVIDIA GPUs by parsing the kernel-exposed /proc/driver/nvidia/gpus tree. It combines readNvidiaProcTree (real I/O) with ParseNvidiaProcTree (pure parsing).
Available only on Linux (build-tagged); callers on other platforms must use the platform-specific detectGPUsPlatform path.
func ParseNvidiaInformation ¶
func ParseNvidiaInformation(content string) (NvidiaGPUInfo, error)
ParseNvidiaInformation parses a single /proc/driver/nvidia/gpus/<pci>/information file body (supplied as a string). It recognises:
"Model:" - GPU model name (required; missing → error)
"GPU UUID:" - preferred UUID key
"UUID:" - fallback UUID key
"Video Memory:" - value in "MiB" or "MB" (malformed → MemoryMB=0, not error)
Returns an error when the content contains no "Model:" line (empty/garbage input). The returned NvidiaGPUInfo has Index==0; the caller assigns the real index.
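The parsing rules above can be sketched as a pure function over a string. This is an illustration under assumptions, not the package's code: `nvidiaInfo`, `parseInformation`, and `parseMemoryMB` are hypothetical names, and the exact whitespace handling and the accepted "number unit" memory format are guesses.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// nvidiaInfo is a hypothetical subset of NvidiaGPUInfo.
type nvidiaInfo struct {
	Model    string
	UUID     string
	MemoryMB int
}

// parseInformation scans "Key: value" lines and applies the documented rules:
// Model is required, "GPU UUID" wins over "UUID", bad memory yields 0.
func parseInformation(content string) (nvidiaInfo, error) {
	var info nvidiaInfo
	var fallbackUUID string
	for _, line := range strings.Split(content, "\n") {
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		value = strings.TrimSpace(value)
		switch strings.TrimSpace(key) {
		case "Model":
			info.Model = value
		case "GPU UUID":
			info.UUID = value
		case "UUID":
			fallbackUUID = value
		case "Video Memory":
			info.MemoryMB = parseMemoryMB(value)
		}
	}
	if info.Model == "" {
		return nvidiaInfo{}, errors.New(`no "Model:" line found`)
	}
	if info.UUID == "" {
		info.UUID = fallbackUUID
	}
	return info, nil
}

// parseMemoryMB accepts "<n> MiB" or "<n> MB"; anything else maps to 0.
func parseMemoryMB(v string) int {
	fields := strings.Fields(v)
	if len(fields) != 2 || (fields[1] != "MiB" && fields[1] != "MB") {
		return 0
	}
	n, err := strconv.Atoi(fields[0])
	if err != nil || n < 0 {
		return 0
	}
	return n
}

func main() {
	body := "Model: \t NVIDIA A100\nGPU UUID: GPU-1234\nVideo Memory: 40960 MiB\n"
	info, err := parseInformation(body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("%+v\n", info)
}
```

Keeping the parser free of I/O, as the package overview describes, means it can be exercised with literal strings on any platform.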
func ParseNvidiaProcTree ¶
func ParseNvidiaProcTree(files map[string]string) ([]NvidiaGPUInfo, error)
ParseNvidiaProcTree parses a set of /proc/driver/nvidia/gpus "information" file contents. The map key is the full path (e.g. "/proc/driver/nvidia/gpus/0000:01:00.0/information"); the value is the file body.
Each entry whose content contains no "Model:" line is silently skipped (not an error). Surviving entries are sorted by map key and assigned contiguous Index values starting at 0. The PCIID is derived from the last path component of the directory (the PCI address portion, one directory up from the "information" leaf).
An empty map returns ([]NvidiaGPUInfo{}, nil).
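The ordering and PCIID rules can be sketched independently of the field parsing. In the sketch below, `entry` and `indexTree` are hypothetical names, and the substring check for "Model:" is a simplified stand-in for whatever test the real parser applies.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// entry is a hypothetical subset of NvidiaGPUInfo holding only the
// fields assigned at the tree level.
type entry struct {
	Index int
	PCIID string
}

// indexTree skips bodies without a "Model:" line, sorts the surviving
// paths, and assigns contiguous indexes starting at 0.
func indexTree(files map[string]string) []entry {
	keys := make([]string, 0, len(files))
	for p, body := range files {
		if !strings.Contains(body, "Model:") {
			continue // silently skipped, not an error
		}
		keys = append(keys, p)
	}
	sort.Strings(keys) // map iteration order is random, so sort explicitly

	out := make([]entry, 0, len(keys))
	for i, p := range keys {
		// The PCI address is the directory holding the "information" leaf.
		out = append(out, entry{Index: i, PCIID: path.Base(path.Dir(p))})
	}
	return out
}

func main() {
	files := map[string]string{
		"/proc/driver/nvidia/gpus/0000:41:00.0/information": "Model: B\n",
		"/proc/driver/nvidia/gpus/0000:01:00.0/information": "Model: A\n",
		"/proc/driver/nvidia/gpus/0000:21:00.0/information": "garbage\n",
	}
	for _, e := range indexTree(files) {
		fmt.Println(e.Index, e.PCIID)
	}
}
```

Sorting by key before assigning Index makes the numbering deterministic across runs, and skipped entries leave no gaps.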
type Reservation ¶
type Reservation struct {
// contains filtered or unexported fields
}
Reservation enforces a dual-class capacity split plus a starvation hard-cap. It is safe for concurrent use.
func NewReservation ¶
func NewReservation(cfg ReservationConfig) (*Reservation, error)
NewReservation builds a Reservation from cfg, validating the parameters.
func (*Reservation) Admit ¶
func (r *Reservation) Admit(class WorkloadClass, amount int) Decision
Admit reports whether a new allocation of amount units for class would be granted under the current split and load WITHOUT mutating state. It is the pure decision function; Reserve applies it and then commits.
func (*Reservation) AvailableFor ¶
func (r *Reservation) AvailableFor(class WorkloadClass) int
AvailableFor reports how many additional units class could currently reserve, accounting for the class split, the pool capacity, and the starvation hard-cap. It is consistent with Admit: AvailableFor(c) == n implies Admit(c, n) is admitted and Admit(c, n+1) is not (when n+1 <= capacity).
func (*Reservation) Capacity ¶
func (r *Reservation) Capacity() int
Capacity returns the total pool capacity in units.
func (*Reservation) Release ¶
func (r *Reservation) Release(class WorkloadClass, amount int) error
Release returns amount units previously reserved for class back to the pool. Releasing more than is currently in use for the class is an error and leaves usage unchanged, so a buggy caller cannot drive usage negative and silently hand the freed-but-never-held units to the other class.
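The over-release guard can be sketched as a check made before any mutation. The `usage` type and its `release` method are hypothetical, classes are plain strings here for brevity, and rejecting a negative amount is an assumption the text does not spell out.

```go
package main

import (
	"fmt"
	"sync"
)

// usage is a hypothetical per-class counter guarded by a mutex.
type usage struct {
	mu   sync.Mutex
	used map[string]int
}

// release validates first and mutates second, so a rejected call leaves
// the counters exactly as they were.
func (u *usage) release(class string, amount int) error {
	if amount < 0 {
		// Assumption: negative amounts are treated as caller bugs.
		return fmt.Errorf("release %s: negative amount %d", class, amount)
	}
	u.mu.Lock()
	defer u.mu.Unlock()
	if amount > u.used[class] {
		return fmt.Errorf("release %s: %d requested but only %d in use",
			class, amount, u.used[class])
	}
	u.used[class] -= amount
	return nil
}

func main() {
	u := &usage{used: map[string]int{"interactive": 3, "batch": 1}}
	if err := u.release("batch", 2); err != nil {
		fmt.Println("rejected:", err)
	}
	fmt.Println("batch in use:", u.used["batch"])
}
```

Without the guard, the batch counter in this example would drop to -1, and the phantom unit would appear as free capacity to the other class.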
func (*Reservation) Reserve ¶
func (r *Reservation) Reserve(class WorkloadClass, amount int) (Decision, error)
Reserve attempts to claim amount units for class. On success it commits the usage and returns a nil error; on denial it returns an error describing the admission decision and leaves usage unchanged.
func (*Reservation) Reserved ¶
func (r *Reservation) Reserved(class WorkloadClass) int
Reserved reports the configured reserved-unit ceiling for class.
func (*Reservation) Used ¶
func (r *Reservation) Used(class WorkloadClass) int
Used reports the current in-use units for class.
type ReservationConfig ¶
type ReservationConfig struct {
// Capacity is the total number of allocatable units in the pool (e.g. the
// number of GPUs, or finer-grained slices). Must be > 0.
Capacity int
// InteractiveRatio is the fraction of Capacity reserved for the Interactive
// class, in (0,1). With 0.7, batch may use at most 30% of capacity even
// while interactive capacity sits idle.
InteractiveRatio float64
// HighWaterRatio is the total-load fraction in (0,1] at or above which the
// batch hard cap engages. Defaults to 0.8 when zero.
HighWaterRatio float64
// BatchHardCapRatio is the fraction of Capacity to which Batch is clamped
// once load is at/above the high-water mark. Defaults to InteractiveRatio's
// complement (the batch reserve) when zero, i.e. batch may not grow beyond
// its own reserve under pressure.
BatchHardCapRatio float64
}
ReservationConfig parameterizes a Reservation.
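The defaults described in the field comments can be sketched as a resolution step run before a Reservation is built. The `reservationConfig` copy, the `resolveConfig` function, and the error texts are hypothetical; the ranges and the two defaults come from the comments above, while the range check on BatchHardCapRatio is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// reservationConfig is a hypothetical copy of ReservationConfig's fields.
type reservationConfig struct {
	Capacity          int
	InteractiveRatio  float64
	HighWaterRatio    float64
	BatchHardCapRatio float64
}

// resolveConfig validates cfg and fills zero-valued optional fields.
func resolveConfig(cfg reservationConfig) (reservationConfig, error) {
	if cfg.Capacity <= 0 {
		return cfg, errors.New("Capacity must be > 0")
	}
	if cfg.InteractiveRatio <= 0 || cfg.InteractiveRatio >= 1 {
		return cfg, errors.New("InteractiveRatio must be in (0,1)")
	}
	if cfg.HighWaterRatio == 0 {
		cfg.HighWaterRatio = 0.8 // documented default
	}
	if cfg.HighWaterRatio <= 0 || cfg.HighWaterRatio > 1 {
		return cfg, errors.New("HighWaterRatio must be in (0,1]")
	}
	if cfg.BatchHardCapRatio == 0 {
		// Documented default: the batch reserve, i.e. InteractiveRatio's complement.
		cfg.BatchHardCapRatio = 1 - cfg.InteractiveRatio
	}
	if cfg.BatchHardCapRatio < 0 || cfg.BatchHardCapRatio > 1 {
		// Assumption: the hard cap is itself a fraction of Capacity.
		return cfg, errors.New("BatchHardCapRatio must be in [0,1]")
	}
	return cfg, nil
}

func main() {
	cfg, err := resolveConfig(reservationConfig{Capacity: 8, InteractiveRatio: 0.75})
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

With InteractiveRatio set to 0.75 and the other ratios left at zero, the resolved configuration engages the hard cap at 80% load and clamps batch to a quarter of the pool.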
type SharingMode ¶
type SharingMode int
SharingMode describes how a single GPU device is shared among multiple jobs.
const (
// to co-locate a second job is rejected until the first one is released.
ShareExclusive SharingMode = iota
// NVIDIA Multi-Process Service (or equivalent). The device must not
// already be set to Exclusive mode.
ShareMPS
// SharingRequest carries its own TimeSliceMS; admission is rejected when
// quantumUsedMS + req.TimeSliceMS would exceed the device total quantum.
ShareTimeSlice
// admitted job consumes one partition; admission is rejected when the
// device limit is reached.
ShareMIG
)
func (SharingMode) String ¶
func (m SharingMode) String() string
String returns the human-readable name of the SharingMode.
type SharingRequest ¶
type SharingRequest struct {
// Mode is the sharing mode being requested.
Mode SharingMode
// JobID is a unique identifier for the job.
JobID string
// TimeSliceMS is the milliseconds of time-slice budget the job requires.
// Only relevant for ShareTimeSlice; ignored for other modes.
TimeSliceMS int
// MIGProfile is the requested MIG profile identifier (e.g. "1g.5gb").
// Only relevant for ShareMIG; ignored for other modes.
MIGProfile string
}
SharingRequest is the per-job admission request for a device.
type Stats ¶
type Stats struct {
Total int `json:"total"`
Available int `json:"available"`
Allocated int `json:"allocated"`
Unhealthy int `json:"unhealthy"`
Offline int `json:"offline"`
TotalMemoryMB int `json:"total_memory_mb"`
// FreeMemoryMB is the total memory of GPUs that are currently Available.
FreeMemoryMB int `json:"free_memory_mb"`
}
Stats summarizes the current GPU inventory. It is the aggregate view used by observability and scheduling components.
type WorkloadClass ¶
type WorkloadClass int
WorkloadClass partitions GPU demand into two service tiers that share one pool. Interactive (latency-sensitive) work is the high-priority class whose reserved slice batch work may never consume; Batch (best-effort) is the lower-priority class that is hard-capped under load so it cannot starve interactive work — the "Gepetto" anti-starvation guarantee.
const (
// Interactive is the high-priority, latency-sensitive class.
Interactive WorkloadClass = iota
// Batch is the lower-priority, best-effort class.
Batch
)
func (WorkloadClass) String ¶
func (c WorkloadClass) String() string