Documentation ¶
Overview ¶
Package cuda provides low-level bindings for the CUDA runtime API using dlopen/dlsym (no CGo). CUDA availability is detected at runtime; when libcudart is not loadable the package functions return descriptive errors.
Index ¶
- func Available() bool
- func Ccall(fn uintptr, args ...uintptr) uintptr
- func DeviceComputeCapability(deviceID int) (major, minor int, err error)
- func DeviceGetAttribute(attr, deviceID int) (int, error)
- func DlopenKernels() (uintptr, error)
- func DlopenPath(path string) (uintptr, error)
- func Dlsym(handle uintptr, name string) (uintptr, error)
- func Free(devPtr unsafe.Pointer) error
- func GetDeviceCount() (int, error)
- func GraphDestroy(g *Graph) error
- func GraphExecDestroy(ge *GraphExec) error
- func GraphLaunch(ge *GraphExec, s *Stream) error
- func Malloc(size int) (unsafe.Pointer, error)
- func MallocManaged(size int) (unsafe.Pointer, error)
- func ManagedMemorySupported(deviceID int) bool
- func Memcpy(dst, src unsafe.Pointer, count int, kind MemcpyKind) error
- func MemcpyAsync(dst, src unsafe.Pointer, count int, kind MemcpyKind, stream *Stream) error
- func MemcpyPeer(dst unsafe.Pointer, dstDevice int, src unsafe.Pointer, srcDevice int, ...) error
- func SetDefaultArenaPool(a *ArenaPool)
- func SetDefaultMemPool(p *MemPool)
- func SetDevice(deviceID int) error
- func StreamBeginCapture(s *Stream) error
- type ArenaPool
- func (a *ArenaPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
- func (a *ArenaPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
- func (a *ArenaPool) Capacity() int
- func (a *ArenaPool) Drain() error
- func (a *ArenaPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (a *ArenaPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (a *ArenaPool) HitMissStats() (hits, misses, resets int64)
- func (a *ArenaPool) IsManaged() bool
- func (a *ArenaPool) Reset()
- func (a *ArenaPool) SetResetFloor(floor int)
- func (a *ArenaPool) Stats() (allocations int, totalBytes int)
- func (a *ArenaPool) UsedBytes() int
- type CUDALib
- type Graph
- type GraphExec
- type MemPool
- func (p *MemPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
- func (p *MemPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
- func (p *MemPool) Drain() error
- func (p *MemPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (p *MemPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (p *MemPool) HitMissStats() (hits, misses, frees int64)
- func (p *MemPool) ResetHitMissStats()
- func (p *MemPool) Stats() (allocations int, totalBytes int)
- type MemcpyKind
- type Stream
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Available ¶
func Available() bool
Available returns true if the CUDA runtime is loadable on this machine. The result is cached after the first call.
func Ccall ¶
func Ccall(fn uintptr, args ...uintptr) uintptr
Ccall calls a C function pointer with up to 12 arguments using the platform-specific zero-CGo mechanism. Exported for use by the kernels package.
func DeviceComputeCapability ¶
func DeviceComputeCapability(deviceID int) (major, minor int, err error)
DeviceComputeCapability returns the major and minor compute capability.
func DeviceGetAttribute ¶
func DeviceGetAttribute(attr, deviceID int) (int, error)
DeviceGetAttribute queries a device attribute.
func DlopenKernels ¶
func DlopenKernels() (uintptr, error)
DlopenKernels loads the custom kernels shared library (libkernels.so) and returns the dlopen handle. Returns an error if the library cannot be found.
func DlopenPath ¶
func DlopenPath(path string) (uintptr, error)
DlopenPath opens a shared library at the given path via dlopen. Returns the handle or an error if the library cannot be loaded.
func Dlsym ¶
func Dlsym(handle uintptr, name string) (uintptr, error)
Dlsym resolves a symbol from a dlopen handle. Returns the function pointer address or an error if the symbol is not found.
func GetDeviceCount ¶
func GetDeviceCount() (int, error)
GetDeviceCount returns the number of CUDA-capable devices.
func GraphExecDestroy ¶
func GraphExecDestroy(ge *GraphExec) error
GraphExecDestroy releases an executable graph.
func GraphLaunch ¶
func GraphLaunch(ge *GraphExec, s *Stream) error
GraphLaunch launches an executable graph on the given stream. This replays the entire captured sequence of operations with minimal overhead.
func MallocManaged ¶
func MallocManaged(size int) (unsafe.Pointer, error)
MallocManaged allocates size bytes of unified memory accessible from both host and device.
func ManagedMemorySupported ¶
func ManagedMemorySupported(deviceID int) bool
ManagedMemorySupported returns true if the device supports unified (managed) memory with concurrent access from CPU and GPU. On GB10 with NVLink-C2C and shared LPDDR5x, this avoids all explicit H2D/D2H copies.
func Memcpy ¶
func Memcpy(dst, src unsafe.Pointer, count int, kind MemcpyKind) error
Memcpy copies count bytes between host and device memory.
func MemcpyAsync ¶
func MemcpyAsync(dst, src unsafe.Pointer, count int, kind MemcpyKind, stream *Stream) error
MemcpyAsync copies count bytes asynchronously on the given stream.
func MemcpyPeer ¶
func MemcpyPeer(dst unsafe.Pointer, dstDevice int, src unsafe.Pointer, srcDevice int, count int) error
MemcpyPeer copies count bytes between devices using peer-to-peer transfer.
func SetDefaultArenaPool ¶
func SetDefaultArenaPool(a *ArenaPool)
SetDefaultArenaPool registers an ArenaPool as the process-wide default.
func SetDefaultMemPool ¶
func SetDefaultMemPool(p *MemPool)
SetDefaultMemPool registers a MemPool as the process-wide default. Typically called by GPUEngine during initialization.
func StreamBeginCapture ¶
func StreamBeginCapture(s *Stream) error
StreamBeginCapture starts capturing GPU operations on the given stream. All kernel launches, memcpys, and other operations on the stream are recorded into a graph instead of being executed.
Types ¶
type ArenaPool ¶
type ArenaPool struct {
// contains filtered or unexported fields
}
ArenaPool is a bump-pointer allocator backed by a single large CUDA allocation. Each Alloc advances the offset within the pre-allocated region. Free is a no-op for individual pointers. Call Reset() between forward passes to reclaim all arena memory at once (zero-cost compared to per-pointer free).
This eliminates cudaMalloc/cudaFree overhead during inference, which is the #1 bottleneck for per-token latency on the DGX Spark GPU.
On devices with concurrent managed memory support (e.g., GB10 with NVLink-C2C and shared LPDDR5x), the arena is allocated with cudaMallocManaged. This makes the arena accessible from both CPU and GPU without explicit H2D copies.
Weight tensors and KV cache should NOT use the arena (they persist across passes). The arena is only for per-pass intermediates.
func DefaultArenaPool ¶
func DefaultArenaPool() *ArenaPool
DefaultArenaPool returns the process-wide ArenaPool singleton, or nil if none has been registered.
func NewArenaPool ¶
NewArenaPool allocates a contiguous GPU region of the given capacity bytes on the specified device. A fallback MemPool handles any overflow. On devices with concurrent managed memory support, the arena uses cudaMallocManaged to enable zero-copy CPU/GPU access.
func (*ArenaPool) Alloc ¶
func (a *ArenaPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
Alloc returns a device pointer of at least byteSize bytes from the arena. Allocations are 256-byte aligned for GPU coalescing. If the arena is full, the allocation falls back to the MemPool.
func (*ArenaPool) AllocManaged ¶
func (a *ArenaPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
AllocManaged delegates to the fallback MemPool (arena is device-only).
func (*ArenaPool) Free ¶
func (a *ArenaPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
Free is a no-op for arena pointers (reclaimed in bulk via Reset). Fallback pointers are returned to the MemPool.
func (*ArenaPool) FreeManaged ¶
func (a *ArenaPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
FreeManaged delegates to the fallback MemPool.
func (*ArenaPool) HitMissStats ¶
func (a *ArenaPool) HitMissStats() (hits, misses, resets int64)
HitMissStats returns arena hits, fallback misses, and reset count.
func (*ArenaPool) IsManaged ¶
func (a *ArenaPool) IsManaged() bool
IsManaged returns true if the arena was allocated with managed memory.
func (*ArenaPool) Reset ¶
func (a *ArenaPool) Reset()
Reset rewinds the arena offset to the reset floor (default 0), reclaiming per-pass allocations while preserving buffers below the floor (e.g. CUDA graph captured buffers).
func (*ArenaPool) SetResetFloor ¶
func (a *ArenaPool) SetResetFloor(floor int)
SetResetFloor sets the minimum offset that Reset will rewind to. Allocations below this offset are preserved across resets. This is used by CUDA graph capture to protect GPU buffers that the captured graph references.
type CUDALib ¶
type CUDALib struct {
// contains filtered or unexported fields
}
CUDALib holds dlopen handles and resolved function pointers for CUDA runtime functions. All function pointers are resolved at Open() time via dlsym. The actual calls go through platform-specific ccall implementations that do NOT use CGo (zero runtime.cgocall overhead).
func Lib ¶
func Lib() *CUDALib
Lib returns the global CUDALib instance, or nil if CUDA is not available.
func Open ¶
Open loads libcudart via dlopen and resolves all CUDA runtime function pointers via dlsym. Returns an error if CUDA is not available (library not found or symbols missing).
func (*CUDALib) GraphAvailable ¶
GraphAvailable returns true if CUDA graph capture APIs are available.
type Graph ¶
type Graph struct {
// contains filtered or unexported fields
}
Graph wraps a cudaGraph_t handle.
func StreamEndCapture ¶
StreamEndCapture stops capturing on the stream and returns the captured graph.
type GraphExec ¶
type GraphExec struct {
// contains filtered or unexported fields
}
GraphExec wraps a cudaGraphExec_t handle for graph replay.
func GraphInstantiate ¶
GraphInstantiate creates an executable graph from a captured graph. The executable graph can be launched repeatedly without re-capturing.
type MemPool ¶
type MemPool struct {
// contains filtered or unexported fields
}
MemPool is a per-device, size-bucketed free-list allocator for CUDA device memory. It caches freed allocations by (deviceID, byteSize) for reuse, avoiding the overhead of cudaMalloc/cudaFree on every operation and preventing cross-device pointer reuse in multi-GPU setups.
func DefaultMemPool ¶
func DefaultMemPool() *MemPool
DefaultMemPool returns a process-wide MemPool singleton. Returns nil if called before SetDefaultMemPool.
func (*MemPool) Alloc ¶
func (p *MemPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
Alloc returns a device pointer of at least the given byte size on the specified device. Sizes >= 4KB are rounded up to power-of-2 buckets for better reuse across slightly varying allocation sizes. If a cached allocation exists for the bucket, it is reused. Otherwise SetDevice is called and a fresh cudaMalloc is performed at the bucketed size.
func (*MemPool) AllocManaged ¶
func (p *MemPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
AllocManaged returns a unified memory pointer of at least the given byte size. Uses the same power-of-2 bucketing as Alloc.
func (*MemPool) Drain ¶
func (p *MemPool) Drain() error
Drain releases all cached device memory back to CUDA. Iterates all devices, calling SetDevice before freeing each device's pointers. Returns the first error encountered, but attempts to free all pointers.
func (*MemPool) Free ¶
func (p *MemPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
Free returns a device pointer to the pool for later reuse. The byteSize is bucketed to match the Alloc bucket so the pointer can be found on the next Alloc of a similar size.
func (*MemPool) FreeManaged ¶
func (p *MemPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
FreeManaged returns a managed memory pointer to the pool for later reuse.
func (*MemPool) HitMissStats ¶
func (p *MemPool) HitMissStats() (hits, misses, frees int64)
HitMissStats returns the cache hit, miss, and free counts since the pool was created. Used for diagnosing pool effectiveness.
func (*MemPool) ResetHitMissStats ¶
func (p *MemPool) ResetHitMissStats()
ResetHitMissStats resets the cache hit/miss/free counters.
type MemcpyKind ¶
type MemcpyKind int
MemcpyKind specifies the direction of a memory copy.
const (
	// MemcpyHostToDevice copies from host to device.
	MemcpyHostToDevice MemcpyKind = 1
	// MemcpyDeviceToHost copies from device to host.
	MemcpyDeviceToHost MemcpyKind = 2
	// MemcpyDeviceToDevice copies from device to device.
	MemcpyDeviceToDevice MemcpyKind = 3
)
type Stream ¶
type Stream struct {
// contains filtered or unexported fields
}
Stream wraps a cudaStream_t handle for asynchronous kernel execution.
func StreamFromPtr ¶
StreamFromPtr wraps an existing cudaStream_t handle as a Stream. The caller retains ownership of the handle; Destroy() must NOT be called on the returned Stream (it would destroy the engine's stream).
func (*Stream) Synchronize ¶
Synchronize blocks until all work on this stream completes.