cuda

package
v0.1.0
Published: Mar 16, 2026 License: Apache-2.0 Imports: 5 Imported by: 0

Documentation

Overview

Package cuda provides low-level bindings for the CUDA runtime API using dlopen/dlsym (no CGo). CUDA availability is detected at runtime; when libcudart is not loadable the package functions return descriptive errors.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Available

func Available() bool

Available returns true if the CUDA runtime is loadable on this machine. The result is cached after the first call.
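A typical startup check might look like the following sketch. The import path `example.com/m/cuda` is a placeholder for this module's actual path:

```go
package main

import (
	"fmt"

	cuda "example.com/m/cuda" // placeholder import path; use this module's real path
)

func main() {
	// Available caches its result, so repeated calls are cheap.
	if !cuda.Available() {
		fmt.Println("CUDA runtime not loadable; falling back to CPU")
		return
	}
	n, err := cuda.GetDeviceCount()
	if err != nil {
		fmt.Println("device query failed:", err)
		return
	}
	fmt.Printf("found %d CUDA device(s)\n", n)
}
```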

func Ccall

func Ccall(fn uintptr, args ...uintptr) uintptr

Ccall calls a C function pointer with up to 12 arguments using the platform-specific zero-CGo mechanism. Exported for use by the kernels package.

func DeviceComputeCapability

func DeviceComputeCapability(deviceID int) (major, minor int, err error)

DeviceComputeCapability returns the major and minor compute capability.
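Combining this with GetDeviceCount and ManagedMemorySupported gives a simple device inventory; a sketch, again with a placeholder import path:

```go
package main

import (
	"fmt"

	cuda "example.com/m/cuda" // placeholder import path
)

func main() {
	n, err := cuda.GetDeviceCount()
	if err != nil {
		fmt.Println(err)
		return
	}
	for i := 0; i < n; i++ {
		major, minor, err := cuda.DeviceComputeCapability(i)
		if err != nil {
			continue
		}
		// Print each device's compute capability and unified-memory support.
		fmt.Printf("device %d: compute %d.%d managed=%v\n",
			i, major, minor, cuda.ManagedMemorySupported(i))
	}
}
```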

func DeviceGetAttribute

func DeviceGetAttribute(attr, deviceID int) (int, error)

DeviceGetAttribute queries a device attribute.

func DlopenKernels

func DlopenKernels() (uintptr, error)

DlopenKernels loads the custom kernels shared library (libkernels.so) and returns the dlopen handle. Returns an error if the library cannot be found.

func DlopenPath

func DlopenPath(path string) (uintptr, error)

DlopenPath opens a shared library at the given path via dlopen. Returns the handle or an error if the library cannot be loaded.

func Dlsym

func Dlsym(handle uintptr, name string) (uintptr, error)

Dlsym resolves a symbol from a dlopen handle. Returns the function pointer address or an error if the symbol is not found.
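DlopenPath, Dlsym, and Ccall compose into a manual symbol call. The sketch below resolves `cudaRuntimeGetVersion` (a real libcudart symbol that writes the version into an `int*`, e.g. 12040 for CUDA 12.4); the library name and import path are illustrative:

```go
package main

import (
	"fmt"
	"unsafe"

	cuda "example.com/m/cuda" // placeholder import path
)

func main() {
	// The soname is illustrative; DlopenPath accepts any loadable path.
	handle, err := cuda.DlopenPath("libcudart.so.12")
	if err != nil {
		fmt.Println("dlopen failed:", err)
		return
	}
	sym, err := cuda.Dlsym(handle, "cudaRuntimeGetVersion")
	if err != nil {
		fmt.Println("dlsym failed:", err)
		return
	}
	// cudaRuntimeGetVersion(int *version) fills in 1000*major + 10*minor.
	var version int32
	cuda.Ccall(sym, uintptr(unsafe.Pointer(&version)))
	fmt.Println("CUDA runtime version:", version)
}
```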

func Free

func Free(devPtr unsafe.Pointer) error

Free releases device memory previously allocated with Malloc or MallocManaged.

func GetDeviceCount

func GetDeviceCount() (int, error)

GetDeviceCount returns the number of CUDA-capable devices.

func GraphDestroy

func GraphDestroy(g *Graph) error

GraphDestroy releases a captured graph.

func GraphExecDestroy

func GraphExecDestroy(ge *GraphExec) error

GraphExecDestroy releases an executable graph.

func GraphLaunch

func GraphLaunch(ge *GraphExec, s *Stream) error

GraphLaunch launches an executable graph on the given stream. This replays the entire captured sequence of operations with minimal overhead.

func Malloc

func Malloc(size int) (unsafe.Pointer, error)

Malloc allocates size bytes on the CUDA device and returns a device pointer.

func MallocManaged

func MallocManaged(size int) (unsafe.Pointer, error)

MallocManaged allocates size bytes of unified memory accessible from both host and device.

func ManagedMemorySupported

func ManagedMemorySupported(deviceID int) bool

ManagedMemorySupported returns true if the device supports unified (managed) memory with concurrent access from CPU and GPU. On GB10 with NVLink-C2C and shared LPDDR5x, this avoids all explicit H2D/D2H copies.

func Memcpy

func Memcpy(dst, src unsafe.Pointer, count int, kind MemcpyKind) error

Memcpy copies count bytes between host and device memory.
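Malloc, Memcpy, and Free compose into the usual host-to-device round trip; a sketch with a placeholder import path:

```go
package main

import (
	"fmt"
	"unsafe"

	cuda "example.com/m/cuda" // placeholder import path
)

func main() {
	src := []float32{1, 2, 3, 4}
	nbytes := len(src) * 4 // 4 bytes per float32

	dev, err := cuda.Malloc(nbytes)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer cuda.Free(dev)

	// Host -> device, then back into a fresh slice.
	if err := cuda.Memcpy(dev, unsafe.Pointer(&src[0]), nbytes, cuda.MemcpyHostToDevice); err != nil {
		fmt.Println(err)
		return
	}
	dst := make([]float32, len(src))
	if err := cuda.Memcpy(unsafe.Pointer(&dst[0]), dev, nbytes, cuda.MemcpyDeviceToHost); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(dst)
}
```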

func MemcpyAsync

func MemcpyAsync(dst, src unsafe.Pointer, count int, kind MemcpyKind, stream *Stream) error

MemcpyAsync copies count bytes asynchronously on the given stream.

func MemcpyPeer

func MemcpyPeer(dst unsafe.Pointer, dstDevice int, src unsafe.Pointer, srcDevice int, count int) error

MemcpyPeer copies count bytes between devices using peer-to-peer transfer.

func SetDefaultArenaPool

func SetDefaultArenaPool(a *ArenaPool)

SetDefaultArenaPool registers an ArenaPool as the process-wide default.

func SetDefaultMemPool

func SetDefaultMemPool(p *MemPool)

SetDefaultMemPool registers a MemPool as the process-wide default. Typically called by GPUEngine during initialization.

func SetDevice

func SetDevice(deviceID int) error

SetDevice sets the current CUDA device.

func StreamBeginCapture

func StreamBeginCapture(s *Stream) error

StreamBeginCapture starts capturing GPU operations on the given stream. All kernel launches, memcpys, and other operations on the stream are recorded into a graph instead of being executed.
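The full capture-instantiate-replay cycle ties StreamBeginCapture, StreamEndCapture, GraphInstantiate, and GraphLaunch together. The sketch below records a single async copy and replays it; real use would record a whole forward pass. Import path is a placeholder:

```go
package main

import (
	"fmt"

	cuda "example.com/m/cuda" // placeholder import path
)

func main() {
	s, err := cuda.CreateStream()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer s.Destroy()

	a, _ := cuda.Malloc(1024)
	b, _ := cuda.Malloc(1024)
	defer cuda.Free(a)
	defer cuda.Free(b)

	// Capture once: work enqueued on s is recorded, not executed.
	if err := cuda.StreamBeginCapture(s); err != nil {
		fmt.Println(err)
		return
	}
	cuda.MemcpyAsync(b, a, 1024, cuda.MemcpyDeviceToDevice, s)
	g, err := cuda.StreamEndCapture(s)
	if err != nil {
		fmt.Println(err)
		return
	}
	ge, err := cuda.GraphInstantiate(g)
	if err != nil {
		fmt.Println(err)
		return
	}
	cuda.GraphDestroy(g) // the template is no longer needed once instantiated

	// Replay many times, one launch call each.
	for i := 0; i < 100; i++ {
		cuda.GraphLaunch(ge, s)
	}
	s.Synchronize()
	cuda.GraphExecDestroy(ge)
}
```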

Types

type ArenaPool

type ArenaPool struct {
	// contains filtered or unexported fields
}

ArenaPool is a bump-pointer allocator backed by a single large CUDA allocation. Each Alloc advances the offset within the pre-allocated region. Free is a no-op for individual pointers. Call Reset() between forward passes to reclaim all arena memory at once (zero-cost compared to per-pointer free).

This eliminates cudaMalloc/cudaFree overhead during inference, which is the #1 bottleneck for per-token latency on the DGX Spark GPU.

On devices with concurrent managed memory support (e.g., GB10 with NVLink-C2C and shared LPDDR5x), the arena is allocated with cudaMallocManaged. This makes the arena accessible from both CPU and GPU without explicit H2D copies.

Weight tensors and KV cache should NOT use the arena (they persist across passes). The arena is only for per-pass intermediates.
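The intended per-pass pattern — allocate intermediates from the arena, then Reset between passes — can be sketched as follows (placeholder import path; capacity and sizes are illustrative):

```go
package main

import (
	"fmt"

	cuda "example.com/m/cuda" // placeholder import path
)

func main() {
	fallback := cuda.NewMemPool()
	arena, err := cuda.NewArenaPool(0, 256<<20, fallback) // 256 MiB on device 0
	if err != nil {
		fmt.Println(err)
		return
	}
	defer arena.Drain()
	cuda.SetDefaultArenaPool(arena)

	for pass := 0; pass < 3; pass++ {
		// Per-pass intermediates come from the arena (bump-pointer, cheap).
		buf, err := arena.Alloc(0, 1<<20)
		if err != nil {
			fmt.Println(err)
			return
		}
		_ = buf // ... run kernels against buf ...

		arena.Reset() // reclaim every per-pass allocation at once
	}
	hits, misses, resets := arena.HitMissStats()
	fmt.Println(hits, misses, resets)
}
```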

func DefaultArenaPool

func DefaultArenaPool() *ArenaPool

DefaultArenaPool returns the process-wide ArenaPool singleton, or nil.

func NewArenaPool

func NewArenaPool(deviceID, capacityBytes int, fallback *MemPool) (*ArenaPool, error)

NewArenaPool allocates a contiguous GPU region of the given capacity bytes on the specified device. A fallback MemPool handles any overflow. On devices with concurrent managed memory support, the arena uses cudaMallocManaged to enable zero-copy CPU/GPU access.

func (*ArenaPool) Alloc

func (a *ArenaPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)

Alloc returns a device pointer of at least byteSize bytes from the arena. Allocations are 256-byte aligned for GPU coalescing. If the arena is full, falls back to the MemPool.
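The documented 256-byte alignment is simple bit arithmetic. The standalone function below mirrors that behavior for illustration; it is not the package's actual code:

```go
package main

import "fmt"

// align256 rounds n up to the next multiple of 256, mirroring the
// documented per-allocation alignment of the arena (illustrative only).
func align256(n int) int {
	return (n + 255) &^ 255
}

func main() {
	fmt.Println(align256(1))   // → 256
	fmt.Println(align256(256)) // → 256
	fmt.Println(align256(300)) // → 512
}
```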

func (*ArenaPool) AllocManaged

func (a *ArenaPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)

AllocManaged delegates to the fallback MemPool (arena is device-only).

func (*ArenaPool) Capacity

func (a *ArenaPool) Capacity() int

Capacity returns the total arena capacity in bytes.

func (*ArenaPool) Drain

func (a *ArenaPool) Drain() error

Drain frees the underlying CUDA allocation and drains the fallback pool.

func (*ArenaPool) Free

func (a *ArenaPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)

Free is a no-op for arena pointers (reclaimed in bulk via Reset). Fallback pointers are returned to the MemPool.

func (*ArenaPool) FreeManaged

func (a *ArenaPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)

FreeManaged delegates to the fallback MemPool.

func (*ArenaPool) HitMissStats

func (a *ArenaPool) HitMissStats() (hits, misses, resets int64)

HitMissStats returns arena hits, fallback misses, and reset count.

func (*ArenaPool) IsManaged

func (a *ArenaPool) IsManaged() bool

IsManaged returns true if the arena was allocated with managed memory.

func (*ArenaPool) Reset

func (a *ArenaPool) Reset()

Reset rewinds the arena offset to the reset floor (default 0), reclaiming per-pass allocations while preserving buffers below the floor (e.g. CUDA graph captured buffers).

func (*ArenaPool) SetResetFloor

func (a *ArenaPool) SetResetFloor(floor int)

SetResetFloor sets the minimum offset that Reset will rewind to. Allocations below this offset are preserved across resets. This is used by CUDA graph capture to protect GPU buffers that the captured graph references.

func (*ArenaPool) Stats

func (a *ArenaPool) Stats() (allocations int, totalBytes int)

Stats returns the arena utilization and fallback pool stats.

func (*ArenaPool) UsedBytes

func (a *ArenaPool) UsedBytes() int

UsedBytes returns the current arena offset (bytes in use).

type CUDALib

type CUDALib struct {
	// contains filtered or unexported fields
}

CUDALib holds dlopen handles and resolved function pointers for CUDA runtime functions. All function pointers are resolved at Open() time via dlsym. The actual calls go through platform-specific ccall implementations that do NOT use CGo (zero runtime.cgocall overhead).

func Lib

func Lib() *CUDALib

Lib returns the global CUDALib instance, or nil if CUDA is not available.

func Open

func Open() (*CUDALib, error)

Open loads libcudart via dlopen and resolves all CUDA runtime function pointers via dlsym. Returns an error if CUDA is not available (library not found or symbols missing).

func (*CUDALib) Close

func (lib *CUDALib) Close() error

Close releases the dlopen handle.

func (*CUDALib) GraphAvailable

func (lib *CUDALib) GraphAvailable() bool

GraphAvailable returns true if CUDA graph capture APIs are available.

type Graph

type Graph struct {
	// contains filtered or unexported fields
}

Graph wraps a cudaGraph_t handle.

func StreamEndCapture

func StreamEndCapture(s *Stream) (*Graph, error)

StreamEndCapture stops capturing on the stream and returns the captured graph.

type GraphExec

type GraphExec struct {
	// contains filtered or unexported fields
}

GraphExec wraps a cudaGraphExec_t handle for graph replay.

func GraphInstantiate

func GraphInstantiate(g *Graph) (*GraphExec, error)

GraphInstantiate creates an executable graph from a captured graph. The executable graph can be launched repeatedly without re-capturing.

type MemPool

type MemPool struct {
	// contains filtered or unexported fields
}

MemPool is a per-device, size-bucketed free-list allocator for CUDA device memory. It caches freed allocations by (deviceID, byteSize) for reuse, avoiding the overhead of cudaMalloc/cudaFree on every operation and preventing cross-device pointer reuse in multi-GPU setups.
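Typical pool usage — alloc, free back to the cache, re-alloc a similar size, then drain at shutdown — might look like this sketch (placeholder import path):

```go
package main

import (
	"fmt"

	cuda "example.com/m/cuda" // placeholder import path
)

func main() {
	pool := cuda.NewMemPool()
	cuda.SetDefaultMemPool(pool)
	defer pool.Drain() // return all cached memory to CUDA at shutdown

	// First Alloc of this bucket is a miss (fresh cudaMalloc)...
	ptr, err := pool.Alloc(0, 5000)
	if err != nil {
		fmt.Println(err)
		return
	}
	pool.Free(0, ptr, 5000) // cached for reuse, not freed

	// ...a second Alloc of a similar size hits the same cached bucket,
	// since both 5000 and 6000 round up to the 8 KiB bucket.
	ptr2, _ := pool.Alloc(0, 6000)
	pool.Free(0, ptr2, 6000)

	hits, misses, frees := pool.HitMissStats()
	fmt.Println(hits, misses, frees)
}
```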

func DefaultMemPool

func DefaultMemPool() *MemPool

DefaultMemPool returns a process-wide MemPool singleton. Returns nil if called before SetDefaultMemPool.

func NewMemPool

func NewMemPool() *MemPool

NewMemPool creates a new empty memory pool.

func (*MemPool) Alloc

func (p *MemPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)

Alloc returns a device pointer of at least the given byte size on the specified device. Sizes >= 4KB are rounded up to power-of-2 buckets for better reuse across slightly varying allocation sizes. If a cached allocation exists for the bucket, it is reused. Otherwise SetDevice is called and a fresh cudaMalloc is performed at the bucketed size.
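The documented bucketing policy — sizes of at least 4 KiB rounded up to the next power of two, smaller sizes cached exactly — can be sketched as a standalone function. This mirrors the described behavior and is not the package's actual code:

```go
package main

import (
	"fmt"
	"math/bits"
)

// bucketSize mirrors the documented policy: sizes >= 4 KiB round up to
// the next power of two; smaller sizes keep their exact byte size.
func bucketSize(n int) int {
	if n < 4096 {
		return n
	}
	if n&(n-1) == 0 {
		return n // already a power of two
	}
	return 1 << bits.Len(uint(n))
}

func main() {
	fmt.Println(bucketSize(1000)) // → 1000 (below threshold, unchanged)
	fmt.Println(bucketSize(4096)) // → 4096 (exact power of two)
	fmt.Println(bucketSize(5000)) // → 8192 (rounded up)
}
```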

func (*MemPool) AllocManaged

func (p *MemPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)

AllocManaged returns a unified memory pointer of at least the given byte size. Uses the same power-of-2 bucketing as Alloc.

func (*MemPool) Drain

func (p *MemPool) Drain() error

Drain releases all cached device memory back to CUDA. Iterates all devices, calling SetDevice before freeing each device's pointers. Returns the first error encountered, but attempts to free all pointers.

func (*MemPool) Free

func (p *MemPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)

Free returns a device pointer to the pool for later reuse. The byteSize is bucketed to match the Alloc bucket so the pointer can be found on the next Alloc of a similar size.

func (*MemPool) FreeManaged

func (p *MemPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)

FreeManaged returns a managed memory pointer to the pool for later reuse.

func (*MemPool) HitMissStats

func (p *MemPool) HitMissStats() (hits, misses, frees int64)

HitMissStats returns the cache hit, miss, and free counts since the pool was created. Used for diagnosing pool effectiveness.

func (*MemPool) ResetHitMissStats

func (p *MemPool) ResetHitMissStats()

ResetHitMissStats resets the cache hit/miss/free counters.

func (*MemPool) Stats

func (p *MemPool) Stats() (allocations int, totalBytes int)

Stats returns the number of cached allocations and total cached bytes across all devices.

type MemcpyKind

type MemcpyKind int

MemcpyKind specifies the direction of a memory copy.

const (
	// MemcpyHostToDevice copies from host to device.
	MemcpyHostToDevice MemcpyKind = 1
	// MemcpyDeviceToHost copies from device to host.
	MemcpyDeviceToHost MemcpyKind = 2
	// MemcpyDeviceToDevice copies from device to device.
	MemcpyDeviceToDevice MemcpyKind = 3
)

type Stream

type Stream struct {
	// contains filtered or unexported fields
}

Stream wraps a cudaStream_t handle for asynchronous kernel execution.

func CreateStream

func CreateStream() (*Stream, error)

CreateStream creates a new CUDA stream.

func StreamFromPtr

func StreamFromPtr(ptr unsafe.Pointer) *Stream

StreamFromPtr wraps an existing cudaStream_t handle as a Stream. The caller retains ownership of the handle; Destroy() must NOT be called on the returned Stream (it would destroy the engine's stream).

func (*Stream) Destroy

func (s *Stream) Destroy() error

Destroy releases the CUDA stream.

func (*Stream) Ptr

func (s *Stream) Ptr() unsafe.Pointer

Ptr returns the underlying cudaStream_t as an unsafe.Pointer.

func (*Stream) Synchronize

func (s *Stream) Synchronize() error

Synchronize blocks until all work on this stream completes.

Directories

Path Synopsis
kernels — Package kernels provides Go wrappers for custom CUDA kernels.
