Documentation ¶
Overview ¶
Package cuda provides low-level bindings for the CUDA runtime API using dlopen/dlsym (no CGo). CUDA availability is detected at runtime; when libcudart is not loadable the package functions return descriptive errors.
Index ¶
- func Available() bool
- func Ccall(fn uintptr, args ...uintptr) uintptr
- func DeviceComputeCapability(deviceID int) (major, minor int, err error)
- func DeviceGetAttribute(attr, deviceID int) (int, error)
- func DlopenKernels() (uintptr, error)
- func DlopenPath(path string) (uintptr, error)
- func Dlsym(handle uintptr, name string) (uintptr, error)
- func Free(devPtr unsafe.Pointer) error
- func GetDeviceCount() (int, error)
- func GraphDestroy(g *Graph) error
- func GraphExecDestroy(ge *GraphExec) error
- func GraphLaunch(ge *GraphExec, s *Stream) error
- func Malloc(size int) (unsafe.Pointer, error)
- func MallocManaged(size int) (unsafe.Pointer, error)
- func ManagedMemorySupported(deviceID int) bool
- func Memcpy(dst, src unsafe.Pointer, count int, kind MemcpyKind) error
- func MemcpyAsync(dst, src unsafe.Pointer, count int, kind MemcpyKind, stream *Stream) error
- func MemcpyPeer(dst unsafe.Pointer, dstDevice int, src unsafe.Pointer, srcDevice int, ...) error
- func SetDefaultArenaPool(a *ArenaPool)
- func SetDefaultMemPool(p *MemPool)
- func SetDevice(deviceID int) error
- func StreamBeginCapture(s *Stream) error
- type ArenaPool
- func (a *ArenaPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
- func (a *ArenaPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
- func (a *ArenaPool) Capacity() int
- func (a *ArenaPool) Drain() error
- func (a *ArenaPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (a *ArenaPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (a *ArenaPool) HitMissStats() (hits, misses, resets int64)
- func (a *ArenaPool) IsManaged() bool
- func (a *ArenaPool) Reset()
- func (a *ArenaPool) SetResetFloor(floor int)
- func (a *ArenaPool) Stats() (allocations int, totalBytes int)
- func (a *ArenaPool) UsedBytes() int
- type CUDALib
- type Graph
- type GraphExec
- type MemPool
- func (p *MemPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
- func (p *MemPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
- func (p *MemPool) Drain() error
- func (p *MemPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (p *MemPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
- func (p *MemPool) HitMissStats() (hits, misses, frees int64)
- func (p *MemPool) ResetHitMissStats()
- func (p *MemPool) Stats() (allocations int, totalBytes int)
- type MemcpyKind
- type Stream
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Available ¶
func Available() bool
Available returns true if the CUDA runtime is loadable on this machine. The result is cached after the first call.
func Ccall ¶
func Ccall(fn uintptr, args ...uintptr) uintptr
Ccall calls a C function pointer with up to 12 arguments using the platform-specific zero-CGo mechanism. Exported for use by the kernels package.
func DeviceComputeCapability ¶
func DeviceComputeCapability(deviceID int) (major, minor int, err error)
DeviceComputeCapability returns the major and minor compute capability.
func DeviceGetAttribute ¶
func DeviceGetAttribute(attr, deviceID int) (int, error)
DeviceGetAttribute queries a device attribute.
func DlopenKernels ¶
func DlopenKernels() (uintptr, error)
DlopenKernels loads the custom kernels shared library (libkernels.so) and returns the dlopen handle. Returns an error if the library cannot be found.
func DlopenPath ¶
func DlopenPath(path string) (uintptr, error)
DlopenPath opens a shared library at the given path via dlopen. Returns the handle or an error if the library cannot be loaded.
func Dlsym ¶
func Dlsym(handle uintptr, name string) (uintptr, error)
Dlsym resolves a symbol from a dlopen handle. Returns the function pointer address or an error if the symbol is not found.
func GetDeviceCount ¶
func GetDeviceCount() (int, error)
GetDeviceCount returns the number of CUDA-capable devices.
func GraphExecDestroy ¶
func GraphExecDestroy(ge *GraphExec) error
GraphExecDestroy releases an executable graph.
func GraphLaunch ¶
func GraphLaunch(ge *GraphExec, s *Stream) error
GraphLaunch launches an executable graph on the given stream. This replays the entire captured sequence of operations with minimal overhead.
func MallocManaged ¶
func MallocManaged(size int) (unsafe.Pointer, error)
MallocManaged allocates size bytes of unified memory accessible from both host and device.
func ManagedMemorySupported ¶
func ManagedMemorySupported(deviceID int) bool
ManagedMemorySupported returns true if the device supports unified (managed) memory with concurrent access from CPU and GPU. On GB10 with NVLink-C2C and shared LPDDR5x, this avoids all explicit H2D/D2H copies.
func Memcpy ¶
func Memcpy(dst, src unsafe.Pointer, count int, kind MemcpyKind) error
Memcpy copies count bytes between host and device memory.
func MemcpyAsync ¶
func MemcpyAsync(dst, src unsafe.Pointer, count int, kind MemcpyKind, stream *Stream) error
MemcpyAsync copies count bytes asynchronously on the given stream.
func MemcpyPeer ¶
func MemcpyPeer(dst unsafe.Pointer, dstDevice int, src unsafe.Pointer, srcDevice int, count int) error
MemcpyPeer copies count bytes between devices using peer-to-peer transfer.
func SetDefaultArenaPool ¶
func SetDefaultArenaPool(a *ArenaPool)
SetDefaultArenaPool registers an ArenaPool as the process-wide default.
func SetDefaultMemPool ¶
func SetDefaultMemPool(p *MemPool)
SetDefaultMemPool registers a MemPool as the process-wide default. Typically called by GPUEngine during initialization.
func StreamBeginCapture ¶
func StreamBeginCapture(s *Stream) error
StreamBeginCapture starts capturing GPU operations on the given stream. All kernel launches, memcpys, and other operations on the stream are recorded into a graph instead of being executed.
Types ¶
type ArenaPool ¶
type ArenaPool struct {
// contains filtered or unexported fields
}
ArenaPool is a bump-pointer allocator backed by a single large CUDA allocation. Each Alloc advances the offset within the pre-allocated region. Free is a no-op for individual pointers. Call Reset() between forward passes to reclaim all arena memory at once (zero-cost compared to per-pointer free).
This eliminates cudaMalloc/cudaFree overhead during inference, which is the #1 bottleneck for per-token latency on the DGX Spark GPU.
On devices with concurrent managed memory support (e.g., GB10 with NVLink-C2C and shared LPDDR5x), the arena is allocated with cudaMallocManaged. This makes the arena accessible from both CPU and GPU without explicit H2D copies.
Weight tensors and KV cache should NOT use the arena (they persist across passes). The arena is only for per-pass intermediates.
func DefaultArenaPool ¶
func DefaultArenaPool() *ArenaPool
DefaultArenaPool returns the process-wide ArenaPool singleton, or nil if none has been registered.
func NewArenaPool ¶
NewArenaPool allocates a contiguous GPU region of the given capacity bytes on the specified device. A fallback MemPool handles any overflow. On devices with concurrent managed memory support, the arena uses cudaMallocManaged to enable zero-copy CPU/GPU access.
func (*ArenaPool) Alloc ¶
func (a *ArenaPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
Alloc returns a device pointer of at least byteSize bytes from the arena. Allocations are 256-byte aligned for GPU coalescing. If the arena is full, the allocation falls back to the MemPool.
func (*ArenaPool) AllocManaged ¶
func (a *ArenaPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
AllocManaged delegates to the fallback MemPool (arena is device-only).
func (*ArenaPool) Free ¶
func (a *ArenaPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
Free is a no-op for arena pointers (reclaimed in bulk via Reset). Fallback pointers are returned to the MemPool.
func (*ArenaPool) FreeManaged ¶
func (a *ArenaPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
FreeManaged delegates to the fallback MemPool.
func (*ArenaPool) HitMissStats ¶
func (a *ArenaPool) HitMissStats() (hits, misses, resets int64)
HitMissStats returns arena hits, fallback misses, and reset count.
func (*ArenaPool) IsManaged ¶
func (a *ArenaPool) IsManaged() bool
IsManaged returns true if the arena was allocated with managed memory.
func (*ArenaPool) Reset ¶
func (a *ArenaPool) Reset()
Reset rewinds the arena offset to the reset floor (default 0), reclaiming per-pass allocations while preserving buffers below the floor (e.g. CUDA graph captured buffers).
func (*ArenaPool) SetResetFloor ¶
func (a *ArenaPool) SetResetFloor(floor int)
SetResetFloor sets the minimum offset that Reset will rewind to. Allocations below this offset are preserved across resets. This is used by CUDA graph capture to protect GPU buffers that the captured graph references.
type CUDALib ¶
type CUDALib struct {
// contains filtered or unexported fields
}
CUDALib holds dlopen handles and resolved function pointers for CUDA runtime functions. All function pointers are resolved at Open() time via dlsym. The actual calls go through platform-specific ccall implementations that do NOT use CGo (zero runtime.cgocall overhead).
func Lib ¶
func Lib() *CUDALib
Lib returns the global CUDALib instance, or nil if CUDA is not available.
func Open ¶
Open loads libcudart via dlopen and resolves all CUDA runtime function pointers via dlsym. Returns an error if CUDA is not available (library not found or symbols missing).
func (*CUDALib) GraphAvailable ¶
GraphAvailable returns true if CUDA graph capture APIs are available.
type Graph ¶
type Graph struct {
// contains filtered or unexported fields
}
Graph wraps a cudaGraph_t handle.
func StreamEndCapture ¶
StreamEndCapture stops capturing on the stream and returns the captured graph.
type GraphExec ¶
type GraphExec struct {
// contains filtered or unexported fields
}
GraphExec wraps a cudaGraphExec_t handle for graph replay.
func GraphInstantiate ¶
GraphInstantiate creates an executable graph from a captured graph. The executable graph can be launched repeatedly without re-capturing.
type MemPool ¶
type MemPool struct {
// contains filtered or unexported fields
}
MemPool is a per-device, size-bucketed free-list allocator for CUDA device memory. It caches freed allocations by (deviceID, byteSize) for reuse, avoiding the overhead of cudaMalloc/cudaFree on every operation and preventing cross-device pointer reuse in multi-GPU setups.
func DefaultMemPool ¶
func DefaultMemPool() *MemPool
DefaultMemPool returns a process-wide MemPool singleton. Returns nil if called before SetDefaultMemPool.
func (*MemPool) Alloc ¶
func (p *MemPool) Alloc(deviceID, byteSize int) (unsafe.Pointer, error)
Alloc returns a device pointer of at least the given byte size on the specified device. Sizes >= 4KB are rounded up to power-of-2 buckets for better reuse across slightly varying allocation sizes. If a cached allocation exists for the bucket, it is reused. Otherwise SetDevice is called and a fresh cudaMalloc is performed at the bucketed size.
func (*MemPool) AllocManaged ¶
func (p *MemPool) AllocManaged(deviceID, byteSize int) (unsafe.Pointer, error)
AllocManaged returns a unified memory pointer of at least the given byte size. Uses the same power-of-2 bucketing as Alloc.
func (*MemPool) Drain ¶
func (p *MemPool) Drain() error
Drain releases all cached device memory back to CUDA. Iterates all devices, calling SetDevice before freeing each device's pointers. Returns the first error encountered, but attempts to free all pointers.
func (*MemPool) Free ¶
func (p *MemPool) Free(deviceID int, ptr unsafe.Pointer, byteSize int)
Free returns a device pointer to the pool for later reuse. The byteSize is bucketed to match the Alloc bucket so the pointer can be found on the next Alloc of a similar size.
func (*MemPool) FreeManaged ¶
func (p *MemPool) FreeManaged(deviceID int, ptr unsafe.Pointer, byteSize int)
FreeManaged returns a managed memory pointer to the pool for later reuse.
func (*MemPool) HitMissStats ¶
func (p *MemPool) HitMissStats() (hits, misses, frees int64)
HitMissStats returns the cache hit, miss, and free counts since the pool was created. Used for diagnosing pool effectiveness.
func (*MemPool) ResetHitMissStats ¶
func (p *MemPool) ResetHitMissStats()
ResetHitMissStats resets the cache hit/miss/free counters.
type MemcpyKind ¶
type MemcpyKind int
MemcpyKind specifies the direction of a memory copy.
const (
	// MemcpyHostToDevice copies from host to device.
	MemcpyHostToDevice MemcpyKind = 1
	// MemcpyDeviceToHost copies from device to host.
	MemcpyDeviceToHost MemcpyKind = 2
	// MemcpyDeviceToDevice copies from device to device.
	MemcpyDeviceToDevice MemcpyKind = 3
)
type Stream ¶
type Stream struct {
// contains filtered or unexported fields
}
Stream wraps a cudaStream_t handle for asynchronous kernel execution.
func StreamFromPtr ¶
StreamFromPtr wraps an existing cudaStream_t handle as a Stream. The caller retains ownership of the handle; Destroy() must NOT be called on the returned Stream (it would destroy the engine's stream).
func (*Stream) Synchronize ¶
Synchronize blocks until all work on this stream completes.