ztensor

Version: v1.6.0 (latest)  Published: Apr 17, 2026  License: Apache-2.0

Repository

github.com/zerfoo/ztensor

ztensor

GPU-accelerated tensor, compute engine, and computation graph library for Go. Zero CGo.

Part of the Zerfoo ML ecosystem.

Features

Multi-type tensors with compile-time type safety via Go generics (float32, float64, float16, bfloat16, float8, integer types)
GPU backends — CUDA (cuBLAS, cuDNN, TensorRT, custom kernels), ROCm (HIP, rocBLAS, MIOpen), and OpenCL (CLBlast), all loaded dynamically via purego (zero CGo)
Computation graphs with fusion passes and CUDA graph capture for optimized inference
CPU SIMD — ARM NEON and x86 AVX2 hand-written assembly for GEMM, RMSNorm, RoPE, SiLU, softmax
Memory management — arena-based GPU memory pools with O(1) per-pass reclamation
Quantized storage — FP8 E4M3/E5M2, FP16, BFloat16 tensor storage with automatic dequantization
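
To make the "O(1) per-pass reclamation" idea concrete, here is a minimal sketch of a bump (arena) allocator in plain Go. It works on host memory and uses a hypothetical `Arena` type invented for this illustration; it is not ztensor's allocator and says nothing about how the library manages device memory internally.

```go
package main

import "fmt"

// Arena is a hypothetical bump allocator over one preallocated buffer.
// It only illustrates the technique; it is not part of ztensor.
type Arena struct {
	buf    []float32
	offset int
}

// NewArena reserves capacity elements up front.
func NewArena(capacity int) *Arena {
	return &Arena{buf: make([]float32, capacity)}
}

// Alloc hands out the next n elements, or reports false when the arena is full.
func (a *Arena) Alloc(n int) ([]float32, bool) {
	if n < 0 || a.offset+n > len(a.buf) {
		return nil, false
	}
	// The three-index slice caps capacity so append cannot spill into a neighbor.
	s := a.buf[a.offset : a.offset+n : a.offset+n]
	a.offset += n
	return s, true
}

// Reset reclaims every allocation at once by rewinding the offset: O(1).
func (a *Arena) Reset() { a.offset = 0 }

// Used reports how many elements are currently handed out.
func (a *Arena) Used() int { return a.offset }

func main() {
	arena := NewArena(8)
	for pass := 0; pass < 2; pass++ {
		x, _ := arena.Alloc(4)
		y, _ := arena.Alloc(4)
		_, ok := arena.Alloc(1) // the arena is full at this point
		fmt.Println(len(x), len(y), ok, arena.Used()) // 4 4 false 8
		arena.Reset() // one assignment frees the whole pass
	}
}
```

The cost of `Reset` does not depend on how many allocations were made during the pass, which is the property the feature list refers to.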

Installation

go get github.com/zerfoo/ztensor

No CGo required. GPU backends are discovered and loaded at runtime via dlopen/purego.

Quick Start

package main

import (
    "context"
    "fmt"

    "github.com/zerfoo/ztensor/compute"
    "github.com/zerfoo/ztensor/numeric"
    "github.com/zerfoo/ztensor/tensor"
)

func main() {
    ctx := context.Background()

    // Create a CPU compute engine for float32
    eng := compute.NewCPUEngine[float32](numeric.Float32Ops{})

    // Create two tensors
    a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
    b, _ := tensor.New[float32]([]int{3, 2}, []float32{1, 2, 3, 4, 5, 6})

    // Matrix multiplication
    c, _ := eng.MatMul(ctx, a, b)
    fmt.Println(c.Shape()) // [2 2]
    fmt.Println(c.Data())  // [22 28 49 64]

    // Element-wise operations
    x, _ := tensor.New[float32]([]int{2, 2}, []float32{1, 2, 3, 4})
    y, _ := tensor.New[float32]([]int{2, 2}, []float32{5, 6, 7, 8})
    sum, _ := eng.Add(ctx, x, y)
    fmt.Println(sum.Data()) // [6 8 10 12]
}
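
The values in the comments above can be checked by hand with a stdlib-only reference implementation. The sketch below assumes row-major data layout, which is consistent with the output shown; `matMul` is a helper written for this illustration, not a ztensor function.

```go
package main

import "fmt"

// matMul multiplies a row-major (m x k) matrix by a row-major (k x n) matrix
// and returns the row-major (m x n) result.
func matMul(a, b []float32, m, k, n int) []float32 {
	out := make([]float32, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var acc float32
			for p := 0; p < k; p++ {
				acc += a[i*k+p] * b[p*n+j]
			}
			out[i*n+j] = acc
		}
	}
	return out
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6} // shape [2, 3]
	b := []float32{1, 2, 3, 4, 5, 6} // shape [3, 2]
	fmt.Println(matMul(a, b, 2, 3, 2)) // [22 28 49 64]
}
```

For example, the top-left entry is 1*1 + 2*3 + 3*5 = 22.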

GPU Backend Example

GPU libraries are loaded at runtime via purego — no CGo, no build tags, no linking. If CUDA/ROCm/OpenCL is not available, the engine constructor returns an error and you fall back to CPU.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/zerfoo/ztensor/compute"
    "github.com/zerfoo/ztensor/numeric"
    "github.com/zerfoo/ztensor/tensor"
)

func main() {
    ctx := context.Background()

    // Try CUDA first, fall back to CPU
    eng, err := compute.NewGPUEngine[float32](numeric.Float32Ops{})
    if err != nil {
        fmt.Println("CUDA not available, using CPU:", err)
        cpuEng := compute.NewCPUEngine[float32](numeric.Float32Ops{})
        run(ctx, cpuEng)
        return
    }
    run(ctx, eng)
}

func run(ctx context.Context, eng compute.Engine[float32]) {
    a, _ := tensor.New[float32]([]int{2, 3}, []float32{1, 2, 3, 4, 5, 6})
    b, _ := tensor.New[float32]([]int{3, 2}, []float32{1, 2, 3, 4, 5, 6})
    c, _ := eng.MatMul(ctx, a, b)
    fmt.Println(c.Data()) // [22 28 49 64]
}

Other GPU backends follow the same pattern:

// ROCm (AMD GPUs)
eng, err := compute.NewROCmEngine[float32](numeric.Float32Ops{})

// OpenCL (cross-vendor)
eng, err := compute.NewOpenCLEngine[float32](numeric.Float32Ops{})
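
When several backends may be present, the same fallback idea extends to an ordered list of candidates. The sketch below is stdlib-only: `backend`, `pickBackend`, `available`, and `unavailable` are hypothetical names for this illustration, and in real code each `open` function would wrap one of the engine constructors shown above.

```go
package main

import (
	"errors"
	"fmt"
)

// backend pairs a label with a constructor. Both are hypothetical stand-ins
// and are not part of ztensor.
type backend struct {
	name string
	open func() error
}

// Stand-in constructors: one always fails, one always succeeds.
func unavailable() error { return errors.New("library not found") }
func available() error   { return nil }

// pickBackend returns the name of the first backend whose constructor succeeds.
// If every candidate fails, it returns all of the collected errors joined.
func pickBackend(candidates []backend) (string, error) {
	var errs []error
	for _, b := range candidates {
		if err := b.open(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", b.name, err))
			continue
		}
		return b.name, nil
	}
	if len(errs) == 0 {
		return "", errors.New("no backends configured")
	}
	return "", errors.Join(errs...)
}

func main() {
	name, err := pickBackend([]backend{
		{"cuda", unavailable},
		{"rocm", unavailable},
		{"opencl", unavailable},
		{"cpu", available},
	})
	fmt.Println(name, err) // cpu <nil>
}
```

Keeping the CPU entry last means the chain always ends in a backend that needs no external library.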

Type Safety with Generics

The tensor.Numeric type constraint ensures compile-time type safety across all supported numeric types:

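// Works with any Numeric type. For a [1, n] row vector a and an [n, 1] column
// vector b, the matrix product is their dot product, returned as a [1, 1] tensor.
func dotProduct[T tensor.Numeric](ctx context.Context, eng compute.Engine[T], a, b *tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error) {
    return eng.MatMul(ctx, a, b)
}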

Supported types include float32, float64, float16.Float16, float16.BFloat16, float8.Float8, and all Go integer types.
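
As a rough illustration of how a type constraint gives compile-time safety, the stdlib-only sketch below defines its own constraint. `Number` is a hypothetical stand-in; the actual definition of tensor.Numeric is not shown in this README and may differ.

```go
package main

import "fmt"

// Number is a hypothetical stand-in for a constraint such as tensor.Numeric.
type Number interface {
	~int | ~int32 | ~int64 | ~float32 | ~float64
}

// dot computes the inner product of two equal-length slices of the same type.
func dot[T Number](a, b []T) (T, error) {
	var sum T
	if len(a) != len(b) {
		return sum, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum, nil
}

func main() {
	f, _ := dot([]float32{1, 2, 3}, []float32{4, 5, 6})
	i, _ := dot([]int{1, 2, 3}, []int{4, 5, 6})
	fmt.Println(f, i) // 32 32

	// dot([]float32{1}, []int{1}) would not compile:
	// T cannot be both float32 and int.
}
```

Mixing element types is rejected by the compiler rather than discovered at runtime, which is the guarantee the generic API relies on.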

Use Cases

ML inference engines — ztensor powers the zerfoo inference runtime for transformer models
Scientific computing — GPU-accelerated linear algebra with automatic backend selection
GPU compute from Go — use CUDA/ROCm/OpenCL from pure Go without CGo or build tags
Custom ML operators — build neural network layers on top of the compute.Engine interface

Package Overview

Package	Description
`tensor/`	Multi-type tensor storage — CPU, GPU, quantized (FP8, FP16, BFloat16)
`compute/`	Compute engine interface with CPU, CUDA, ROCm, and OpenCL implementations
`graph/`	Computation graph compiler with operator fusion and CUDA graph capture
`numeric/`	Type-safe `Arithmetic[T]` interface for all numeric types
`device/`	Device abstraction and memory allocators
`types/`	Shared type definitions
`log/`	Structured logging interface
`metrics/`	Performance metrics and profiling
`internal/cuda/`	Zero-CGo CUDA runtime bindings via purego, 25+ custom kernels
`internal/xblas/`	ARM NEON and x86 AVX2 SIMD assembly (GEMM, RMSNorm, RoPE, SiLU, softmax)
`internal/gpuapi/`	GPU Runtime Abstraction Layer — unified adapter for CUDA, ROCm, OpenCL
`internal/codegen/`	Megakernel code generator

Dependencies

ztensor depends on:

float16 — IEEE 754 half-precision and BFloat16 arithmetic
float8 — FP8 E4M3FN arithmetic for quantized inference
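
For readers unfamiliar with BFloat16, the format is the upper 16 bits of an IEEE 754 float32, so widening back to float32 is a single shift. The sketch below is stdlib-only and uses hypothetical helpers `toBF16` and `fromBF16`; it truncates instead of rounding and is not the float16 package's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// toBF16 narrows a float32 to bfloat16 by keeping its upper 16 bits
// (1 sign bit, 8 exponent bits, 7 mantissa bits). It truncates rather than
// rounding to nearest-even, which keeps the sketch short but loses accuracy.
func toBF16(f float32) uint16 {
	return uint16(math.Float32bits(f) >> 16)
}

// fromBF16 widens bfloat16 back to float32 by zero-filling the low 16 bits.
func fromBF16(h uint16) float32 {
	return math.Float32frombits(uint32(h) << 16)
}

func main() {
	for _, v := range []float32{1.0, -2.5, 3.14159} {
		q := toBF16(v)
		fmt.Printf("%v -> 0x%04x -> %v\n", v, q, fromBF16(q))
	}
}
```

Values such as 1.0 and -2.5 fit in 7 mantissa bits and round-trip exactly, while 3.14159 comes back as 3.140625.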

ztensor is used by:

zerfoo — ML inference, training, and serving framework

License

Apache 2.0

Directories

Path	Synopsis
batched	Package batched provides batched multi-model inference, enabling 1000+ per-source models that share the same architecture to run in a single batched GEMM call rather than N sequential matrix multiplications.
compute	Package compute implements tensor computation engines and operations.
device	Package device provides device abstraction and memory allocation interfaces.
gguf	Package gguf provides a shared GGUF v3 writer for serializing model files.
graph	Package graph provides a computational graph abstraction.
kv	Package kv provides paged-attention KV cache primitives for transformer inference.
internal/clblast	Package clblast provides Go wrappers for the CLBlast BLAS library.
internal/codegen	Package codegen generates CUDA megakernel source code from a compiled ExecutionPlan instruction tape.
internal/cublas	Package cublas provides low-level purego bindings for the cuBLAS library.
internal/cuda	Package cuda provides low-level bindings for the CUDA runtime API using dlopen/dlsym (no CGo).
internal/cuda/kernels	Package kernels provides Go wrappers for custom CUDA kernels.
internal/cudnn	Package cudnn provides purego bindings for the NVIDIA cuDNN library.
internal/fpga	Package fpga provides zero-CGo bindings to FPGA runtime libraries via purego/dlopen.
internal/gpuapi	Package gpuapi defines internal interfaces for GPU runtime operations.
internal/hip	Package hip provides low-level bindings for the AMD HIP runtime API using purego dlopen.
internal/hip/kernels	Package kernels provides Go wrappers for custom HIP kernels via purego dlopen.
internal/metal	Package metal provides zero-CGo bindings to Apple's Metal and Metal Performance Shaders (MPS) frameworks via purego/dlopen.
internal/miopen	Package miopen provides low-level bindings for the AMD MIOpen library using purego dlopen.
internal/nccl	Package nccl provides a zero-CGo binding for the NVIDIA Collective Communications Library (NCCL).
internal/opencl	Package opencl provides Go wrappers for the OpenCL 2.0 runtime API.
internal/opencl/kernels	Package kernels provides OpenCL kernel source and dispatch for elementwise operations.
internal/pjrt	Package pjrt provides purego bindings for the PJRT C API.
internal/rocblas	Package rocblas provides low-level bindings for the AMD rocBLAS library using purego dlopen.
internal/stablehlo	Package stablehlo provides StableHLO MLIR text emission for the PJRT backend.
internal/sycl	Package sycl provides zero-CGo bindings to the SYCL runtime via purego/dlopen.
internal/tensorrt	Package tensorrt provides bindings for the NVIDIA TensorRT inference library via purego (dlopen/dlsym, no CGo).
internal/workerpool	Package workerpool provides a persistent pool of goroutines that process submitted tasks.
internal/xblas
log	Package log provides a structured, leveled logging abstraction.
metrics/runtime	Package runtime provides a backend-agnostic metrics collection abstraction for runtime observability.
numeric	Package numeric provides precision types, arithmetic operations, and generic constraints for the Zerfoo ML framework.
tensor	Package tensor provides a multi-dimensional array (tensor) implementation.
testing
testutils
types	Package types contains shared, fundamental types for the Zerfoo framework.