Documentation ¶
Index ¶
- func GemmF8(m, n, k int, a, b, c []float8.Float8)
- func GemmF16(m, n, k int, a, b, c []float16.Float16)
- func GemmF32(m, n, k int, a, b, c []float32)
- func GemmF32Q4NT(m, n, k int, a []float32, b *tensor.Q4Storage, c []float32)
- func GemmF32Q8NT(m, n, k int, a []float32, b *tensor.Q8Storage, c []float32)
- func GemmF64(m, n, k int, a, b, c []float64)
- func GemmQ4F32(m, n, k int, a *tensor.Q4Storage, b, c []float32)
- func GemmQ4F32Fused(m, n, k int, a *tensor.Q4Storage, b, c []float32)
- func GemmQ8F32(m, n, k int, a *tensor.Q8Storage, b, c []float32)
- func InitPool(n int)
- func RMSNormF32(out, x, weight *float32, dim int, eps float32, scale *float32)
- func RoPEF32(out, in, cos, sin *float32, halfDim, headDim int)
- func SgemmSimd(m, n, k int, a, b, c []float32)
- func ShutdownPool()
- func SiLUF32(out, x *float32, n int)
- func SiLUGateF32(out, gate, up *float32, n int)
- func SoftmaxF32(data *float32, n int)
- func VaddF32(out, a, b *float32, n int)
- func VaddScalarF32(out, a *float32, scalar float32, n int)
- func VdivF32(out, a, b *float32, n int)
- func VdivScalarF32(out, a *float32, scalar float32, n int)
- func VexpF32(out, x *float32, n int)
- func VmulF32(out, a, b *float32, n int)
- func VmulScalarF32(out, a *float32, scalar float32, n int)
- func VsubF32(out, a, b *float32, n int)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GemmF32 ¶
func GemmF32(m, n, k int, a, b, c []float32)
GemmF32 computes C = A * B for row-major contiguous matrices. A has shape (m, k), B has shape (k, n), C has shape (m, n). Strides are assumed to be k for A and n for B and C. Uses a SIMD-accelerated kernel (AVX2 on amd64, NEON on arm64) when available.
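The row-major layout and stride convention can be illustrated with a plain-Go reference (naive triple loop, no SIMD); `gemmRef` is a name invented here for illustration, not part of this package:

```go
package main

import "fmt"

// gemmRef is a naive reference for GemmF32's contract:
// C = A * B, all matrices row-major and contiguous.
// A is (m, k) with stride k; B is (k, n) and C is (m, n), both with stride n.
func gemmRef(m, n, k int, a, b, c []float32) {
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[p*n+j]
			}
			c[i*n+j] = sum
		}
	}
}

func main() {
	// (2x2) * (2x2): [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
	a := []float32{1, 2, 3, 4}
	b := []float32{5, 6, 7, 8}
	c := make([]float32, 4)
	gemmRef(2, 2, 2, a, b, c)
	fmt.Println(c)
}
```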
func GemmF32Q4NT ¶
func GemmF32Q4NT(m, n, k int, a []float32, b *tensor.Q4Storage, c []float32)
GemmF32Q4NT computes C = A * B^T where A is float32 [M,K] and B is Q4_0 [N,K]. B is stored in row-major Q4 format: each row j of B (length K) is contiguous in Q4 blocks. The "NT" suffix means B is Not Transposed — the caller passes B in its original [N,K] layout and this function computes the transpose implicitly. K must be a multiple of 32; otherwise it falls back to dequant+transpose+SGEMM.
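The "NT" index math can be shown with a plain-float32 reference that skips quantization entirely: since B stays in its [N,K] layout, row j of B is dotted directly against row i of A. `gemmNTRef` is an illustrative name, not part of this package:

```go
package main

import "fmt"

// gemmNTRef illustrates the "NT" contract with quantization omitted:
// C = A * B^T, where A is [M,K] row-major and B is kept in its original
// [N,K] row-major layout, so b[j*k+p] (not b[p*n+j]) is the element
// multiplied against a[i*k+p].
func gemmNTRef(m, n, k int, a, b, c []float32) {
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[j*k+p] // row of A dotted with row of B
			}
			c[i*n+j] = sum
		}
	}
}

func main() {
	// A = [[1,2]] (M=1, K=2), B = [[3,4],[5,6]] (N=2, K=2)
	// C = A * B^T = [[1*3+2*4, 1*5+2*6]] = [[11, 17]]
	c := make([]float32, 2)
	gemmNTRef(1, 2, 2, []float32{1, 2}, []float32{3, 4, 5, 6}, c)
	fmt.Println(c)
}
```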
func GemmF32Q8NT ¶
func GemmF32Q8NT(m, n, k int, a []float32, b *tensor.Q8Storage, c []float32)
GemmF32Q8NT computes C = A * B^T where A is float32 [M,K] and B is Q8_0 [N,K]. B is stored in row-major Q8 format: each row j of B (length K) is contiguous in Q8 blocks. The "NT" suffix means B is Not Transposed. K must be a multiple of 32; otherwise it falls back to dequant+transpose+SGEMM.
func GemmQ4F32 ¶
func GemmQ4F32(m, n, k int, a *tensor.Q4Storage, b, c []float32)
GemmQ4F32 computes C = A * B where A is Q4_0 quantized and B, C are float32. A has logical shape (m, k), B has shape (k, n), C has shape (m, n). Uses the fused dequant+multiply path that avoids heap-allocating the full dequantized A matrix. Falls back to dequant+SgemmSimd for non-32-aligned K.
func GemmQ4F32Fused ¶
func GemmQ4F32Fused(m, n, k int, a *tensor.Q4Storage, b, c []float32)
GemmQ4F32Fused computes C = dequant(A) * B with fused dequant+multiply. Instead of allocating a full M*K dequantized buffer, it dequantizes one Q4 block (32 values = 128 bytes) at a time into a stack buffer, then multiplies using SIMD-accelerated sgemmAccRow. This eliminates the O(M*K) heap allocation of GemmQ4F32, making it faster for decode (M=1) where the dequant allocation dominates, and equally fast for larger M. K must be a multiple of 32; falls back to GemmQ4F32 otherwise.
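The fused idea can be sketched in plain Go. The block layout below is an assumption for illustration (a common Q4_0 shape: 32 values per block as one scale plus 16 bytes of packed 4-bit quants), not necessarily this package's tensor.Q4Storage, and `fusedRow` is an invented name:

```go
package main

import "fmt"

// q4Block mirrors a common Q4_0 block layout (an assumption here, not
// necessarily tensor.Q4Storage): 32 values per block, stored as one
// scale plus 16 bytes of packed 4-bit quants, value = (nibble - 8) * scale.
type q4Block struct {
	scale float32
	qs    [16]byte
}

// fusedRow sketches the fused dequant+multiply path for one row of A:
// each 32-value block is dequantized into a small stack buffer and
// immediately multiplied against B ([K,N] row-major, K = 32*len(blocks)),
// accumulating into out (length n). No M*K dequantized copy of A is
// ever allocated.
func fusedRow(blocks []q4Block, b, out []float32, n int) {
	var buf [32]float32 // one dequantized block at a time
	for bi, blk := range blocks {
		for i := 0; i < 16; i++ {
			buf[i] = float32(int(blk.qs[i]&0x0F)-8) * blk.scale
			buf[i+16] = float32(int(blk.qs[i]>>4)-8) * blk.scale
		}
		base := bi * 32
		for i, av := range buf {
			row := (base + i) * n
			for j := 0; j < n; j++ {
				out[j] += av * b[row+j]
			}
		}
	}
}

func main() {
	// One block, scale 1, all quants zero -> 32 dequantized values of -8.
	// B is a [32,1] column of ones, so out[0] = 32 * -8 = -256.
	b := make([]float32, 32)
	for i := range b {
		b[i] = 1
	}
	out := make([]float32, 1)
	fusedRow([]q4Block{{scale: 1}}, b, out, 1)
	fmt.Println(out)
}
```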
func GemmQ8F32 ¶
func GemmQ8F32(m, n, k int, a *tensor.Q8Storage, b, c []float32)
GemmQ8F32 computes C = A * B where A is Q8_0 quantized and B, C are float32. Dequantizes A once upfront, then uses SIMD-accelerated SGEMM.
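The upfront dequantization step can be sketched as follows. The Q8_0 layout assumed here (blocks of 32 int8 quants sharing one float32 scale, value = quant * scale) is a common convention, not necessarily this package's tensor.Q8Storage, and `dequantQ8Ref` is an invented name:

```go
package main

import "fmt"

// dequantQ8Ref sketches the one-time dequantization GemmQ8F32 performs
// before handing the result to the SIMD SGEMM: each int8 quant is scaled
// by its block's shared float32 scale (one scale per 32 quants).
func dequantQ8Ref(scales []float32, quants []int8) []float32 {
	out := make([]float32, len(quants))
	for i, q := range quants {
		out[i] = float32(q) * scales[i/32]
	}
	return out
}

func main() {
	quants := make([]int8, 32)
	quants[0], quants[1] = -4, 7
	// With scale 0.5: -4 -> -2, 7 -> 3.5
	fmt.Println(dequantQ8Ref([]float32{0.5}, quants)[:2])
}
```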
func InitPool ¶
func InitPool(n int)
InitPool creates the shared worker pool used by parallel GEMV routines. Call once from the engine constructor. n is the number of workers.
func SgemmSimd ¶
func SgemmSimd(m, n, k int, a, b, c []float32)
SgemmSimd computes C = A*B using AVX2-accelerated operations. A is m×k, B is k×n, C is m×n. All row-major.
func ShutdownPool ¶
func ShutdownPool()
ShutdownPool closes the shared worker pool. Safe to call if not initialized.
func SiLUGateF32 ¶
func SiLUGateF32(out, gate, up *float32, n int)
SiLUGateF32 computes silu(gate) * up for n float32 values (SwiGLU operation).
func SoftmaxF32 ¶
func SoftmaxF32(data *float32, n int)
func VaddScalarF32 ¶
func VaddScalarF32(out, a *float32, scalar float32, n int)
VaddScalarF32 computes out[i] = a[i] + scalar for n float32 values.
func VdivScalarF32 ¶
func VdivScalarF32(out, a *float32, scalar float32, n int)
VdivScalarF32 computes out[i] = a[i] / scalar for n float32 values.
func VmulScalarF32 ¶
func VmulScalarF32(out, a *float32, scalar float32, n int)
VmulScalarF32 computes out[i] = a[i] * scalar for n float32 values.
Types ¶
This section is empty.