xblas

package
v0.1.0
Published: Mar 16, 2026 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GemmF8

func GemmF8(m, n, k int, a, b, c []float8.Float8)

GemmF8 computes C = A * B for Float8 by converting through float32 SGEMM.

func GemmF16

func GemmF16(m, n, k int, a, b, c []float16.Float16)

GemmF16 computes C = A * B for Float16 by converting through float32 SGEMM.

func GemmF32

func GemmF32(m, n, k int, a, b, c []float32)

GemmF32 computes C = A * B for row-major contiguous matrices. A has shape (m, k), B has shape (k, n), C has shape (m, n). Strides are assumed to be k for A and n for B and C. Uses SIMD-accelerated kernel (AVX2 on amd64, NEON on arm64) when available.
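The row-major contract above can be pinned down with a naive scalar reference. This is not the package's SIMD kernel, just a sketch of the semantics GemmF32 documents: A is (m, k) with stride k, B is (k, n) with stride n, C is (m, n) with stride n.

```go
package main

import "fmt"

// naiveGemmF32 is a scalar reference for the row-major C = A * B
// contract described above: A is (m,k), B is (k,n), C is (m,n),
// with strides k, n, and n respectively. A SIMD kernel should agree
// with it up to float32 rounding.
func naiveGemmF32(m, n, k int, a, b, c []float32) {
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[p*n+j]
			}
			c[i*n+j] = sum
		}
	}
}

func main() {
	// 2x3 times 3x2 example.
	a := []float32{1, 2, 3, 4, 5, 6}
	b := []float32{7, 8, 9, 10, 11, 12}
	c := make([]float32, 4)
	naiveGemmF32(2, 2, 3, a, b, c)
	fmt.Println(c) // [58 64 139 154]
}
```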

func GemmF32Q4NT

func GemmF32Q4NT(m, n, k int, a []float32, b *tensor.Q4Storage, c []float32)

GemmF32Q4NT computes C = A * B^T where A is float32 [M,K] and B is Q4_0 [N,K]. B is stored in row-major Q4 format: each row j of B (length K) is contiguous in Q4 blocks. The "NT" suffix means B is Not Transposed — the caller passes B in its original [N,K] layout and this function computes the transpose implicitly. The fast path requires K to be a multiple of 32; otherwise it falls back to dequant+transpose+SGEMM.
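The "contiguous in Q4 blocks" layout can be illustrated with a dequantization sketch. The exact wire format of tensor.Q4Storage is an assumption here; the sketch follows the common ggml-style Q4_0 convention of 32 values per block, one scale, and 4-bit quants offset by 8, with low nibbles holding the first 16 values and high nibbles the last 16.

```go
package main

import "fmt"

const q4BlockSize = 32 // why K must be a multiple of 32

// q4Block sketches one Q4_0 block: a single scale plus 32 4-bit
// quants packed two per byte. This layout is an assumption (the
// common ggml convention: value = scale * (quant - 8), low nibbles
// first, high nibbles second); tensor.Q4Storage may differ in detail.
type q4Block struct {
	scale  float32
	quants [q4BlockSize / 2]byte
}

// dequantQ4 expands one block into 32 float32 values.
func dequantQ4(b q4Block, out []float32) {
	for i := 0; i < q4BlockSize/2; i++ {
		lo := int(b.quants[i] & 0x0F)
		hi := int(b.quants[i] >> 4)
		out[i] = b.scale * float32(lo-8)
		out[i+q4BlockSize/2] = b.scale * float32(hi-8)
	}
}

func main() {
	blk := q4Block{scale: 0.5}
	// Low nibble 9 -> 0.5*(9-8) = 0.5 at index 0;
	// high nibble 10 -> 0.5*(10-8) = 1 at index 16.
	blk.quants[0] = 0xA9
	out := make([]float32, q4BlockSize)
	dequantQ4(blk, out)
	fmt.Println(out[0], out[16]) // 0.5 1
}
```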

func GemmF32Q8NT

func GemmF32Q8NT(m, n, k int, a []float32, b *tensor.Q8Storage, c []float32)

GemmF32Q8NT computes C = A * B^T where A is float32 [M,K] and B is Q8_0 [N,K]. B is stored in row-major Q8 format: each row j of B (length K) is contiguous in Q8 blocks. The "NT" suffix means B is Not Transposed. The fast path requires K to be a multiple of 32; otherwise it falls back to dequant+transpose+SGEMM.

func GemmF64

func GemmF64(m, n, k int, a, b, c []float64)

GemmF64 computes C = A * B for row-major contiguous matrices.

func GemmQ4F32

func GemmQ4F32(m, n, k int, a *tensor.Q4Storage, b, c []float32)

GemmQ4F32 computes C = A * B where A is Q4_0 quantized and B, C are float32. A has logical shape (m, k), B has shape (k, n), C has shape (m, n). Uses the fused dequant+multiply path that avoids heap-allocating the full dequantized A matrix. Falls back to dequant+SgemmSimd for non-32-aligned K.

func GemmQ4F32Fused

func GemmQ4F32Fused(m, n, k int, a *tensor.Q4Storage, b, c []float32)

GemmQ4F32Fused computes C = dequant(A) * B with fused dequant+multiply. Instead of allocating a full M*K dequantized buffer, it dequantizes one Q4 block (32 values = 128 bytes) at a time into a stack buffer, then multiplies using SIMD-accelerated sgemmAccRow. This eliminates the O(M*K) heap allocation of GemmQ4F32, making it faster for decode (M=1) where the dequant allocation dominates, and equally fast for larger M. K must be a multiple of 32; falls back to GemmQ4F32 otherwise.
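The block-at-a-time strategy above can be sketched as follows. To keep the sketch self-contained, A is already float32 and "dequantize" is just a copy into the 32-value scratch buffer; the real kernel would expand Q4 nibbles into that buffer instead, and sgemmAccRow would be the SIMD accumulation.

```go
package main

import "fmt"

const blockSize = 32

// fusedGemm sketches the fused dequant+multiply strategy: instead of
// materializing all of dequant(A) (an M*K heap buffer), dequantize one
// 32-value block of A's row into a small stack buffer and immediately
// accumulate its contribution into the matching row of C.
func fusedGemm(m, n, k int, a, b, c []float32) {
	var buf [blockSize]float32 // per-block scratch (stack-allocated)
	for i := 0; i < m; i++ {
		for kb := 0; kb < k; kb += blockSize {
			copy(buf[:], a[i*k+kb:i*k+kb+blockSize]) // stand-in for Q4 dequant
			for p := 0; p < blockSize; p++ {
				av := buf[p]
				row := b[(kb+p)*n : (kb+p)*n+n]
				for j := 0; j < n; j++ {
					c[i*n+j] += av * row[j] // sgemmAccRow-style accumulation
				}
			}
		}
	}
}

func main() {
	m, n, k := 1, 2, 32 // M=1 is the decode case the doc mentions
	a := make([]float32, k)
	b := make([]float32, k*n)
	for i := range a {
		a[i] = 1
		b[i*n] = 1 // column 0 all ones, column 1 all zeros
	}
	c := make([]float32, m*n)
	fusedGemm(m, n, k, a, b, c)
	fmt.Println(c) // [32 0]
}
```

Note that C is accumulated into rather than overwritten per block, which is why the scratch buffer never needs to outlive a single 32-value block.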

func GemmQ8F32

func GemmQ8F32(m, n, k int, a *tensor.Q8Storage, b, c []float32)

GemmQ8F32 computes C = A * B where A is Q8_0 quantized and B, C are float32. Dequantizes A once upfront, then uses SIMD-accelerated SGEMM.

func InitPool

func InitPool(n int)

InitPool creates the shared worker pool used by parallel GEMV routines. Call once from the engine constructor. n is the number of workers.

func RMSNormF32

func RMSNormF32(out, x, weight *float32, dim int, eps float32, scale *float32)

func RoPEF32

func RoPEF32(out, in, cos, sin *float32, halfDim, headDim int)

func SgemmSimd

func SgemmSimd(m, n, k int, a, b, c []float32)

SgemmSimd computes C = A*B using SIMD-accelerated operations (AVX2 on amd64, NEON on arm64) when available. A is m×k, B is k×n, C is m×n. All row-major.

func ShutdownPool

func ShutdownPool()

ShutdownPool closes the shared worker pool. Safe to call if not initialized.

func SiLUF32

func SiLUF32(out, x *float32, n int)

SiLUF32 computes silu(x) = x / (1 + exp(-x)) for n float32 values.

func SiLUGateF32

func SiLUGateF32(out, gate, up *float32, n int)

SiLUGateF32 computes silu(gate) * up for n float32 values (SwiGLU operation).
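A scalar reference for the two formulas above (the pointer-based signatures are shown here with slices for clarity; only the arithmetic is the point):

```go
package main

import (
	"fmt"
	"math"
)

// silu matches the formula in SiLUF32's doc: x / (1 + exp(-x)).
func silu(x float32) float32 {
	return x / (1 + float32(math.Exp(float64(-x))))
}

// siluGate is a scalar reference for SiLUGateF32's SwiGLU operation:
// out[i] = silu(gate[i]) * up[i].
func siluGate(out, gate, up []float32) {
	for i := range out {
		out[i] = silu(gate[i]) * up[i]
	}
}

func main() {
	gate := []float32{0, 1, -1}
	up := []float32{2, 2, 2}
	out := make([]float32, 3)
	siluGate(out, gate, up)
	fmt.Println(out)
}
```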

func SoftmaxF32

func SoftmaxF32(data *float32, n int)
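SoftmaxF32 carries no doc comment; its signature suggests an in-place softmax over data[0:n]. Below is a scalar sketch in the standard numerically stable form (subtract the maximum before exponentiating). Whether SoftmaxF32 uses exactly this formulation is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// softmax is a scalar sketch of an in-place softmax: subtract the max
// for numerical stability, exponentiate, then normalize so the values
// sum to 1.
func softmax(data []float32) {
	mx := data[0]
	for _, v := range data {
		if v > mx {
			mx = v
		}
	}
	var sum float32
	for i, v := range data {
		e := float32(math.Exp(float64(v - mx)))
		data[i] = e
		sum += e
	}
	for i := range data {
		data[i] /= sum
	}
}

func main() {
	d := []float32{1, 2, 3}
	softmax(d)
	fmt.Println(d[0] + d[1] + d[2]) // approximately 1
}
```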

func VaddF32

func VaddF32(out, a, b *float32, n int)

VaddF32 computes out[i] = a[i] + b[i] for n float32 values.

func VaddScalarF32

func VaddScalarF32(out, a *float32, scalar float32, n int)

VaddScalarF32 computes out[i] = a[i] + scalar for n float32 values.

func VdivF32

func VdivF32(out, a, b *float32, n int)

VdivF32 computes out[i] = a[i] / b[i] for n float32 values.

func VdivScalarF32

func VdivScalarF32(out, a *float32, scalar float32, n int)

VdivScalarF32 computes out[i] = a[i] / scalar for n float32 values.

func VexpF32

func VexpF32(out, x *float32, n int)

VexpF32 computes out[i] = exp(x[i]) for n float32 values (scalar fallback).

func VmulF32

func VmulF32(out, a, b *float32, n int)

VmulF32 computes out[i] = a[i] * b[i] for n float32 values.

func VmulScalarF32

func VmulScalarF32(out, a *float32, scalar float32, n int)

VmulScalarF32 computes out[i] = a[i] * scalar for n float32 values.

func VsubF32

func VsubF32(out, a, b *float32, n int)

VsubF32 computes out[i] = a[i] - b[i] for n float32 values.

Types

This section is empty.
