Documentation ¶
Index ¶
- func GemmF8(m, n, k int, a, b, c []float8.Float8)
- func GemmF16(m, n, k int, a, b, c []float16.Float16)
- func GemmF32(m, n, k int, a, b, c []float32)
- func GemmF32Q4NT(m, n, k int, a []float32, b *tensor.Q4Storage, c []float32)
- func GemmF32Q8NT(m, n, k int, a []float32, b *tensor.Q8Storage, c []float32)
- func GemmF64(m, n, k int, a, b, c []float64)
- func GemmQ4F32(m, n, k int, a *tensor.Q4Storage, b, c []float32)
- func GemmQ4F32Fused(m, n, k int, a *tensor.Q4Storage, b, c []float32)
- func GemmQ8F32(m, n, k int, a *tensor.Q8Storage, b, c []float32)
- func InitPool(n int)
- func RMSNormF32(out, x, weight *float32, dim int, eps float32, scale *float32)
- func RoPEF32(out, in, cos, sin *float32, halfDim, headDim int)
- func SgemmSimd(m, n, k int, a, b, c []float32)
- func ShutdownPool()
- func SiLUF32(out, x *float32, n int)
- func SiLUGateF32(out, gate, up *float32, n int)
- func SoftmaxF32(data *float32, n int)
- func VaddF32(out, a, b *float32, n int)
- func VaddScalarF32(out, a *float32, scalar float32, n int)
- func VdivF32(out, a, b *float32, n int)
- func VdivScalarF32(out, a *float32, scalar float32, n int)
- func VexpF32(out, x *float32, n int)
- func VmulF32(out, a, b *float32, n int)
- func VmulScalarF32(out, a *float32, scalar float32, n int)
- func VsubF32(out, a, b *float32, n int)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GemmF32 ¶
func GemmF32(m, n, k int, a, b, c []float32)
GemmF32 computes C = A * B for row-major contiguous matrices. A has shape (m, k), B has shape (k, n), C has shape (m, n). Strides are assumed to be k for A and n for B and C. Uses a SIMD-accelerated kernel (AVX2 on amd64, NEON on arm64) when available.
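The row-major layout and stride convention can be illustrated with a plain-Go reference (naive triple loop, no SIMD); `gemmRef` is a name invented here for illustration, not part of this package:

```go
package main

import "fmt"

// gemmRef is a naive reference for GemmF32's contract:
// C = A * B, all matrices row-major and contiguous.
// A is (m, k) with stride k; B is (k, n) and C is (m, n), both with stride n.
func gemmRef(m, n, k int, a, b, c []float32) {
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[p*n+j]
			}
			c[i*n+j] = sum
		}
	}
}

func main() {
	// (2x2) * (2x2): [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
	a := []float32{1, 2, 3, 4}
	b := []float32{5, 6, 7, 8}
	c := make([]float32, 4)
	gemmRef(2, 2, 2, a, b, c)
	fmt.Println(c)
}
```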
func GemmF32Q4NT ¶
func GemmF32Q4NT(m, n, k int, a []float32, b *tensor.Q4Storage, c []float32)
GemmF32Q4NT computes C = A * B^T where A is float32 [M,K] and B is Q4_0 [N,K]. B is stored in row-major Q4 format: each row j of B (length K) is contiguous in Q4 blocks. The "NT" suffix means B is Not Transposed — the caller passes B in its original [N,K] layout and this function computes the transpose implicitly. K must be a multiple of 32; otherwise it falls back to dequant+transpose+SGEMM.
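The "NT" index math can be shown with a plain-float32 reference that skips quantization entirely: since B stays in its [N,K] layout, row j of B is dotted directly against row i of A. `gemmNTRef` is an illustrative name, not part of this package:

```go
package main

import "fmt"

// gemmNTRef illustrates the "NT" contract with quantization omitted:
// C = A * B^T, where A is [M,K] row-major and B is kept in its original
// [N,K] row-major layout, so b[j*k+p] (not b[p*n+j]) is the element
// multiplied against a[i*k+p].
func gemmNTRef(m, n, k int, a, b, c []float32) {
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[j*k+p] // row of A dotted with row of B
			}
			c[i*n+j] = sum
		}
	}
}

func main() {
	// A = [[1,2]] (M=1, K=2), B = [[3,4],[5,6]] (N=2, K=2)
	// C = A * B^T = [[1*3+2*4, 1*5+2*6]] = [[11, 17]]
	c := make([]float32, 2)
	gemmNTRef(1, 2, 2, []float32{1, 2}, []float32{3, 4, 5, 6}, c)
	fmt.Println(c)
}
```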
func GemmF32Q8NT ¶
func GemmF32Q8NT(m, n, k int, a []float32, b *tensor.Q8Storage, c []float32)
GemmF32Q8NT computes C = A * B^T where A is float32 [M,K] and B is Q8_0 [N,K]. B is stored in row-major Q8 format: each row j of B (length K) is contiguous in Q8 blocks. The "NT" suffix means B is Not Transposed. K must be a multiple of 32; otherwise it falls back to dequant+transpose+SGEMM.
func GemmQ4F32 ¶
func GemmQ4F32(m, n, k int, a *tensor.Q4Storage, b, c []float32)
GemmQ4F32 computes C = A * B where A is Q4_0 quantized and B, C are float32. A has logical shape (m, k), B has shape (k, n), C has shape (m, n). Uses the fused dequant+multiply path that avoids heap-allocating the full dequantized A matrix. Falls back to dequant+SgemmSimd for non-32-aligned K.
func GemmQ4F32Fused ¶
func GemmQ4F32Fused(m, n, k int, a *tensor.Q4Storage, b, c []float32)
GemmQ4F32Fused computes C = dequant(A) * B with fused dequant+multiply. Instead of allocating a full M*K dequantized buffer, it dequantizes one Q4 block (32 values = 128 bytes) at a time into a stack buffer, then multiplies using SIMD-accelerated sgemmAccRow. This eliminates the O(M*K) heap allocation of GemmQ4F32, making it faster for decode (M=1) where the dequant allocation dominates, and equally fast for larger M. K must be a multiple of 32; falls back to GemmQ4F32 otherwise.
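The fused idea can be sketched in plain Go. The block layout below is an assumption for illustration (a common Q4_0 shape: 32 values per block as one scale plus 16 bytes of packed 4-bit quants), not necessarily this package's tensor.Q4Storage, and `fusedRow` is an invented name:

```go
package main

import "fmt"

// q4Block mirrors a common Q4_0 block layout (an assumption here, not
// necessarily tensor.Q4Storage): 32 values per block, stored as one
// scale plus 16 bytes of packed 4-bit quants, value = (nibble - 8) * scale.
type q4Block struct {
	scale float32
	qs    [16]byte
}

// fusedRow sketches the fused dequant+multiply path for one row of A:
// each 32-value block is dequantized into a small stack buffer and
// immediately multiplied against B ([K,N] row-major, K = 32*len(blocks)),
// accumulating into out (length n). No M*K dequantized copy of A is
// ever allocated.
func fusedRow(blocks []q4Block, b, out []float32, n int) {
	var buf [32]float32 // one dequantized block at a time
	for bi, blk := range blocks {
		for i := 0; i < 16; i++ {
			buf[i] = float32(int(blk.qs[i]&0x0F)-8) * blk.scale
			buf[i+16] = float32(int(blk.qs[i]>>4)-8) * blk.scale
		}
		base := bi * 32
		for i, av := range buf {
			row := (base + i) * n
			for j := 0; j < n; j++ {
				out[j] += av * b[row+j]
			}
		}
	}
}

func main() {
	// One block, scale 1, all quants zero -> 32 dequantized values of -8.
	// B is a [32,1] column of ones, so out[0] = 32 * -8 = -256.
	b := make([]float32, 32)
	for i := range b {
		b[i] = 1
	}
	out := make([]float32, 1)
	fusedRow([]q4Block{{scale: 1}}, b, out, 1)
	fmt.Println(out)
}
```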
func GemmQ8F32 ¶
func GemmQ8F32(m, n, k int, a *tensor.Q8Storage, b, c []float32)
GemmQ8F32 computes C = A * B where A is Q8_0 quantized and B, C are float32. Dequantizes A once upfront, then uses SIMD-accelerated SGEMM.
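The upfront dequantization step can be sketched as follows. The Q8_0 layout assumed here (blocks of 32 int8 quants sharing one float32 scale, value = quant * scale) is a common convention, not necessarily this package's tensor.Q8Storage, and `dequantQ8Ref` is an invented name:

```go
package main

import "fmt"

// dequantQ8Ref sketches the one-time dequantization GemmQ8F32 performs
// before handing the result to the SIMD SGEMM: each int8 quant is scaled
// by its block's shared float32 scale (one scale per 32 quants).
func dequantQ8Ref(scales []float32, quants []int8) []float32 {
	out := make([]float32, len(quants))
	for i, q := range quants {
		out[i] = float32(q) * scales[i/32]
	}
	return out
}

func main() {
	quants := make([]int8, 32)
	quants[0], quants[1] = -4, 7
	// With scale 0.5: -4 -> -2, 7 -> 3.5
	fmt.Println(dequantQ8Ref([]float32{0.5}, quants)[:2])
}
```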
func InitPool ¶
func InitPool(n int)
InitPool creates the shared worker pool used by parallel GEMV routines. Call once from the engine constructor. n is the number of workers.
func SgemmSimd ¶
func SgemmSimd(m, n, k int, a, b, c []float32)
SgemmSimd computes C = A*B using AVX2-accelerated operations. A is m×k, B is k×n, C is m×n. All row-major.
func ShutdownPool ¶
func ShutdownPool()
ShutdownPool closes the shared worker pool. Safe to call if not initialized.
func SiLUGateF32 ¶
func SiLUGateF32(out, gate, up *float32, n int)
SiLUGateF32 computes silu(gate) * up for n float32 values (SwiGLU operation).
func SoftmaxF32 ¶
func SoftmaxF32(data *float32, n int)
func VaddScalarF32 ¶
func VaddScalarF32(out, a *float32, scalar float32, n int)
VaddScalarF32 computes out[i] = a[i] + scalar for n float32 values.
func VdivScalarF32 ¶
func VdivScalarF32(out, a *float32, scalar float32, n int)
VdivScalarF32 computes out[i] = a[i] / scalar for n float32 values.
func VmulScalarF32 ¶
func VmulScalarF32(out, a *float32, scalar float32, n int)
VmulScalarF32 computes out[i] = a[i] * scalar for n float32 values.
Types ¶
This section is empty.