manta

module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 26, 2026 License: Apache-2.0

README

Manta

Manta is an inference-first GPU language and runtime stack. It compiles .manta source into backend-neutral .mll execution plans for GPU-accelerated embedding, reranking, retrieval-time scoring, and decode-time inference. Write the model-facing compute once, then run it through CUDA, Metal, Vulkan, DirectML, or WebGPU with the same artifact and the same entrypoint contract.

  • Inference-first product surface with kernel and pipeline abstractions
  • Three-level IR pipeline: HIR (typed) -> MIR (semantic) -> LIR (scheduled)
  • Portable artifact format (.mll): compile once, deploy anywhere
  • Portable backend surface: CUDA, Metal, Vulkan, DirectML, and WebGPU variants from the same source
  • First-class KV cache: kv_cache type with kv_read/kv_write for autoregressive decoding
  • TurboQuant-native direction: Manta is designed to consume and emit quantized tensors and quantized vectors without repacking through a separate framework
  • Schedule hints: tile, vector_width, subgroup, memory classes -- backend-neutral, lowered late
  • Hybrid runtime: backend-native execution where promoted kernels exist, host reference execution where they do not yet
  • Go-authored tree-sitter grammar: Manta source parsing is backed by gotreesitter, while the compiler keeps its source-oriented AST
  • Pure Go toolchain: no Python, no C++ build dependencies

Product direction

The near-term target is an inference-first product surface for embedding, reranking, retrieval-time scoring, decode, and CorkScrewDB integration. Manta also owns the native training path needed to produce those default artifacts, so model authors do not need to train in Python, export through another format, and deploy through a separate runtime.

The credible long-context wedge is the best local long-context embedder: consumer-GPU trainable, compressed for local serving, sealed as .mll, and strong on long-document retrieval. The target and scorecard are tracked in docs/local-long-context-embedder-wedge.md, with lower-level sparse attention success criteria in docs/consumer-subquadratic-gpu-spec.md.

Agent Skill

Agents working with Manta should use the using-manta skill.

Current embedder work is focused on retrieval-aligned training, not pairwise-only wins. The alignment harness now supports source-aware hard-negative scheduling, promotion gates over full retrieval scoreboards, recall@100 guardrails, grouped hard-negative InfoNCE, hybrid InfoNCE, and teacher-score distillation over mined candidate groups. The current nDCG best is the teacher-distilled hybrid follow-up with grouped weight 0.05, teacher weight 0.20, LR 0.000010, NF-biased model-hard mining, and nfcorpus=3 source bias during training; macro nDCG@10 improves from 0.145568 to 0.147862 against the previous best while staying inside the nDCG and recall@100 floors.

That means the language and runtime should bias toward:

  • quantized weights and quantized vectors as first-class runtime values
  • low-friction query-time embedding and scoring entrypoints
  • rerank/select entrypoints that can return ids directly
  • inference kernels that can consume TurboQuant-native layouts directly
  • portable .mll artifacts that CorkScrewDB can load without rewriting host code
  • sealed MLL package exports that carry model definition, weights, tokenizer, memory plan, and metadata together

Training A CorkScrew Embedding

The default CorkScrew embedding model is intended to be born from this repository's own pipeline: local BEIR data, Manta-native training, Manta retrieval evaluation, sealed .mll export, and optional CorkScrew asset installation.

MANTA_REPO_ROOT=$PWD \
MANTA_INSTALL_CORKSCREW=1 \
ferrous-wheel run scripts/train_manta_embed_v1_shipping_pipeline.fw

The shipping pipeline trains a mixed pretraining + BEIR Stage A model, mines model-hard negatives from that candidate, builds a FiQA-weighted BEIR fine-tune set, trains Stage B from the Stage A trainable package, and gates the final sealed artifact on SciFact, NfCorpus, and FiQA retrieval ndcg@10.

Install

go install github.com/odvcencio/manta/cmd/manta@latest

Quick Start

Write a .manta source file:

param token_embedding: f16[V, D] @weight("weights/token_embedding")
param projection: f16[D, E] @weight("weights/projection")

kernel l2_normalize(x: f16[T, E]) -> f16[T, E] {
    return normalize(x)
}

pipeline embed(tokens: i32[T]) -> f16[T, E] {
    let hidden = gather(token_embedding, tokens)
    let projected = @matmul(hidden, projection)
    return l2_normalize(projected)
}

Compile and run:

manta compile embed.manta embed.mll
manta run embed.mll embed

Or use the built-in demo:

manta demo tiny_embed
manta demo tiny_decode
manta demo tiny_score
manta demo tiny_rerank
manta demo tiny_select

Language

Declarations

Parameters bind external weights with shape and dtype:

param wq: f16[D, D] @weight("weights/wq")

Kernels define fused compute regions:

kernel l2_normalize(x: f16[T, E]) -> f16[T, E] {
    return normalize(x)
}

Pipelines orchestrate steps including intrinsics, kernel calls, and KV cache operations:

pipeline decode_step(x: f16[T, D], cache: kv_cache) -> f16[T, D] {
    let q = @matmul(x, wq)
    let q2 = rope(q)
    let past = kv_read(cache)
    kv_write(cache, q2)
    return softmax(q2 + past)
}
Types
Type Description
f16[D1, D2, ...] Half-precision tensor with symbolic shape
f32[D1, D2, ...] Single-precision tensor
i32[D1, ...] Integer tensor
q4[D1, D2, ...] 4-bit quantized tensor/value buffer
q8[D1, D2, ...] 8-bit quantized tensor/value buffer
q_norm[D1, D2, ...] Log-quantized norm sidecar for TurboQuant blocks
kv_cache Mutable key-value cache for autoregressive decoding

Dimensions are symbolic (T, D, V, E) and resolved at load time.

Quantized inference dtypes such as q4[...] and q8[...] are part of the active Manta surface. The current bootstrap runtime can consume them directly for quantized scoring paths, and the next step is richer TurboQuant block-format coverage.

Operations

Intrinsics (prefixed with @) dispatch to backend libraries:

Intrinsic Signature Backend dispatch
@matmul(a, b) f16[M,K] x f16[K,N] -> f16[M,N] cuBLAS / MPS

Built-in functions:

Function Description
gather(table, indices) Lookup rows by index
softmax(x) Row-wise softmax
normalize(x) L2 normalization per row
rmsnorm(x) RMS normalization per row
rope(x) Rotary position embeddings
dequant(x) Dequantization conversion
dot(query, docs) Row-wise dot product scoring
cosine(query, docs) Row-wise cosine scoring
l2_distance(query, docs) Row-wise distance scoring
topk(scores, k) Return top-k indices from a score vector
sparse_attention(q, k, v, top_k) Top-k sparse attention over dense Q/K/V tensors
turbo_sparse_attention(q, kc, kn, vc, vn, top_k[, route_block_size, route_top_blocks]) Sparse attention over TurboQuant-compressed K/V tensors, optionally routed through top-scoring key blocks
kv_read(cache) Read from KV cache
kv_write(cache, value) Write to KV cache

Binary operators: +, -, *, / (element-wise on tensors).

Statements
let result = @matmul(x, w)    // local binding
return softmax(result)         // return value
kv_write(cache, value)         // expression statement (side effect)

Compilation Pipeline

.manta source
  |  Parse (gotreesitter grammar)
  v
Syntax AST
  |  Semantic analysis (type check, scope, constraints)
  v
HIR -- source-oriented, fully typed, symbolic shapes
  |
  v
MIR -- tensor-semantic operations (gather, matmul, softmax, ...)
  |
  v
LIR -- scheduled plan: buffers, kernels, schedule hints, storage classes
  |
  v
.mll artifact -- backend-neutral plan with CUDA, Metal, Vulkan, DirectML, and WebGPU kernel variants
Intermediate representations

HIR preserves source structure with full type information. Parameters carry binding paths, entry points distinguish kernels from pipelines, types include symbolic shape dimensions.

MIR classifies operations semantically: gather, matmul, softmax, rope, kv_read, kv_write, kernel_call, pointwise, reduce, dequant. Backend-neutral.

LIR is a scheduled execution plan. Buffers have storage classes (device_local, host_visible, unified). Kernels have schedule hints (tile, vector_width, subgroup, memory). All concepts are backend-neutral -- tile not blockDim, subgroup not warp.

Schedule hints

Hints guide backend-specific lowering without leaking backend concepts into the IR:

Operation Tile Vector Width Subgroup Memory
softmax [64] 1 yes workgroup_local
normalize [128] 4 yes workgroup_local
rope [128] 2 yes --
binary ops [128] 4 -- --

These map to thread blocks, threadgroups, workgroups, or backend graph dispatches at the backend level.

Artifact Format

The .mll artifact carries a Manta execution plan:

{
  "version": "manta/v0alpha1",
  "name": "tiny_embed",
  "params": [{"name": "token_embedding", "binding": "weights/token_embedding", ...}],
  "entry_points": [{"name": "embed", "kind": "pipeline", ...}],
  "requirements": {"supported_backends": ["cuda", "metal", "vulkan", "directml", "webgpu"]},
  "buffers": [{"name": "hidden", "dtype": "f16", "storage_class": "device_local", ...}],
  "kernels": [{"name": "l2_normalize", "variants": [
    {"backend": "cuda", "entry": "l2_normalize_cuda", "source": "..."},
    {"backend": "metal", "entry": "l2_normalize_metal", "source": "..."},
    {"backend": "vulkan", "entry": "l2_normalize_vulkan", "source": "..."},
    {"backend": "directml", "entry": "l2_normalize_directml", "source": "..."},
    {"backend": "webgpu", "entry": "l2_normalize_webgpu", "source": "..."}
  ], ...}],
  "steps": [
    {"kind": "gather", "inputs": ["token_embedding", "tokens"], "outputs": ["hidden"]},
    {"kind": "matmul", "inputs": ["hidden", "projection"], "outputs": ["projected"]},
    {"kind": "launch_kernel", "kernel": "l2_normalize", "inputs": ["projected"], "outputs": ["result"]},
    {"kind": "return", "outputs": ["embeddings"]}
  ]
}

Artifacts are validated on load: all referenced buffers, kernels, and entry points must exist, I/O flows must be consistent, and kernel variants must be present for all declared backends.

Runtime

Loading and executing
import (
    mantaartifact "github.com/odvcencio/manta/artifact/manta"
    "github.com/odvcencio/manta/runtime"
    "github.com/odvcencio/manta/runtime/backend"
    "github.com/odvcencio/manta/runtime/backends/cuda"
    "github.com/odvcencio/manta/runtime/backends/directml"
    "github.com/odvcencio/manta/runtime/backends/metal"
    "github.com/odvcencio/manta/runtime/backends/vulkan"
    "github.com/odvcencio/manta/runtime/backends/webgpu"
)

rt := runtime.New(cuda.New(), metal.New(), vulkan.New(), directml.New(), webgpu.New())

prog, err := rt.LoadFile(ctx, "model.mll",
    runtime.WithWeight("token_embedding", embeddingData),
    runtime.WithWeight("projection", projectionData),
)

result, err := prog.Run(ctx, backend.Request{
    Entry:  "embed",
    Inputs: map[string]any{"tokens": []int32{1, 42, 7}},
})

output := result.Outputs["embeddings"]

The runtime tries each backend in registration order. The first backend that can load the module is selected. Weight bindings are validated against the module's parameter declarations.

For promoted kernel classes, the runtime can compile and launch backend-native kernels. Where a kernel shape has not been promoted yet, the backend still owns execution but may fall back to the host reference path. This keeps the runtime honest while Manta grows.

Backend interface
type Backend interface {
    Kind() mantaartifact.BackendKind
    CanLoad(mod *mantaartifact.Module) bool
    Load(ctx context.Context, mod *mantaartifact.Module, weights map[string]WeightBinding) (Executor, error)
}

type Executor interface {
    Backend() mantaartifact.BackendKind
    Run(ctx context.Context, req Request) (Result, error)
}
Execution trace

Every execution returns a trace of steps with kernel variant information:

for _, step := range result.Trace {
    fmt.Printf("%s: %s (variant: %s)\n", step.Kind, step.Name, step.Variant)
}
// gather: gather (variant: )
// matmul: matmul (variant: )
// launch_kernel: l2_normalize (variant: l2_normalize_cuda)
// return: return (variant: )

Outputs also carry backend launch metadata such as the selected kernel entry, launch API, tile-derived launch shape, and whether execution ran on device or through backend-owned host fallback.

Deployment targets

Manta is being shaped around two concrete deployment targets:

  1. standalone inference binaries written in Go
  2. CorkScrewDB as a runtime host for embedding, reranking, and quantized vector-aware scoring

That means .mll artifacts should be good at:

  • embed(tokens) -> embeddings
  • score(query, docs) -> scores
  • rerank(query, docs) -> top_ids
  • decode_step(x, cache) -> logits

and eventually:

  • consuming quantized vector collections directly
  • producing TurboQuant-native outputs without post-hoc repacking

CLI

manta compile <source.manta> [output.mll]             Compile .manta source to a Manta artifact
manta init-model [flags] <artifact.mll>             Create the default quantized embedding training package
manta train-corpus [flags] <artifact.mll> <corpus>  Train tokenizer, mine pairs, and fit the embedder
manta tokenize-embed <artifact.mll> <text> <tokens> Convert text JSONL to reusable token JSONL
manta train-embed [flags] <artifact.mll> <train>    Fit an initialized package on token or text JSONL
manta train-embed --eval-only <artifact.mll> <eval> Evaluate a package without optimizer steps
manta train-embed --no-tokenizer <artifact.mll> <tokens> Force token JSONL beside a tokenizer
manta eval-retrieval [flags] <artifact.mll> <beir>  Score BEIR-style retrieval with Manta embeddings
manta eval-retrieval-bm25 [flags] <beir>            Score the same retrieval files with BM25
manta compare-train-metrics <current> [baseline]    Summarize training metrics JSON and deltas
manta diagnose-train-metrics <metrics.json>        Explain backend use and transfer pressure
manta gate-train-metrics [flags] <metrics.json>    Enforce quality and efficiency thresholds
manta export-mll <artifact.mll> [output.mll]        Seal an artifact package into a weight-carrying MLL file
manta inspect <artifact.mll>                        Inspect and verify an artifact package
manta run <artifact.mll> [entry]                    Load and execute an artifact entry point
manta demo [tiny_embed|tiny_decode|tiny_score]      Run a built-in preset module
manta version                                      Print version

Before a candidate run, use ferrous-wheel run scripts/verify_manta_production.fw to preflight the local .mll training, eval-only, sealed export, and inspect path.

For production-grade manta-embed-v1 candidate training, use scripts/acquire_manta_embed_v1_datasets.fw followed by scripts/train_manta_embed_v1_candidate.fw; they record dataset hashes, repo provenance, eval-only gates, sealed export, and artifact hashes. See docs/production-embedding.md.

For the local long-context embedder wedge scoreboard, run scripts/score_manta_embed_v1_baselines.fw against a sealed candidate plus pairwise, hard, retrieval, and optional long-document eval sets. It writes scoreboard.tsv and scoreboard.json under runs/<run-id>/. See docs/benchmarks.md and docs/local-long-context-embedder-wedge.md.

Design Constraints

  • No CUDA-only concepts leak upward. The MIR and LIR use tile, subgroup, vector, and abstract memory classes. Backend-specific mapping happens only during kernel source emission.
  • Same source, same artifact, same runtime contract across all backends.

Status

The compiler, IR pipeline, artifact format, semantic analysis, runtime, and CLI are functional. CUDA executes promoted kernel classes on device on Linux, Metal has the matching Apple device path, and Vulkan, DirectML, and WebGPU now have artifact/compiler/runtime surfaces that execute through backend-owned host fallback while device runtimes land. The retrieval surface includes direct quantized scoring plus topk-based reranking.

Development

CGO_ENABLED=0 go test ./artifact/manta ./cmd/manta ./compiler ./models ./runtime/backend ./runtime/backends/metal ./runtime/backends/vulkan ./runtime/backends/directml ./runtime/backends/webgpu ./syntax
go build ./cmd/manta/

CUDA-backed runtime tests require a working CUDA device and should be run separately from the no-cgo public gate.

Benchmarks

Manta keeps reproducible perf checks as Ferrous Wheel workflows.

MANTA_BENCH_ROOT=$PWD ferrous-wheel run scripts/bench.fw
MANTA_BENCH_ROOT=$PWD MANTA_BENCH_CUDA=1 ferrous-wheel run scripts/bench.fw
MANTA_BENCH_ROOT=$PWD MANTA_BENCH_MODEL_ASSETS=/path/to/assets/manta-embed-v1 ferrous-wheel run scripts/bench.fw

Current manta-embed-v1 CUDA smoke: 845.15 train examples/s and 865437.87 train pairs/s on batch 1024, with the promoted grouped CUDA training path enabled by default. See docs/benchmarks.md for the full profile and the next perf targets.

License

Manta is open source under the Apache License, Version 2.0. See LICENSE and NOTICE.

Directories

Path Synopsis
artifact
cmd
manta command
retrievaldump command
retrievaldump: per-query diagnostic dump for an embedding model.
retrievaldump: per-query diagnostic dump for an embedding model.
ir
hir
lir
mir

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL