# ai

GPU-accelerated ML CLI. Train, infer, quantize, and serve LLMs. Zero Python, one binary.

Powered by the mongoose GPU compute engine.


## Install

macOS (Metal, Apple Silicon):

```sh
go install github.com/open-ai-org/ai@latest
```

Linux (CUDA, NVIDIA GPUs):

```sh
CGO_CFLAGS="-I/usr/local/cuda/include" \
CGO_LDFLAGS="-L/usr/local/cuda/lib64" \
go install github.com/open-ai-org/ai@latest
```

Any GPU via Vulkan (WebGPU: AMD, Intel, NVIDIA):

```sh
CGO_ENABLED=0 go install github.com/open-ai-org/ai@latest
```

CPU only:

```sh
CGO_ENABLED=0 go install github.com/open-ai-org/ai@latest
```

Or build from source:

```sh
git clone https://github.com/open-ai-org/ai.git
cd ai
go build -o ai .
```

## Quick Start

```sh
# Train from scratch
ai train data=corpus.txt

# Fine-tune a pretrained model
ai train model=Qwen2.5-0.5B data=corpus.txt

# Download a model
ai pull Qwen/Qwen2.5-0.5B

# Chat
ai chat Qwen2.5-0.5B

# Generate text
ai infer Qwen2.5-0.5B "The meaning of life is"

# Start an OpenAI-compatible API server
ai serve Qwen2.5-0.5B
```

## Commands

Commands use `key=value` for required arguments. `--flags` are advanced overrides for users who know what they want; everything has safe defaults.

### Training

| Command | Description |
| --- | --- |
| `ai train data=<file>` | Train from scratch |
| `ai train model=<name> data=<file>` | Fine-tune a pretrained model |
| `ai resume checkpoint=<dir> data=<file>` | Continue training from a checkpoint |

Auto-detects GPU: Metal on macOS, CUDA on Linux, Vulkan everywhere else.

Advanced overrides (optional; defaults are opinionated):

```sh
ai train data="/data/*.txt" --dim 512 --layers 8 --steps 5000 --lr 3e-4
```
### Evaluation & Inference

| Command | Description |
| --- | --- |
| `ai chat <model>` | Interactive chat (bubbletea TUI) |
| `ai infer <model> "prompt"` | Generate text |
| `ai eval model=<name> data=<file>` | Validation pass (loss + perplexity) |
| `ai benchmark <model>` | Inference throughput and latency |
### Optimization

| Command | Description |
| --- | --- |
| `ai quantize <model> [q8\|q4\|f16]` | Reduce precision |
| `ai prune <model>` | Remove low-magnitude weights (50% default) |
| `ai prune <model> --sparsity 0.7 --structured` | Structured head pruning |
| `ai convert gguf <model>` | Export to GGUF (for Ollama) |
| `ai merge <base> <adapters>` | Merge LoRA into base model |
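For example, a GGUF export can be loaded into Ollama through a Modelfile. A minimal sketch; the `.gguf` output path is an assumption, so substitute the file `ai convert gguf` actually writes:

```sh
ai convert gguf Qwen2.5-0.5B

# Hypothetical output path; use what `ai convert gguf` reports.
cat > Modelfile <<'EOF'
FROM ./Qwen2.5-0.5B.gguf
EOF
ollama create qwen-local -f Modelfile
ollama run qwen-local "Hello"
```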
### Data

| Command | Description |
| --- | --- |
| `ai dataset inspect <file>` | Dataset statistics and recommendations |
| `ai dataset split <file>` | Partition into train/val/test |
| `ai dataset augment <file>` | Dedup, lowercase, repeat, shuffle |
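A typical prep pass chains these before training. A sketch using only the commands above; the split output filename is hypothetical, so train on the paths `ai dataset split` actually reports:

```sh
ai dataset inspect corpus.txt   # statistics and recommendations first
ai dataset augment corpus.txt   # dedup, lowercase, repeat, shuffle
ai dataset split corpus.txt     # partition into train/val/test
# Hypothetical split filename; use the path `split` actually reports.
ai train data=corpus.train.txt
```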
| Command | Description |
| --- | --- |
| `ai sweep data=<file> lr=1e-4,3e-4,6e-4` | Hyperparameter search |
| `ai distill teacher=<model> data=<file>` | Knowledge distillation |
| `ai profile` | Per-op GPU timing breakdown |
### Deployment

| Command | Description |
| --- | --- |
| `ai serve <model>` | OpenAI-compatible API server |
| `ai serve <model> --no-stream` | Serve without streaming weight load (full VRAM, max tok/s from first token) |
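Because the server speaks the OpenAI API, standard OpenAI clients and plain HTTP should work. A minimal sketch; the port is an assumption, so check the address `ai serve` prints on startup:

```sh
ai serve Qwen2.5-0.5B &

# Port 8080 is assumed here; substitute the address the server reports.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-0.5B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```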
### Introspection

| Command | Description |
| --- | --- |
| `ai explain <model> "prompt"` | Token-level attribution |
| `ai checkpoint ls` | List training checkpoints |
| `ai checkpoint diff <a> <b>` | Compare checkpoints |
| `ai bench` | Raw GPU compute benchmark |
| `ai gpus` | Detect and calibrate hardware |
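Checkpoints compose with `ai resume` from the Training section. A sketch; the checkpoint names below are hypothetical, so use what `ai checkpoint ls` actually prints:

```sh
ai checkpoint ls                          # list training checkpoints
ai checkpoint diff ckpt-1000 ckpt-2000    # compare two checkpoints (hypothetical names)
ai resume checkpoint=ckpt-2000 data=corpus.txt
```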
### Models

| Command | Description |
| --- | --- |
| `ai pull <org/model>` | Download from HuggingFace |
| `ai models` | List downloaded models |
| `ai info <model>` | Show model architecture |
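Putting these together with the Quick Start model:

```sh
ai pull Qwen/Qwen2.5-0.5B   # download from HuggingFace
ai models                   # confirm it is listed locally
ai info Qwen2.5-0.5B        # inspect its architecture
```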

## Platform Support Matrix

Every command works on every backend. Optimized paths are used when available, with automatic fallback.

| Command | Metal | CUDA | WebGPU (Vulkan) | CPU |
| --- | --- | --- | --- | --- |
| `train` | yes | yes | yes | yes |
| `finetune` | yes | yes | yes (CPU train) | yes (CPU train) |
| `chat` | yes (fused/graph/generic) | yes (generic) | yes (generic) | yes (generic) |
| `infer` | yes (fused/graph/tier2) | yes (Q8/tier2) | yes (tier2/3) | yes (tier3) |
| `serve` | yes (streaming/fused) | yes (Q8 fused) | yes (generic) | yes (generic) |
| `eval` | yes (planned) | yes | planned | planned |
| `profile` | planned | yes | planned | planned |
| `explain` | yes | yes | yes | yes |
| `sweep` | yes (graph) | yes (exec) | yes (exec) | yes (exec) |

## Performance

### Training convergence (dim=512, RTX 5090)

```
step 1     loss 6.17
step 100   loss 2.59   floor 2.37
step 300   loss 2.05   floor 1.76
step 500   loss 1.95   floor 1.29   365 steps/s
```

### Inference (Qwen2.5-0.5B, Q8, `ai serve`)

| Metric | M4 Max | M1 Pro |
| --- | --- | --- |
| `ai benchmark` (avg) | 221.7 tok/s | 97.2 tok/s |
| `ai serve` throughput | 239–241 tok/s | 133–145 tok/s |
| TTFT (streaming) | 4 ms | 29 ms |
### Streaming inference (`ai serve`)

Weights stream into GPU memory in the background using ping-pong double buffers. The server accepts requests immediately — no cold start.

| Mode | VRAM | Tok/s | When |
| --- | --- | --- | --- |
| Streaming (cold) | 2 layers (~19x less) | ~9 tok/s | First request, weights still loading |
| Resident (warm) | Full model | 125–135 tok/s | After background load finishes |

Auto-switches from streaming to resident once all weights are loaded. Use `--no-stream` to skip streaming and wait for the full load before accepting requests.

```sh
ai serve Qwen2.5-0.5B              # streaming (default): instant cold start
ai serve Qwen2.5-0.5B --no-stream  # wait for full model load, max speed from first token
```
### CUDA serve (RTX 5090)

| Path | Tok/s |
| --- | --- |
| Generic (pre-v1.3.1) | 1.9 |
| Q8 fused (v1.3.1+) | 189 |

Zero-alloc hot path: all scratch buffers pre-allocated at model load.

## Architecture

- Automatic quantization: Q8 for models under 4B params, Q4 for 7B+ (force a level with `ai quantize`; see the example after this list)
- Metal 4 `matmul2d` TensorOp on macOS 26.3+
- Fused dequant-matvec kernels with zero intermediate buffers
- Custom Metal/CUDA compute shaders for RMSNorm, RoPE, GQA attention, SiLU
- Multi-slot inference: independent KV caches per request on separate Metal command queues
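The automatic Q8/Q4 choice can be overridden explicitly with `ai quantize` (syntax from the Optimization table above):

```sh
ai quantize Qwen2.5-0.5B q4    # force 4-bit where the size rule would pick Q8
ai quantize Qwen2.5-0.5B f16   # or keep half precision
```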

## GPU Support

| Backend | Hardware | Install |
| --- | --- | --- |
| Metal | Apple Silicon (M1+) | `go install` on macOS |
| CUDA | NVIDIA GPUs | `CGO_CFLAGS`/`CGO_LDFLAGS` + `go install` on Linux |
| Vulkan | Any GPU (AMD, Intel, NVIDIA) | `CGO_ENABLED=0 go install` (WebGPU) |
| CPU | Any | `CGO_ENABLED=0 go install` |
### Compute Kernels

Custom fused kernels unlock the fastest paths for training and inference. Without them, ai falls back to generic GPU matmul, which still works but is slower.

**Metal (macOS):** pre-compiled `.metallib` files ship in the mongoose repo. `brew install` places them automatically. For `go install`, copy them next to the binary:

```sh
# Metal 4 GEMM (optimal path; requires macOS >= 26.3)
cp mongoose/kernels/gemm_metal4.metallib $(go env GOPATH)/bin/

# Fused training kernels (RMSNorm, RoPE, attention, SiLU, AdamW)
cp mongoose/kernels/fused_train.metallib $(go env GOPATH)/bin/
```

Or compile from source:

```sh
cd mongoose/kernels
xcrun metal -std=metal4.0 -O2 -c gemm_metal4.metal -o gemm_metal4.air
xcrun metallib -o gemm_metal4.metallib gemm_metal4.air
cp gemm_metal4.metallib $(go env GOPATH)/bin/
```

Metal 4 matmul2d TensorOp requires macOS 26.3+ and Apple Silicon. On older macOS, ai falls back to MPS tiled GEMM automatically.

**CUDA (Linux):** compile from source (the build is architecture-specific):

```sh
cd mongoose/kernels
nvcc -shared -o libmongoose_kernels.so mongoose.cu \
  -Xcompiler -fPIC -gencode arch=compute_90,code=compute_90
cp libmongoose_kernels.so $(go env GOPATH)/bin/
```

Replace `compute_90` with your GPU's compute capability (e.g., `compute_89` for RTX 4090, `compute_120` for RTX 5090).
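On recent NVIDIA drivers, `nvidia-smi` can report the compute capability directly:

```sh
# Prints e.g. "8.9" for an RTX 4090, i.e. -gencode arch=compute_89,code=compute_89
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```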

| Kernel | What it fuses |
| --- | --- |
| RMSNorm | in-place + out-of-place, forward + backward |
| RoPE | rotate_half, forward + backward |
| GQA Attention | grouped-query attention decode |
| SiLU Gate Mul | SiLU activation * gate, forward + backward |
| Q8/Q4 Matvec | fused dequant + matrix-vector multiply |
| AdamW | fused optimizer step |
| Embedding | gather + scatter |
| Cross-Entropy | softmax + CE loss + gradient in one pass |


## License

MIT
