goDl v0.2.0 · Published: Mar 13, 2026 · License: MIT

goDl

A Go-native deep learning framework built on libtorch.
Same GPU kernels as PyTorch. No Python. No GIL. Just Go.

Status: Succeeded by floDl

goDl is no longer actively developed. Go's garbage collector cannot deterministically manage GPU (VRAM) memory, creating fundamental limitations for training workloads. floDl is a complete Rust port where Drop provides deterministic resource management. goDl remains available as-is under MIT license.


Graph Builder · Quick Start · Features · PyTorch → goDl · Tutorials · Architecture


The Graph Builder

goDl's fluent graph builder lets you describe complex architectures as readable data flow — no boilerplate, no graph construction commands.

model, err := graph.From(nn.MustLinear(2, 16)).   // input projection
    Through(nn.NewGELU()).                          // activation
    Through(nn.MustLayerNorm(16)).                  // normalization
    Also(nn.MustLinear(16, 16)).                    // residual connection
    Through(nn.MustLinear(16, 2)).                  // output projection
    Build()

That's a trainable model. Also adds the residual — input flows through the Linear and gets added to its output. Build() returns a graph.Graph that implements nn.Module — you can nest it inside other graphs.

Things get interesting when architectures get complex:

g, err := graph.From(encoder).Tag("encoded").                // tag for later
    Split(headA, headB, headC).Merge(graph.Mean()).          // multi-head + merge
    Loop(refinementBlock).For(3).Tag("refined").             // iterate 3 times
    Gate(router, expertA, expertB).Using("encoded").         // soft routing with context
    Switch(selector, lightPath, heavyPath).Using("refined"). // hard routing
    Through(graph.StateAdd()).Using("memory").Tag("memory").  // recurrent state
    Loop(decoder).While(haltCondition, 10).                  // adaptive computation
    Through(outputHead).
    Build()

Every construct — Split/Merge, Also, Loop, Gate, Switch, Map, Tag/Using — composes cleanly. Sub-graphs nest like any module. Forward references (Using before Tag) carry state across calls, enabling recurrent architectures without special-casing.

The graph executes nodes at the same topological level in parallel via goroutines — independent branches run concurrently without any extra code.
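To make that concurrency model concrete, here is a minimal sketch in plain Go (the `node` type and `runLevel` are illustrative stand-ins, not goDl API) of running one topological level's nodes in parallel with goroutines:

```go
package main

import (
	"fmt"
	"sync"
)

// node stands in for a graph node's forward computation.
type node func(x float64) float64

// runLevel executes every node of one topological level in its own
// goroutine and waits for all of them — the same pattern that lets
// independent branches run concurrently without extra user code.
func runLevel(level []node, x float64) []float64 {
	out := make([]float64, len(level))
	var wg sync.WaitGroup
	for i, n := range level {
		wg.Add(1)
		go func(i int, n node) {
			defer wg.Done()
			out[i] = n(x) // each index is written by exactly one goroutine
		}(i, n)
	}
	wg.Wait()
	return out
}

func main() {
	heads := []node{
		func(x float64) float64 { return x + 1 },
		func(x float64) float64 { return x * 2 },
		func(x float64) float64 { return x * x },
	}
	fmt.Println(runLevel(heads, 3)) // [4 6 9]
}
```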

See the Graph Builder Tutorial and the full showcase that exercises every builder method.

Quick Start

Requirements: Docker (with NVIDIA Container Toolkit for GPU support).

git clone https://github.com/fab2s/goDl.git
cd goDl
make image    # build dev container (Go + libtorch + CUDA)
make test     # run all 482 tests (CPU + CUDA)
make test-cpu # run without GPU
make doc      # local doc server (pkg.go.dev style)
make shell    # interactive shell in container

Train a model in 30 lines

// Task: learn cumulative sum — [a, b] → [a, a+b]

// Build the model.
model, err := graph.From(nn.MustLinear(2, 16)).
    Through(nn.NewGELU()).
    Through(nn.MustLayerNorm(16)).
    Also(nn.MustLinear(16, 16)).
    Through(nn.MustLinear(16, 2)).
    Build()

// Set up training.
optimizer := nn.NewAdam(model.Parameters(), 0.01)
model.SetTraining(true)

// Training loop.
for loader.Next() {
    input, target := loader.Batch()

    pred := model.Forward(autograd.NewVariable(input, true))
    loss := nn.MSELoss(pred, autograd.NewVariable(target, false))

    optimizer.ZeroGrad()
    loss.Backward()
    nn.ClipGradNorm(model.Parameters(), 1.0)
    optimizer.Step()
}

See examples/train/ for the complete runnable version with data generation and evaluation.

Features

Core Stack

Layer What it does
Tensor Immutable, chainable API with error propagation. CPU and CUDA.
Autograd Reverse-mode automatic differentiation. Full backward for every op.
NN Modules Linear, Conv2d, ConvTranspose2d, LayerNorm, BatchNorm, Dropout, Embedding, GRUCell, LSTMCell
Activations ReLU, Sigmoid, Tanh, GELU, SiLU, Softmax
Losses MSELoss, CrossEntropyLoss, BCEWithLogitsLoss, L1Loss, SmoothL1Loss, KLDivLoss
Optimizers SGD (with momentum), Adam, AdamW
LR Scheduling StepDecay, Cosine, Warmup (composable), ReduceOnPlateau
Mixed Precision Float16/BFloat16 dtype casting, GradScaler for loss scaling

Graph Builder

Method What it does
From(m).Through(m) Linear chain
Input(names...) Auxiliary graph inputs, accessible via Using(name) — multi-input graphs
Split(m...).Merge(op) Parallel branches, merged by Add(), Mean(), or Cat(dim)
Also(m) Residual connection: input + m(input)
Tag(name) / Using(refs...) Named references — backward (same pass) or forward (across calls)
Loop(body).For(n) Fixed iteration with BPTT
Loop(body).While(cond, max) Condition before body (0..max iterations)
Loop(body).Until(cond, max) Condition after body (1..max iterations)
Gate(router, experts...) Soft routing — all experts execute, weighted combination
Switch(selector, branches...) Hard routing — only selected branch executes
Map(body).Each() Apply body to each element along dim 0
Map(body).Over(tag) Iterate over a tagged tensor
Map(body).Slices(n) Decompose last dim into n slices, map, recompose
.Batched() Fast path for Map — full batch in one call
TagGroup(name) Name parallel branches: Split(...).TagGroup("head") → "head_0", "head_1", ...
g.ForwardCtx(ctx, inputs...) Context-aware execution — timeouts, cancellation for loops and maps

Training Tools

Tool What it does
nn.ClipGradNorm L2 norm gradient clipping
nn.ClipGradValue Element-wise gradient clamping
g.Freeze(tags...) / g.Unfreeze(tags...) Freeze parameters by tag name
nn.SaveParameters / nn.LoadParameters Binary checkpoint format (file path or io.Writer)
KaimingUniform/Normal, XavierUniform/Normal Weight initialization
data.Loader Batched data loading with parallel prefetch and shuffle
LR schedulers StepDecay, Cosine, Warmup, ReduceOnPlateau (composable)
nn.GradScaler Dynamic loss scaling for mixed precision (float16) training
nn.CastParameters Cast model parameters to any dtype (Float16, BFloat16, etc.)
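The clipping semantics are easy to state in plain Go. The sketch below mirrors the usual definition of global L2-norm clipping (as behind nn.ClipGradNorm); the real implementation operates on Variables rather than float slices:

```go
package main

import (
	"fmt"
	"math"
)

// clipGradNorm rescales all gradients in place so their global L2
// norm does not exceed maxNorm, returning the pre-clip norm.
func clipGradNorm(grads [][]float64, maxNorm float64) float64 {
	var sum float64
	for _, g := range grads {
		for _, v := range g {
			sum += v * v
		}
	}
	norm := math.Sqrt(sum)
	if norm > maxNorm {
		scale := maxNorm / norm
		for _, g := range grads {
			for i := range g {
				g[i] *= scale
			}
		}
	}
	return norm
}

func main() {
	grads := [][]float64{{3}, {4}} // global L2 norm = 5
	clipGradNorm(grads, 2.5)       // scale every gradient by 0.5
	fmt.Println(grads)             // [[1.5] [2]]
}
```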

Module Interfaces

Beyond Module and TrainToggler, modules can implement optional interfaces that the graph recognizes automatically:

Interface Method What happens
Resettable Reset(batchSize int64, device tensor.Device) Graph auto-calls before each Forward — modules with per-forward state (attention location, counter, accumulator) reset cleanly on the correct device
Traced Trace() *Variable Loop executor collects return value before first iteration and after each step — g.Traces(tag) returns the full trajectory
NamedInputModule ForwardNamed(stream, refs) Loop and node Using refs arrive as a named map instead of positional args
RefValidator RefNames() []string Build-time validation that exactly the expected Using refs are wired
SubModuler SubModules() []Module Declares child modules — framework walks the tree for device placement, training mode, state detachment, and reset
DeviceMover MoveToDevice(device) Moves non-parameter tensors (running stats, buffers) when SetDevice is called
Detachable Detach() Breaks gradient chains on retained state — called recursively by DetachState

These compose: a loop body that implements Resettable + Traced + NamedInputModule gets auto-reset, per-iteration trace collection, and named ref forwarding — all handled by the graph, no manual wiring.

For composite user modules, implementing SubModuler is all it takes — the framework handles parameter collection, device moves, training mode, and state detachment recursively.
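The detection mechanism itself is a plain Go type assertion. This sketch uses simplified stand-in interfaces (the names follow the README, but the signatures are not goDl's) to show how the graph can auto-reset a stateful module before each Forward:

```go
package main

import "fmt"

// Simplified stand-ins for the optional interfaces above.
type Module interface{ Forward(x float64) float64 }
type Resettable interface{ Reset() }

// counter carries per-forward state (a call count), so it should be
// reset before each Forward.
type counter struct{ calls float64 }

func (c *counter) Forward(x float64) float64 { c.calls++; return x + c.calls }
func (c *counter) Reset()                    { c.calls = 0 }

// forward mimics the graph runtime: if the module implements
// Resettable, Reset is called automatically before Forward.
func forward(m Module, x float64) float64 {
	if r, ok := m.(Resettable); ok {
		r.Reset()
	}
	return m.Forward(x)
}

func main() {
	c := &counter{}
	fmt.Println(forward(c, 10)) // 11 — state was reset
	fmt.Println(forward(c, 10)) // 11 again, not 12
}
```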

Tags double as observation points — collect metrics during training, flush to epoch history, and query trends to drive training decisions. Record injects external metrics (losses, hit rates) into the same pipeline:

for epoch := range epochs {
    for _, batch := range loader {
        pred := g.Forward(batch.Input)
        g.Collect("hidden")                                 // from graph tag

        loss := nn.CrossEntropyLoss(pred, target)
        g.Record("loss", loss.Item())                       // external metric
        g.Record("hit_rate", computeHitRate(pred, target))  // any float64
    }
    g.Flush()  // promotes batch means → epoch history (Collected + Recorded)

    if g.Trend("loss").Stalled(5, 1e-4) {
        scheduler.Decay()
    }
    if g.Trend("loss").Improving(3) {
        g.Unfreeze("decoder")
    }
}
Method What it does
g.Tagged(tag) Access a tagged node's output after Forward
g.Traces(tag) Access per-iteration side outputs from Traced loop bodies
g.Log(tags...) Print current tagged values (hookable via OnLog)
g.Collect(tags...) / g.Flush(tags...) Batch → epoch metric collection (from graph tags)
g.Record(tag, values...) Inject external metrics into the same Collect/Flush pipeline
g.Trend(tag) Epoch-level trend: Slope, Stalled, Improving, Converged
g.Trends(tags...) Group trends: AllImproving, AnyStalled, MeanSlope (expands TagGroups)
g.Sub(tag) Reach into a sub-graph's metrics — no extra Forward needed
g.ETA(totalEpochs) Estimated remaining wall-clock time from flush cadence
g.FlushCount() / g.Elapsed() How many flushes and total wall time since first
g.WriteLog(path, total, tags...) Human-readable text log with per-epoch metrics and ETA
trend.Latest() Most recent epoch value (convenience for Values()[len-1])
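One plausible definition behind Slope and Stalled — shown here as a self-contained sketch; the actual goDl implementation may differ — is a least-squares fit over the last n epoch values:

```go
package main

import (
	"fmt"
	"math"
)

// slope fits a least-squares line to the last n values and returns
// its gradient (change per epoch).
func slope(vals []float64, n int) float64 {
	if n > len(vals) {
		n = len(vals)
	}
	w := vals[len(vals)-n:]
	var sx, sy, sxx, sxy float64
	for i, y := range w {
		x := float64(i)
		sx += x
		sy += y
		sxx += x * x
		sxy += x * y
	}
	fn := float64(len(w))
	return (fn*sxy - sx*sy) / (fn*sxx - sx*sx)
}

// stalled reports whether the recent trend is flat to within eps.
func stalled(vals []float64, n int, eps float64) bool {
	return math.Abs(slope(vals, n)) < eps
}

func main() {
	loss := []float64{1.0, 0.5, 0.31, 0.30, 0.30}
	fmt.Println(stalled(loss, 3, 1e-2)) // the last 3 epochs are nearly flat
}
```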

Tag Groups & Trend Groups

TagGroup names parallel branches from Split with auto-suffixed tags. Trends and TimingTrends expand groups for aggregate queries:

g, _ := graph.From(encoder).
    Split(headA, headB, headC).TagGroup("head").  // head_0, head_1, head_2
    Merge(graph.Mean()).
    Build()

// Training loop with group observation.
for epoch := range epochs {
    for _, batch := range loader {
        g.Forward(batch.Input)
        g.Collect("head_0", "head_1", "head_2")
    }
    g.Flush()

    if g.Trends("head").AllImproving(5) {
        fmt.Println("all heads improving")
    }
    if g.Trends("head").AnyStalled(5, 1e-4) {
        fmt.Println("at least one head stalled")
    }
    fmt.Printf("mean slope: %.4f\n", g.Trends("head").MeanSlope(5))
}

Context-Aware Execution

ForwardCtx threads Go's context.Context through the graph. Loops and maps check for cancellation between iterations — enabling wall-clock timeouts that Python cannot express:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
result := g.ForwardCtx(ctx, input)  // loops abort if time runs out

This is a Go-native advantage. Python's GIL prevents cooperative cancellation inside a forward pass, and torch.compile breaks on dynamic control flow. In goDl, context propagation is essentially free when unused (context.Background().Done() returns a nil channel, so the cancellation check never fires) and composes naturally with loops, maps, and parallel branches.

Combined with the observation layer, this enables patterns like trend-driven loop halts, adaptive computation time with hard deadlines, and graceful training interruption — all impossible or clunky in Python.

Visualization

fmt.Println(g.DOT())          // Graphviz DOT with parameter counts
svg, _ := g.SVG("model.svg")  // render to SVG

// Timing-annotated: nodes colored green→yellow→red by execution time.
g.EnableProfiling()
g.Forward(input)
g.SVGWithProfile("profile.svg")

// Training curves as self-contained HTML (open in any browser).
g.PlotHTML("training.html", "loss", "head")  // expands TagGroups
g.ExportTrends("metrics.csv", "loss")        // CSV for external tools
g.WriteLog("training.log", 100, "loss")      // text log with ETA

Node shapes indicate type (input, output, loop, map, switch, activation, normalization). Parameter counts appear on each node. Forward-ref state loops are shown as dotted edges. Profiled graphs show per-node durations and parallelism efficiency per level.

Numerical Verification

Every differentiable path is verified against finite-difference gradients:

  • 40 autograd op-level checks (every op + compositions)
  • 10 module-level checks (every NN module, input + parameter gradients)
  • 11 exact optimizer step verifications (SGD, Adam, AdamW)
  • 482 tests total, all passing under the race detector

Why Go for Deep Learning?

The dispatch overhead problem

Python adds ~3-5 us of framework overhead to every GPU operation (interpreter, GIL, argument parsing, dispatch chain). For large operations like a 1024x1024 matmul, this is noise. For architectures built on many small sequential operations — recurrent steps, iterative refinement, multi-head attention with independent heads — this overhead dominates. The GPU starves between kernel launches.

Python's Global Interpreter Lock prevents parallel kernel dispatch. Independent model branches must dispatch kernels sequentially from a single thread, even when the GPU has dozens of idle Streaming Multiprocessors.

torch.compile partially addresses this by tracing and fusing operations, but it breaks on data-dependent control flow and requires recompilation when loop counts or branch structure change — exactly the dynamic architectures that need help most.

Why Go and not C++ or Rust?

Not C++ because writing and iterating on model architectures in C++ is slow and error-prone. Go provides compiled-language performance with a much shorter feedback loop: fast compilation, simple tooling, readable code.

Not Rust because Rust's main advantage — the borrow checker — cannot reason about tensor memory in libtorch's C allocator. Meanwhile Go has concrete advantages: goroutines are simpler than async for parallel dispatch, compilation is seconds not minutes, the code reads close to pseudocode, and the tooling (go test, go build, go vet) just works.

A framework nobody uses solves nothing. Go hits the right trade-off between performance and accessibility for this domain.

See docs/design/cuda-dispatch.md for the full dispatch overhead analysis.

Architecture

+-----------------------------------------------------------+
|  User Code / Model Definitions                            |
+-----------------------------------------------------------+
|  graph/    Fluent builder, parallel execution, DOT/SVG    |
+-----------------------------------------------------------+
|  nn/       Modules, losses, optimizers, checkpoints       |
+-----------------------------------------------------------+
|  autograd/ Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
|  tensor/   Immutable chainable API, CPU + CUDA            |
+-----------------------------------------------------------+
|  internal/libtorch/   CGo bindings to libtorch C++        |
+-----------------------------------------------------------+
|  libtorch / CUDA / ROCm / MPS / CPU                       |
+-----------------------------------------------------------+

The same GPU kernels that power PyTorch run the actual math. goDl replaces everything above them: the dispatch path, autograd tracking, operator composition, and execution scheduling.

Since goDl binds libtorch — not CUDA directly — it inherits libtorch's backend support: NVIDIA (CUDA), AMD (ROCm), Intel (XPU), Apple Silicon (MPS), and CPU. Switching hardware is a build flag, not a code change.

Documentation

PyTorch Migration

Coming from PyTorch? The PyTorch → goDl Migration Guide has side-by-side examples for every common operation — tensors, autograd, modules, losses, optimizers, training loops, and the graph builder.

Tutorials

Step-by-step guides from basics to advanced, each with runnable examples:

  1. Tensors — creation, ops, chaining, error handling
  2. Autograd — variables, gradients, backward pass
  3. Modules — Linear, Conv2d, normalization, RNN cells
  4. Training — losses, optimizers, data loading, full loop
  5. Graph Builder — the fluent API from simple to complex
  6. Advanced Graphs — forward refs, loops, gates, switches
  7. Visualization — DOT/SVG output, reading diagrams
  8. Utilities — checkpoints, clipping, freezing, initialization

Design

Examples

License

goDl is open-sourced software licensed under the MIT license.

Directories

Path Synopsis
autograd Package autograd provides reverse-mode automatic differentiation.
examples/showcase Package showcase demonstrates every method of the graph fluent builder API in a single coherent graph.
graph Package graph provides a composable execution graph where nodes are Modules.
internal/libtorch gc_callback.go — CUDA OOM → Go GC bridge.
nn Package nn provides neural network layers, loss functions, and optimizers.
tensor Package tensor provides a safe, idiomatic Go tensor type built on libtorch.
