goDl v0.2.0 · Published: Mar 13, 2026 · License: MIT

goDl

A Go-native deep learning framework built on libtorch.
Same GPU kernels as PyTorch. No Python. No GIL. Just Go.

Status: Succeeded by floDl

goDl is no longer actively developed. Go's garbage collector cannot deterministically manage GPU (VRAM) memory, creating fundamental limitations for training workloads. floDl is a complete Rust port where Drop provides deterministic resource management. goDl remains available as-is under MIT license.


Graph Builder · Quick Start · Features · PyTorch → goDl · Tutorials · Architecture


The Graph Builder

goDl's fluent graph builder lets you describe complex architectures as readable data flow — no boilerplate, no graph construction commands.

model, err := graph.From(nn.MustLinear(2, 16)).   // input projection
    Through(nn.NewGELU()).                          // activation
    Through(nn.MustLayerNorm(16)).                  // normalization
    Also(nn.MustLinear(16, 16)).                    // residual connection
    Through(nn.MustLinear(16, 2)).                  // output projection
    Build()

That's a trainable model. Also adds the residual — input flows through the Linear and gets added to its output. Build() returns a graph.Graph that implements nn.Module — you can nest it inside other graphs.

Things get interesting when architectures get complex:

g, err := graph.From(encoder).Tag("encoded").                // tag for later
    Split(headA, headB, headC).Merge(graph.Mean()).          // multi-head + merge
    Loop(refinementBlock).For(3).Tag("refined").             // iterate 3 times
    Gate(router, expertA, expertB).Using("encoded").         // soft routing with context
    Switch(selector, lightPath, heavyPath).Using("refined"). // hard routing
    Through(graph.StateAdd()).Using("memory").Tag("memory").  // recurrent state
    Loop(decoder).While(haltCondition, 10).                  // adaptive computation
    Through(outputHead).
    Build()

Every construct — Split/Merge, Also, Loop, Gate, Switch, Map, Tag/Using — composes cleanly. Sub-graphs nest like any module. Forward references (Using before Tag) carry state across calls, enabling recurrent architectures without special-casing.

The graph executes nodes at the same topological level in parallel via goroutines — independent branches run concurrently without any extra code.
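To make that concurrency model concrete, here is a minimal sketch in plain Go (the `node` type and `runLevel` are illustrative stand-ins, not goDl API) of running one topological level's nodes in parallel with goroutines:

```go
package main

import (
	"fmt"
	"sync"
)

// node stands in for a graph node's forward computation.
type node func(x float64) float64

// runLevel executes every node of one topological level in its own
// goroutine and waits for all of them — the same pattern that lets
// independent branches run concurrently without extra user code.
func runLevel(level []node, x float64) []float64 {
	out := make([]float64, len(level))
	var wg sync.WaitGroup
	for i, n := range level {
		wg.Add(1)
		go func(i int, n node) {
			defer wg.Done()
			out[i] = n(x) // each index is written by exactly one goroutine
		}(i, n)
	}
	wg.Wait()
	return out
}

func main() {
	heads := []node{
		func(x float64) float64 { return x + 1 },
		func(x float64) float64 { return x * 2 },
		func(x float64) float64 { return x * x },
	}
	fmt.Println(runLevel(heads, 3)) // [4 6 9]
}
```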

See the Graph Builder Tutorial and the full showcase that exercises every builder method.

Quick Start

Requirements: Docker (with NVIDIA Container Toolkit for GPU support).

git clone https://github.com/fab2s/goDl.git
cd goDl
make image    # build dev container (Go + libtorch + CUDA)
make test     # run all 482 tests (CPU + CUDA)
make test-cpu # run without GPU
make doc      # local doc server (pkg.go.dev style)
make shell    # interactive shell in container

Train a model in 30 lines

// Task: learn cumulative sum — [a, b] → [a, a+b]

// Build the model.
model, err := graph.From(nn.MustLinear(2, 16)).
    Through(nn.NewGELU()).
    Through(nn.MustLayerNorm(16)).
    Also(nn.MustLinear(16, 16)).
    Through(nn.MustLinear(16, 2)).
    Build()

// Set up training.
optimizer := nn.NewAdam(model.Parameters(), 0.01)
model.SetTraining(true)

// Training loop.
for loader.Next() {
    input, target := loader.Batch()

    pred := model.Forward(autograd.NewVariable(input, true))
    loss := nn.MSELoss(pred, autograd.NewVariable(target, false))

    optimizer.ZeroGrad()
    loss.Backward()
    nn.ClipGradNorm(model.Parameters(), 1.0)
    optimizer.Step()
}

See examples/train/ for the complete runnable version with data generation and evaluation.

Features

Core Stack

Layer What it does
Tensor Immutable, chainable API with error propagation. CPU and CUDA.
Autograd Reverse-mode automatic differentiation. Full backward for every op.
NN Modules Linear, Conv2d, ConvTranspose2d, LayerNorm, BatchNorm, Dropout, Embedding, GRUCell, LSTMCell
Activations ReLU, Sigmoid, Tanh, GELU, SiLU, Softmax
Losses MSELoss, CrossEntropyLoss, BCEWithLogitsLoss, L1Loss, SmoothL1Loss, KLDivLoss
Optimizers SGD (with momentum), Adam, AdamW
LR Scheduling StepDecay, Cosine, Warmup (composable), ReduceOnPlateau
Mixed Precision Float16/BFloat16 dtype casting, GradScaler for loss scaling

Graph Builder

Method What it does
From(m).Through(m) Linear chain
Input(names...) Auxiliary graph inputs, accessible via Using(name) — multi-input graphs
Split(m...).Merge(op) Parallel branches, merged by Add(), Mean(), or Cat(dim)
Also(m) Residual connection: input + m(input)
Tag(name) / Using(refs...) Named references — backward (same pass) or forward (across calls)
Loop(body).For(n) Fixed iteration with BPTT
Loop(body).While(cond, max) Condition before body (0..max iterations)
Loop(body).Until(cond, max) Condition after body (1..max iterations)
Gate(router, experts...) Soft routing — all experts execute, weighted combination
Switch(selector, branches...) Hard routing — only selected branch executes
Map(body).Each() Apply body to each element along dim 0
Map(body).Over(tag) Iterate over a tagged tensor
Map(body).Slices(n) Decompose last dim into n slices, map, recompose
.Batched() Fast path for Map — full batch in one call
TagGroup(name) Name parallel branches: Split(...).TagGroup("head") → "head_0", "head_1", ...
g.ForwardCtx(ctx, inputs...) Context-aware execution — timeouts, cancellation for loops and maps

Training Tools

Tool What it does
nn.ClipGradNorm L2 norm gradient clipping
nn.ClipGradValue Element-wise gradient clamping
g.Freeze(tags...) / g.Unfreeze(tags...) Freeze parameters by tag name
nn.SaveParameters / nn.LoadParameters Binary checkpoint format (file path or io.Writer)
KaimingUniform/Normal, XavierUniform/Normal Weight initialization
data.Loader Batched data loading with parallel prefetch and shuffle
LR schedulers StepDecay, Cosine, Warmup, ReduceOnPlateau (composable)
nn.GradScaler Dynamic loss scaling for mixed precision (float16) training
nn.CastParameters Cast model parameters to any dtype (Float16, BFloat16, etc.)
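The clipping semantics are easy to state in plain Go. The sketch below mirrors the usual definition of global L2-norm clipping (as behind nn.ClipGradNorm); the real implementation operates on Variables rather than float slices:

```go
package main

import (
	"fmt"
	"math"
)

// clipGradNorm rescales all gradients in place so their global L2
// norm does not exceed maxNorm, returning the pre-clip norm.
func clipGradNorm(grads [][]float64, maxNorm float64) float64 {
	var sum float64
	for _, g := range grads {
		for _, v := range g {
			sum += v * v
		}
	}
	norm := math.Sqrt(sum)
	if norm > maxNorm {
		scale := maxNorm / norm
		for _, g := range grads {
			for i := range g {
				g[i] *= scale
			}
		}
	}
	return norm
}

func main() {
	grads := [][]float64{{3}, {4}} // global L2 norm = 5
	clipGradNorm(grads, 2.5)       // scale every gradient by 0.5
	fmt.Println(grads)             // [[1.5] [2]]
}
```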

Module Interfaces

Beyond Module and TrainToggler, modules can implement optional interfaces that the graph recognizes automatically:

Interface Method What happens
Resettable Reset(batchSize int64, device tensor.Device) Graph auto-calls before each Forward — modules with per-forward state (attention location, counter, accumulator) reset cleanly on the correct device
Traced Trace() *Variable Loop executor collects return value before first iteration and after each step — g.Traces(tag) returns the full trajectory
NamedInputModule ForwardNamed(stream, refs) Loop and node Using refs arrive as a named map instead of positional args
RefValidator RefNames() []string Build-time validation that exactly the expected Using refs are wired
SubModuler SubModules() []Module Declares child modules — framework walks the tree for device placement, training mode, state detachment, and reset
DeviceMover MoveToDevice(device) Moves non-parameter tensors (running stats, buffers) when SetDevice is called
Detachable Detach() Breaks gradient chains on retained state — called recursively by DetachState

These compose: a loop body that implements Resettable + Traced + NamedInputModule gets auto-reset, per-iteration trace collection, and named ref forwarding — all handled by the graph, no manual wiring.

For composite user modules, implementing SubModuler is all it takes — the framework handles parameter collection, device moves, training mode, and state detachment recursively.
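The detection mechanism itself is a plain Go type assertion. This sketch uses simplified stand-in interfaces (the names follow the README, but the signatures are not goDl's) to show how the graph can auto-reset a stateful module before each Forward:

```go
package main

import "fmt"

// Simplified stand-ins for the optional interfaces above.
type Module interface{ Forward(x float64) float64 }
type Resettable interface{ Reset() }

// counter carries per-forward state (a call count), so it should be
// reset before each Forward.
type counter struct{ calls float64 }

func (c *counter) Forward(x float64) float64 { c.calls++; return x + c.calls }
func (c *counter) Reset()                    { c.calls = 0 }

// forward mimics the graph runtime: if the module implements
// Resettable, Reset is called automatically before Forward.
func forward(m Module, x float64) float64 {
	if r, ok := m.(Resettable); ok {
		r.Reset()
	}
	return m.Forward(x)
}

func main() {
	c := &counter{}
	fmt.Println(forward(c, 10)) // 11 — state was reset
	fmt.Println(forward(c, 10)) // 11 again, not 12
}
```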

Tags double as observation points — collect metrics during training, flush to epoch history, and query trends to drive training decisions. Record injects external metrics (losses, hit rates) into the same pipeline:

for epoch := range epochs {
    for _, batch := range loader {
        pred := g.Forward(batch.Input)
        g.Collect("hidden")                                 // from graph tag

        loss := nn.CrossEntropyLoss(pred, target)
        g.Record("loss", loss.Item())                       // external metric
        g.Record("hit_rate", computeHitRate(pred, target))  // any float64
    }
    g.Flush()  // promotes batch means → epoch history (Collected + Recorded)

    if g.Trend("loss").Stalled(5, 1e-4) {
        scheduler.Decay()
    }
    if g.Trend("loss").Improving(3) {
        g.Unfreeze("decoder")
    }
}
Method What it does
g.Tagged(tag) Access a tagged node's output after Forward
g.Traces(tag) Access per-iteration side outputs from Traced loop bodies
g.Log(tags...) Print current tagged values (hookable via OnLog)
g.Collect(tags...) / g.Flush(tags...) Batch → epoch metric collection (from graph tags)
g.Record(tag, values...) Inject external metrics into the same Collect/Flush pipeline
g.Trend(tag) Epoch-level trend: Slope, Stalled, Improving, Converged
g.Trends(tags...) Group trends: AllImproving, AnyStalled, MeanSlope (expands TagGroups)
g.Sub(tag) Reach into a sub-graph's metrics — no extra Forward needed
g.ETA(totalEpochs) Estimated remaining wall-clock time from flush cadence
g.FlushCount() / g.Elapsed() How many flushes and total wall time since first
g.WriteLog(path, total, tags...) Human-readable text log with per-epoch metrics and ETA
trend.Latest() Most recent epoch value (convenience for Values()[len-1])
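One plausible definition behind Slope and Stalled — shown here as a self-contained sketch; the actual goDl implementation may differ — is a least-squares fit over the last n epoch values:

```go
package main

import (
	"fmt"
	"math"
)

// slope fits a least-squares line to the last n values and returns
// its gradient (change per epoch).
func slope(vals []float64, n int) float64 {
	if n > len(vals) {
		n = len(vals)
	}
	w := vals[len(vals)-n:]
	var sx, sy, sxx, sxy float64
	for i, y := range w {
		x := float64(i)
		sx += x
		sy += y
		sxx += x * x
		sxy += x * y
	}
	fn := float64(len(w))
	return (fn*sxy - sx*sy) / (fn*sxx - sx*sx)
}

// stalled reports whether the recent trend is flat to within eps.
func stalled(vals []float64, n int, eps float64) bool {
	return math.Abs(slope(vals, n)) < eps
}

func main() {
	loss := []float64{1.0, 0.5, 0.31, 0.30, 0.30}
	fmt.Println(stalled(loss, 3, 1e-2)) // the last 3 epochs are nearly flat
}
```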

Tag Groups & Trend Groups

TagGroup names parallel branches from Split with auto-suffixed tags. Trends and TimingTrends expand groups for aggregate queries:

g, _ := graph.From(encoder).
    Split(headA, headB, headC).TagGroup("head").  // head_0, head_1, head_2
    Merge(graph.Mean()).
    Build()

// Training loop with group observation.
for epoch := range epochs {
    for _, batch := range loader {
        g.Forward(batch.Input)
        g.Collect("head_0", "head_1", "head_2")
    }
    g.Flush()

    if g.Trends("head").AllImproving(5) {
        fmt.Println("all heads improving")
    }
    if g.Trends("head").AnyStalled(5, 1e-4) {
        fmt.Println("at least one head stalled")
    }
    fmt.Printf("mean slope: %.4f\n", g.Trends("head").MeanSlope(5))
}

Context-Aware Execution

ForwardCtx threads Go's context.Context through the graph. Loops and maps check for cancellation between iterations — enabling wall-clock timeouts that Python cannot express:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
result := g.ForwardCtx(ctx, input)  // loops abort if time runs out

This is a Go-native advantage. Python's GIL prevents cooperative cancellation inside a forward pass, and torch.compile breaks on dynamic control flow. In goDl, context propagation is essentially free when unused (context.Background().Done() returns a nil channel, so the cancellation check never fires) and composes naturally with loops, maps, and parallel branches.

Combined with the observation layer, this enables patterns like trend-driven loop halts, adaptive computation time with hard deadlines, and graceful training interruption — all impossible or clunky in Python.

Visualization

fmt.Println(g.DOT())          // Graphviz DOT with parameter counts
svg, _ := g.SVG("model.svg")  // render to SVG

// Timing-annotated: nodes colored green→yellow→red by execution time.
g.EnableProfiling()
g.Forward(input)
g.SVGWithProfile("profile.svg")

// Training curves as self-contained HTML (open in any browser).
g.PlotHTML("training.html", "loss", "head")  // expands TagGroups
g.ExportTrends("metrics.csv", "loss")        // CSV for external tools
g.WriteLog("training.log", 100, "loss")      // text log with ETA

Node shapes indicate type (input, output, loop, map, switch, activation, normalization). Parameter counts appear on each node. Forward-ref state loops are shown as dotted edges. Profiled graphs show per-node durations and parallelism efficiency per level.

Numerical Verification

Every differentiable path is verified against finite-difference gradients:

  • 40 autograd op-level checks (every op + compositions)
  • 10 module-level checks (every NN module, input + parameter gradients)
  • 11 exact optimizer step verifications (SGD, Adam, AdamW)
  • 482 tests total, all passing under the race detector

Why Go for Deep Learning?

The dispatch overhead problem

Python adds ~3-5 us of framework overhead to every GPU operation (interpreter, GIL, argument parsing, dispatch chain). For large operations like a 1024x1024 matmul, this is noise. For architectures built on many small sequential operations — recurrent steps, iterative refinement, multi-head attention with independent heads — this overhead dominates. The GPU starves between kernel launches.

Python's Global Interpreter Lock prevents parallel kernel dispatch. Independent model branches must dispatch kernels sequentially from a single thread, even when the GPU has dozens of idle Streaming Multiprocessors.

torch.compile partially addresses this by tracing and fusing operations, but it breaks on data-dependent control flow and requires recompilation when loop counts or branch structure change — exactly the dynamic architectures that need help most.

Why Go and not C++ or Rust?

Not C++ because writing and iterating on model architectures in C++ is slow and error-prone. Go provides compiled-language performance with a much shorter feedback loop: fast compilation, simple tooling, readable code.

Not Rust because Rust's main advantage — the borrow checker — cannot reason about tensor memory in libtorch's C allocator. Meanwhile Go has concrete advantages: goroutines are simpler than async for parallel dispatch, compilation is seconds not minutes, the code reads close to pseudocode, and the tooling (go test, go build, go vet) just works.

A framework nobody uses solves nothing. Go hits the right trade-off between performance and accessibility for this domain.

See docs/design/cuda-dispatch.md for the full dispatch overhead analysis.

Architecture

+-----------------------------------------------------------+
|  User Code / Model Definitions                            |
+-----------------------------------------------------------+
|  graph/    Fluent builder, parallel execution, DOT/SVG    |
+-----------------------------------------------------------+
|  nn/       Modules, losses, optimizers, checkpoints       |
+-----------------------------------------------------------+
|  autograd/ Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
|  tensor/   Immutable chainable API, CPU + CUDA            |
+-----------------------------------------------------------+
|  internal/libtorch/   CGo bindings to libtorch C++        |
+-----------------------------------------------------------+
|  libtorch / CUDA / ROCm / MPS / CPU                       |
+-----------------------------------------------------------+

The same GPU kernels that power PyTorch run the actual math. goDl replaces everything above them: the dispatch path, autograd tracking, operator composition, and execution scheduling.

Since goDl binds libtorch — not CUDA directly — it inherits libtorch's backend support: NVIDIA (CUDA), AMD (ROCm), Intel (XPU), Apple Silicon (MPS), and CPU. Switching hardware is a build flag, not a code change.

Documentation

PyTorch Migration

Coming from PyTorch? The PyTorch → goDl Migration Guide has side-by-side examples for every common operation — tensors, autograd, modules, losses, optimizers, training loops, and the graph builder.

Tutorials

Step-by-step guides from basics to advanced, each with runnable examples:

  1. Tensors — creation, ops, chaining, error handling
  2. Autograd — variables, gradients, backward pass
  3. Modules — Linear, Conv2d, normalization, RNN cells
  4. Training — losses, optimizers, data loading, full loop
  5. Graph Builder — the fluent API from simple to complex
  6. Advanced Graphs — forward refs, loops, gates, switches
  7. Visualization — DOT/SVG output, reading diagrams
  8. Utilities — checkpoints, clipping, freezing, initialization

Design

Examples

License

goDl is open-sourced software licensed under the MIT license.

Directories

Path Synopsis
autograd Package autograd provides reverse-mode automatic differentiation.
examples/showcase Package showcase demonstrates every method of the graph fluent builder API in a single coherent graph.
graph Package graph provides a composable execution graph where nodes are Modules.
internal/libtorch gc_callback.go — CUDA OOM → Go GC bridge.
nn Package nn provides neural network layers, loss functions, and optimizers.
tensor Package tensor provides a safe, idiomatic Go tensor type built on libtorch.
