Documentation ¶
Overview ¶
Package mythos implements the OpenMythos recurrent-depth transformer architecture in gorch. Plan 0001 Phase 2.
Architecture (loose summary; see plan 0001 for the long version):
tokens → Embedding → Prelude (N standard blocks)
→ Recurrent loop: for t in 1..MaxLoopIters:
h_{t+1} = lti(h_t, e) + Block(h_t, e)
→ Coda (N standard blocks)
→ final RMSNorm → LM head
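The summary above can be read as plain control flow. The following is a minimal sketch in standard-library Go, not the package's implementation: `stage`, `mixer`, `forwardSketch`, `mapEach`, and `average` are hypothetical names, hidden states are bare `[]float64` slices rather than gorch tensors, and the embedding input `e` is omitted for brevity.

```go
package main

import "fmt"

// stage is a hypothetical stand-in for one block: hidden state in, hidden state out.
type stage func(h []float64) []float64

// mixer is a hypothetical stand-in for the LTI step: it combines the previous
// hidden state with the recurrent block's output.
type mixer func(prev, block []float64) []float64

// forwardSketch mirrors the summary: prelude blocks once, one shared block
// applied loopIters times with mixing after each pass, then coda blocks once.
func forwardSketch(h []float64, prelude []stage, recur stage, mix mixer, loopIters int, coda []stage) []float64 {
	for _, s := range prelude {
		h = s(h)
	}
	for t := 0; t < loopIters; t++ {
		h = mix(h, recur(h))
	}
	for _, s := range coda {
		h = s(h)
	}
	return h
}

// mapEach builds a toy stage that applies f to every element.
func mapEach(f func(float64) float64) stage {
	return func(h []float64) []float64 {
		out := make([]float64, len(h))
		for i, v := range h {
			out[i] = f(v)
		}
		return out
	}
}

// average is a toy mixer: an equal-weight blend of prev and block.
func average(prev, block []float64) []float64 {
	out := make([]float64, len(prev))
	for i := range prev {
		out[i] = 0.5*prev[i] + 0.5*block[i]
	}
	return out
}

func main() {
	addOne := mapEach(func(v float64) float64 { return v + 1 })
	double := mapEach(func(v float64) float64 { return 2 * v })
	out := forwardSketch([]float64{1}, []stage{addOne}, double, average, 2, []stage{double})
	fmt.Println(out)
}
```

The point of the sketch is that the recurrent stage is a single function value reused on every iteration, so raising `loopIters` adds compute without adding parameters.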
v1 = `mythos_tiny` (~5–10 M params). Bigger configs (mythos_1b, mythos_8b, mythos_1t) are out of scope until distributed training, activation checkpointing, and bf16 land — tracked in plan 0001 Phase 5.
Types ¶
type Config ¶
type Config struct {
// Architecture
VocabSize int // tokenizer vocabulary
Dim int // hidden size
// Attention
NumHeads int // number of query heads
NumKVHeads int // number of key/value heads (≤ NumHeads, must divide it)
MaxSeqLen int // RoPE cache length and causal-mask seq cap
UseMLA bool // false → GQA; true → MLA. v1 ships GQA only.
RopeBaseFreq float32
// Recurrent depth
PreludeLayers int // standard blocks run once before the loop
CodaLayers int // standard blocks run once after the loop
MaxLoopIters int // recurrent block iterations (v1: fixed; ACT is deferred)
LTIDampInit float32
// Mixture of Experts
NumExperts int
NumExpertsPerToken int // top-K
ExpertDim int
// RMSNorm
NormEps float32
// Training (default values; overridable per-run)
Dropout float32
}
Config captures every shape parameter for an OpenMythos model.
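The defaults track the OpenMythos config dataclass field by field. The TinyMythos / Mythos1B / Mythos8B presets are meant to mirror the canonical sizes from the source repo; of these, only TinyConfig is documented below, since the larger configs are out of scope for v1.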
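Several Config fields constrain each other. A minimal sketch of the checks, assuming a hypothetical `shapeConfig` struct that copies a few field names from Config: only the "NumKVHeads ≤ NumHeads and must divide it" rule comes from the field comments above. The other two checks (Dim divisible by NumHeads, top-K no larger than NumExperts) and the query-to-KV head mapping are common conventions assumed here, not confirmed behaviour of this package.

```go
package main

import (
	"errors"
	"fmt"
)

// shapeConfig is a hypothetical subset of Config, used only for this sketch.
type shapeConfig struct {
	Dim                int
	NumHeads           int
	NumKVHeads         int
	NumExperts         int
	NumExpertsPerToken int
}

// validate reports the first shape constraint that is violated.
func (c shapeConfig) validate() error {
	if c.NumHeads <= 0 || c.NumKVHeads <= 0 {
		return errors.New("NumHeads and NumKVHeads must be positive")
	}
	// From the Config comments: NumKVHeads ≤ NumHeads and must divide it.
	if c.NumKVHeads > c.NumHeads || c.NumHeads%c.NumKVHeads != 0 {
		return fmt.Errorf("NumKVHeads (%d) must divide NumHeads (%d)", c.NumKVHeads, c.NumHeads)
	}
	// Assumption: each head gets an equal slice of the hidden size.
	if c.Dim%c.NumHeads != 0 {
		return fmt.Errorf("Dim (%d) must be divisible by NumHeads (%d)", c.Dim, c.NumHeads)
	}
	// Assumption: top-K routing cannot select more experts than exist.
	if c.NumExpertsPerToken > c.NumExperts {
		return fmt.Errorf("NumExpertsPerToken (%d) exceeds NumExperts (%d)", c.NumExpertsPerToken, c.NumExperts)
	}
	return nil
}

// kvHeadFor maps a query head to the KV head it shares under the usual GQA
// grouping (consecutive query heads share one KV head). Call after validate.
func (c shapeConfig) kvHeadFor(queryHead int) int {
	groupSize := c.NumHeads / c.NumKVHeads
	return queryHead / groupSize
}

func main() {
	cfg := shapeConfig{Dim: 64, NumHeads: 8, NumKVHeads: 2, NumExperts: 4, NumExpertsPerToken: 2}
	fmt.Println(cfg.validate(), cfg.kvHeadFor(5))
}
```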
func TinyConfig ¶
TinyConfig is the v1 target: ~5–10 M parameters; trains end-to-end on TinyStories on a single Apple Silicon Mac in a day. Numbers match the table in plan 0001's "v1 scope decision" section.
vocabSize is provided by the caller — TinyStories' BPE vocab is ~5k, GPT-2's is 50257. Both work; pass whichever the data loader returns.
type LTIInjection ¶
type LTIInjection struct {
Dim int
DampLogit *g.Tensor // raw logits, shape (1, dim); pass through sigmoid for damp ∈ (0,1)
}
LTIInjection is the Linear-Time-Invariant stable mixing of the recurrent block's previous hidden state with the current block's output and the original embedding.
The OpenMythos paper writes this as
h_{t+1} = A · h_t + B · e + Block(h_t, e)
for learnable matrices A, B that satisfy a stability constraint (eigenvalues of A inside the unit disc, parameterised as the matrix exponential of -log eigenvalues). For the v1 port we simplify to a per-channel diagonal A and B, i.e. one learnable damping scalar per hidden dim. This preserves the stability property by squashing each scalar into (0, 1) via sigmoid, and keeps the math autograd-ready without introducing a matrix-exp op.
The full matrix-A parameterisation is captured as a follow-up in plan 0001 Phase 5 (scale-up). For mythos_tiny on TinyStories the scalar form should be enough to demonstrate the recurrent-depth claim — the architecture's value comes from the iterative refinement, not from the precise spectral structure of A.
Per-channel damping scalars: damp[i] ∈ (0, 1), initialised so that sigmoid(logit) equals the configured LTIDampInit (a logit of -1, for example, gives sigmoid(-1) ≈ 0.27); then h_{t+1} = damp ⊙ h_t + (1 - damp) ⊙ Block(h_t).
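Why the sigmoid keeps the recurrence stable can be seen on a single channel. The sketch below is an illustration under a strong simplifying assumption: the block output is held at a constant `target`, whereas the real Block depends on h_t. Under that assumption the update is a convex combination, so h_t - target shrinks by a factor of damp on every step. The names `iterate` and `sigmoid` are hypothetical helpers, not gorch functions.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// iterate applies h ← damp·h + (1-damp)·target for the given number of steps,
// with damp = sigmoid(logit) ∈ (0, 1). The constant target is an assumption
// made for illustration; it stands in for the block output.
func iterate(h0, target, logit float64, steps int) float64 {
	damp := sigmoid(logit)
	h := h0
	for t := 0; t < steps; t++ {
		h = damp*h + (1-damp)*target
	}
	return h
}

func main() {
	// With logit = -1 (damp ≈ 0.27) the state contracts toward the target.
	for _, steps := range []int{0, 1, 5, 50} {
		fmt.Printf("steps=%d h=%.6f\n", steps, iterate(10, 2, -1, steps))
	}
}
```

Because damp never reaches 0 or 1, neither term can be switched off entirely, and the state can never grow beyond the larger of its starting value and the block output.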
func NewLTIInjection ¶
func NewLTIInjection(dim int, dampInit float32) *LTIInjection
NewLTIInjection initialises the per-channel damping logits so that sigmoid(logit) = dampInit (normally cfg.LTIDampInit). The inverse sigmoid is log(p/(1-p)). For the default 0.5 damping that works out to 0: initialising at 0 gives equal weight to the previous hidden state and the new block contribution, and the optimiser then learns the per-channel mixing.
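The inverse-sigmoid initialisation described above can be sketched in a few lines. This is an illustration on plain `[]float32` slices, assuming a hypothetical helper `initDampLogits`; the real constructor stores the logits in a `*g.Tensor`.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// logit is the inverse of sigmoid on (0, 1): logit(p) = log(p / (1 - p)).
func logit(p float64) float64 { return math.Log(p / (1 - p)) }

// initDampLogits returns dim copies of logit(dampInit), so that applying
// sigmoid to any entry recovers dampInit. dampInit must lie strictly in (0, 1).
func initDampLogits(dim int, dampInit float32) []float32 {
	v := float32(logit(float64(dampInit)))
	out := make([]float32, dim)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	fmt.Println(initDampLogits(4, 0.5))
	fmt.Println(sigmoid(logit(0.27)))
}
```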
func (*LTIInjection) Apply ¶
func (l *LTIInjection) Apply(hPrev, hBlock *g.Tensor) *g.Tensor
Apply mixes h_prev with h_block:
damp = sigmoid(DampLogit)                     // per-channel ∈ (0, 1)
out  = damp ⊙ h_prev + (1 - damp) ⊙ h_block
Both inputs are (M, dim). Output (M, dim). Autograd-aware end-to-end via gorch's broadcast ops (MulB / SubB) and Sigmoid; gradient flows back to DampLogit through Sigmoid.
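For readers without gorch at hand, the same arithmetic can be written on plain slices. This is a sketch, not the package's code: `applyLTI` is a hypothetical name, the (M, dim) inputs are represented as `[][]float64`, and the loop over rows plays the role of the broadcast of the (1, dim) damping vector. It assumes both inputs have identical shapes.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// applyLTI computes out[m][i] = damp[i]*hPrev[m][i] + (1-damp[i])*hBlock[m][i]
// with damp = sigmoid(dampLogit). dampLogit has one entry per channel and is
// reused for every row, which is what broadcasting (1, dim) over (M, dim) does.
func applyLTI(dampLogit []float64, hPrev, hBlock [][]float64) [][]float64 {
	damp := make([]float64, len(dampLogit))
	for i, l := range dampLogit {
		damp[i] = sigmoid(l)
	}
	out := make([][]float64, len(hPrev))
	for m := range hPrev {
		out[m] = make([]float64, len(damp))
		for i, d := range damp {
			out[m][i] = d*hPrev[m][i] + (1-d)*hBlock[m][i]
		}
	}
	return out
}

func main() {
	hPrev := [][]float64{{2, 4}}
	hBlock := [][]float64{{4, 8}}
	// Logits of 0 give damp = 0.5, an equal blend of the two inputs.
	fmt.Println(applyLTI([]float64{0, 0}, hPrev, hBlock))
}
```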
func (*LTIInjection) Parameters ¶
func (l *LTIInjection) Parameters() []*g.Tensor
Parameters returns the damping logits (the only learnable part).
type Mythos ¶
type Mythos struct {
Cfg Config
Embed *nn.Embedding
RoPE *nn.RoPE
Prelude []*TransformerBlock
Recur *TransformerBlock
LTI *LTIInjection
Coda []*TransformerBlock
Norm *nn.RMSNorm
}
Mythos is the top-level recurrent-depth transformer model.
Forward path:
tokens ─────► Embedding ─────► h0
h0 ─────► Prelude (PreludeLayers blocks) ─────► h_in
h_in ─────► Recurrent (MaxLoopIters iterations of one shared
block, each followed by LTI mixing) ─────► h_out
h_out ─────► Coda (CodaLayers blocks) ─────► h_final
h_final ─────► RMSNorm ─────► LMHead Linear ─────► logits (vocab)
v1 simplifications relative to the OpenMythos paper, captured in plan 0001 Phase 2 / 5:
- LTI: per-channel diagonal damping (sigmoid-of-logit), not the full matrix-A parameterisation.
- Recurrent block weights are SHARED across the MaxLoopIters iterations (depth-wise LoRA adapters are deferred — same as the OpenMythos repo's "USE_LORA = False" branch).
- No ACT halting: MaxLoopIters is a fixed Config field, the loop runs that many times. ACT's halting probabilities + cumulative surplus are deferred until v1 demonstrates the recurrent-depth benefit on TinyStories.
LM head is tied to the input embedding (HF-style weight tying, as in GPT-2; saves vocab*dim parameters).
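Weight tying means the same (vocab, dim) matrix is used twice: as a lookup table on the way in and, transposed, as the output projection. A minimal sketch on plain slices, with the hypothetical type `tiedEmbedding` standing in for the model's embedding and LM head:

```go
package main

import "fmt"

// tiedEmbedding is a hypothetical holder for one (vocab, dim) matrix that
// serves both as the input embedding and as the LM head.
type tiedEmbedding struct {
	table [][]float64 // one row of length dim per vocabulary entry
}

// embed looks up the row for a token id.
func (e tiedEmbedding) embed(token int) []float64 { return e.table[token] }

// logits projects a hidden vector onto every vocabulary row: logits[v] = h · table[v].
// No second (dim, vocab) matrix is needed, which is the vocab*dim saving.
func (e tiedEmbedding) logits(h []float64) []float64 {
	out := make([]float64, len(e.table))
	for v, row := range e.table {
		for i, w := range row {
			out[v] += w * h[i]
		}
	}
	return out
}

func main() {
	e := tiedEmbedding{table: [][]float64{{1, 0}, {0, 1}, {1, 1}}}
	h := e.embed(2)
	fmt.Println(e.logits(h))
}
```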
func New ¶
New builds a Mythos model from cfg. The shared RoPE table is built once and threaded into every attention sublayer so the same precomputed cos/sin pairs are used end-to-end.
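A shared RoPE table is typically a pair of cos/sin arrays indexed by position and frequency. The sketch below uses the standard RoPE formulation (inverse frequency base^(-2i/headDim)); whether nn.RoPE stores its cache in this layout is an assumption, `ropeTable` is a hypothetical name, and the base of 10000 is a common choice rather than a value taken from TinyConfig.

```go
package main

import (
	"fmt"
	"math"
)

// ropeTable precomputes cos and sin of pos * base^(-2i/headDim) for every
// position below maxSeqLen and every frequency index i below headDim/2.
// Building it once lets every attention layer index the same values.
func ropeTable(maxSeqLen, headDim int, base float64) (cos, sin [][]float64) {
	half := headDim / 2
	cos = make([][]float64, maxSeqLen)
	sin = make([][]float64, maxSeqLen)
	for pos := 0; pos < maxSeqLen; pos++ {
		cos[pos] = make([]float64, half)
		sin[pos] = make([]float64, half)
		for i := 0; i < half; i++ {
			invFreq := math.Pow(base, -2*float64(i)/float64(headDim))
			angle := float64(pos) * invFreq
			cos[pos][i] = math.Cos(angle)
			sin[pos][i] = math.Sin(angle)
		}
	}
	return cos, sin
}

func main() {
	cos, sin := ropeTable(8, 4, 10000)
	fmt.Println(len(cos), len(cos[0]), cos[1], sin[1])
}
```

With a table like this, a decoding step at absolute position startPos simply reads row startPos instead of recomputing the angles.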
func (*Mythos) Forward ¶
Forward runs the full model on a flat token-id slice. Returns logits of shape (seq, VocabSize). startPos is the absolute position of tokens[0] (relevant for KV-cached decoding; pass 0 for full-sequence training forward).
loopIters can override the config's MaxLoopIters at inference time (this is the core of the recurrent-depth ablation test). Pass -1 to use cfg.MaxLoopIters.
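The override rule is small enough to sketch directly. `resolveLoopIters` is a hypothetical helper, not a function of this package; the documentation only specifies -1 as the sentinel, so treating every negative value the same way is an assumption made here.

```go
package main

import "fmt"

// resolveLoopIters returns the number of recurrent iterations to run.
// A negative override (the documented sentinel is -1) falls back to the
// configured maximum; any other value is used as given.
func resolveLoopIters(loopIters, maxLoopIters int) int {
	if loopIters < 0 {
		return maxLoopIters
	}
	return loopIters
}

func main() {
	const maxLoopIters = 8 // hypothetical config value
	// A depth ablation sweeps the override while the weights stay fixed.
	for _, override := range []int{-1, 1, 2, 4} {
		fmt.Printf("override=%d runs %d iterations\n", override, resolveLoopIters(override, maxLoopIters))
	}
}
```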
func (*Mythos) Parameters ¶
Parameters returns every learnable tensor in the model.
type TransformerBlock ¶
TransformerBlock is the standard pre-norm OpenMythos block:
h = h + Attention(RMSNorm(h), startPos)
h = h + MoE(RMSNorm(h))
Uses GQA today. The UseMLA path is an open item in plan 0001 (completing autograd support for MLA) and is pinned by the package's tests. Both sublayers add residual connections in the standard pre-norm pattern. RoPE is applied to Q and K inside the attention module.
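The pre-norm residual pattern can be sketched on a single hidden vector. This is an illustration only: `rmsNorm` follows the standard RMSNorm definition (x / sqrt(mean(x²) + eps), scaled by gamma), and `sublayer`, `add`, and `preNormBlock` are hypothetical names, with attention and the MoE FFN passed in as opaque functions.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm scales x to unit root-mean-square, then applies the per-channel gain gamma.
func rmsNorm(x, gamma []float64, eps float64) []float64 {
	var sumSq float64
	for _, v := range x {
		sumSq += v * v
	}
	scale := 1 / math.Sqrt(sumSq/float64(len(x))+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v * scale * gamma[i]
	}
	return out
}

// sublayer is a hypothetical stand-in for attention or the MoE FFN.
type sublayer func(x []float64) []float64

func add(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// preNormBlock normalises the input of each sublayer and adds the sublayer
// output back onto the un-normalised residual stream.
func preNormBlock(h, gamma1, gamma2 []float64, eps float64, attn, ffn sublayer) []float64 {
	h = add(h, attn(rmsNorm(h, gamma1, eps)))
	h = add(h, ffn(rmsNorm(h, gamma2, eps)))
	return h
}

func main() {
	identity := func(x []float64) []float64 { return x }
	ones := []float64{1, 1}
	fmt.Println(preNormBlock([]float64{3, 4}, ones, ones, 1e-6, identity, identity))
}
```

Note that the residual stream itself is never normalised inside the block, which is why the full model applies a final RMSNorm before the LM head.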
One block has:
- 2 RMSNorm layers (gamma per layer)
- 1 GQA attention with Wq, Wk, Wv, Wo Linear projections
- 1 MoE FFN with router + N experts (each 3 Linear projections)
Plan 0001 Phase 2 deliverable.
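The "router + N experts" item above refers to top-K routing (NumExpertsPerToken in Config). A minimal sketch for one token follows. The names `expert`, `topK`, and `moe` are hypothetical, and renormalising with a softmax over only the selected logits is one common convention assumed here; the package may weight experts differently. The sketch assumes 1 ≤ k ≤ number of experts.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// expert is a hypothetical stand-in for one expert FFN.
type expert func(x []float64) []float64

// topK returns the indices of the k largest router logits (ties broken by
// lower index) and softmax weights computed over just those k logits.
func topK(logits []float64, k int) (idx []int, weights []float64) {
	order := make([]int, len(logits))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool { return logits[order[a]] > logits[order[b]] })
	idx = order[:k]
	weights = make([]float64, k)
	maxLogit := logits[idx[0]] // subtract the max for numerical stability
	var sum float64
	for j, e := range idx {
		weights[j] = math.Exp(logits[e] - maxLogit)
		sum += weights[j]
	}
	for j := range weights {
		weights[j] /= sum
	}
	return idx, weights
}

// moe runs only the selected experts and returns their weighted sum.
func moe(x, routerLogits []float64, experts []expert, k int) []float64 {
	idx, w := topK(routerLogits, k)
	out := make([]float64, len(x))
	for j, e := range idx {
		y := experts[e](x)
		for i := range out {
			out[i] += w[j] * y[i]
		}
	}
	return out
}

func main() {
	experts := make([]expert, 4)
	for i := range experts {
		scale := float64(i) // toy expert i multiplies its input by i
		experts[i] = func(x []float64) []float64 { return []float64{scale * x[0]} }
	}
	fmt.Println(moe([]float64{1}, []float64{0.1, 2.0, -1.0, 2.0}, experts, 2))
}
```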
func NewTransformerBlock ¶
func NewTransformerBlock(cfg Config, rope *nn.RoPE) *TransformerBlock
NewTransformerBlock builds a block sized to cfg.
func (*TransformerBlock) Forward ¶
Forward runs the block with residuals. x is (seq, dim); startPos is the absolute position of x[0] for RoPE. Output: (seq, dim).
func (*TransformerBlock) Parameters ¶
func (b *TransformerBlock) Parameters() []*g.Tensor
Parameters returns every learnable tensor in the block.