mythos

package

Version: v0.0.0-...-4326643 · Published: Apr 30, 2026 · License: MIT · Imports: 3 · Imported by: 0

Repository

github.com/vinq1911/gorch

Documentation ¶

Rendered for darwin/amd64

Overview ¶

Package mythos implements the OpenMythos recurrent-depth transformer architecture in gorch. Plan 0001 Phase 2.

Architecture (loose summary; see plan 0001 for the long version):

tokens → Embedding → Prelude (N standard blocks)
       → Recurrent loop: for t in 1..MaxLoopIters:
             h_{t+1} = lti(h_t, e) + Block(h_t, e)
       → Coda (N standard blocks)
       → final RMSNorm → LM head

v1 = `mythos_tiny` (~5–10 M params). Bigger configs (mythos_1b, mythos_8b, mythos_1t) are out of scope until distributed training, activation checkpointing, and bf16 land — tracked in plan 0001 Phase 5.

Index ¶

type Config
- func TinyConfig(vocabSize int) Config
- func (c Config) HeadDim() int
type LTIInjection
- func NewLTIInjection(dim int, dampInit float32) *LTIInjection
- func (l *LTIInjection) Apply(hPrev, hBlock *g.Tensor) *g.Tensor
- func (l *LTIInjection) Parameters() []*g.Tensor
type Mythos
- func New(cfg Config) *Mythos
- func (m *Mythos) Forward(tokens []int, startPos, loopIters int) *g.Tensor
- func (m *Mythos) Parameters() []*g.Tensor
type TransformerBlock
- func NewTransformerBlock(cfg Config, rope *nn.RoPE) *TransformerBlock
- func (b *TransformerBlock) Forward(x *g.Tensor, startPos int) *g.Tensor
- func (b *TransformerBlock) Parameters() []*g.Tensor

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	// Architecture
	VocabSize int // tokenizer vocabulary
	Dim       int // hidden size

	// Attention
	NumHeads     int  // number of query heads
	NumKVHeads   int  // number of key/value heads (≤ NumHeads, must divide it)
	MaxSeqLen    int  // RoPE cache length and causal-mask seq cap
	UseMLA       bool // false → GQA; true → MLA. v1 ships GQA only.
	RopeBaseFreq float32

	// Recurrent depth
	PreludeLayers int // standard blocks run once before the loop
	CodaLayers    int // standard blocks run once after the loop
	MaxLoopIters  int // recurrent block iterations (v1: fixed; ACT is deferred)
	LTIDampInit   float32

	// Mixture of Experts
	NumExperts         int
	NumExpertsPerToken int // top-K
	ExpertDim          int

	// RMSNorm
	NormEps float32

	// Training (default values; overridable per-run)
	Dropout float32
}

Config captures every shape parameter for an OpenMythos model.

The fields track the OpenMythos config dataclass field by field. Only the TinyConfig preset below ships in v1; the larger presets (mythos_1b, mythos_8b) mirror canonical sizes from the source repo but are out of scope for now, as noted in the overview.

func TinyConfig ¶

func TinyConfig(vocabSize int) Config

TinyConfig is the v1 target: ~5–10 M parameters; trains end-to-end on TinyStories on a single Apple Silicon Mac in a day. Numbers match the table in plan 0001's "v1 scope decision" section.

vocabSize is provided by the caller — TinyStories' BPE vocab is ~5k, GPT-2's is 50257. Both work; pass whichever the data loader returns.

func (Config) HeadDim ¶

func (c Config) HeadDim() int

HeadDim returns the per-head dimensionality (Dim / NumHeads).
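
As a rough illustration of the shape arithmetic, the sketch below recomputes the head dimension on a stand-in struct and checks the NumKVHeads constraint from the Config field comments. The miniConfig type, its validate helper, the requirement that Dim be a multiple of NumHeads, and the sizes used are all assumptions made for this example; none of them are taken from the package.

```go
package main

import (
	"errors"
	"fmt"
)

// miniConfig is a hypothetical stand-in for the attention fields of Config.
type miniConfig struct {
	Dim        int
	NumHeads   int
	NumKVHeads int
}

// headDim mirrors the documented definition: Dim / NumHeads.
func (c miniConfig) headDim() int { return c.Dim / c.NumHeads }

// validate is an assumed helper (not part of the package). It checks that
// Dim splits evenly across heads (an assumption of this sketch) and that
// NumKVHeads is at most NumHeads and divides it (from the field comment).
func (c miniConfig) validate() error {
	if c.NumHeads <= 0 || c.Dim%c.NumHeads != 0 {
		return errors.New("Dim must be a positive multiple of NumHeads")
	}
	if c.NumKVHeads <= 0 || c.NumKVHeads > c.NumHeads || c.NumHeads%c.NumKVHeads != 0 {
		return errors.New("NumKVHeads must divide NumHeads")
	}
	return nil
}

func main() {
	c := miniConfig{Dim: 256, NumHeads: 8, NumKVHeads: 2} // illustrative sizes only
	if err := c.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("head dim:", c.headDim(), "query heads per KV head:", c.NumHeads/c.NumKVHeads)
}
```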

type LTIInjection ¶

type LTIInjection struct {
	Dim       int
	DampLogit *g.Tensor // raw logits, shape (1, dim); pass through sigmoid for damp ∈ (0,1)
}

LTIInjection is the linear time-invariant (LTI) stable mixing of the recurrent loop's previous hidden state with the current block output. The paper's form also injects the original embedding e; the v1 Apply method takes only the two hidden states.

The OpenMythos paper writes this as

h_{t+1} = A · h_t + B · e + Block(h_t, e)

for learnable matrices A and B that satisfy a stability constraint (eigenvalues of A inside the unit disc, parameterised as the matrix exponential of -log eigenvalues). For the v1 port we simplify to a per-channel diagonal A and B, i.e. a learnable scalar damping per hidden dimension. Squashing each scalar into (0, 1) with a sigmoid preserves the stability property and keeps the math autograd-ready without introducing a matrix-exp op.

The full matrix-A parameterisation is captured as a follow-up in plan 0001 Phase 5 (scale-up). For mythos_tiny on TinyStories the scalar form should be enough to demonstrate the recurrent-depth claim — the architecture's value comes from the iterative refinement, not from the precise spectral structure of A.

Per-channel damping: damp[i] = sigmoid(DampLogit[i]) ∈ (0, 1), initialised from dampInit (for example, a logit of -1 gives sigmoid(-1) ≈ 0.27); then h_{t+1} = damp ⊙ h_t + (1 - damp) ⊙ Block(h_t).

func NewLTIInjection ¶

func NewLTIInjection(dim int, dampInit float32) *LTIInjection

NewLTIInjection initialises the per-channel damping logits so that sigmoid(logit) = dampInit (callers typically pass cfg.LTIDampInit). The logit is the inverse sigmoid, log(p/(1-p)). For a damping of 0.5 that works out to 0: initialising at 0 gives equal weight to the previous hidden state and the new block contribution, and the optimiser then learns the per-channel mixing.

func (*LTIInjection) Apply ¶

func (l *LTIInjection) Apply(hPrev, hBlock *g.Tensor) *g.Tensor

Apply mixes h_prev with h_block:

damp = sigmoid(DampLogit)            // per-channel ∈ (0, 1)
out  = damp ⊙ h_prev + (1 - damp) ⊙ h_block

Both inputs have shape (M, dim), as does the output; DampLogit has shape (1, dim) and is broadcast across the M rows. Autograd-aware end-to-end via gorch's broadcast ops (MulB / SubB) and Sigmoid; the gradient flows back to DampLogit through Sigmoid.
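
The mixing rule can be written out on plain slices to make the broadcast explicit. This is a minimal sketch without autograd: applyLTI and the [][]float32 layout are assumptions for illustration and do not use gorch tensors or its MulB / SubB ops.

```go
package main

import (
	"fmt"
	"math"
)

// applyLTI mixes hPrev and hBlock (both M rows of dim values) using one
// damping logit per channel, broadcast across all rows.
func applyLTI(dampLogit []float32, hPrev, hBlock [][]float32) [][]float32 {
	out := make([][]float32, len(hPrev))
	for m := range hPrev {
		out[m] = make([]float32, len(dampLogit))
		for i, l := range dampLogit {
			damp := float32(1 / (1 + math.Exp(-float64(l)))) // sigmoid
			out[m][i] = damp*hPrev[m][i] + (1-damp)*hBlock[m][i]
		}
	}
	return out
}

func main() {
	dampLogit := []float32{0, 0} // sigmoid(0) = 0.5 on both channels
	hPrev := [][]float32{{2, 4}, {6, 8}}
	hBlock := [][]float32{{0, 0}, {2, 4}}
	fmt.Println(applyLTI(dampLogit, hPrev, hBlock)) // prints [[1 2] [4 6]]
}
```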

func (*LTIInjection) Parameters ¶

func (l *LTIInjection) Parameters() []*g.Tensor

Parameters returns the damping logits (the only learnable part).

type Mythos ¶

type Mythos struct {
	Cfg     Config
	Embed   *nn.Embedding
	RoPE    *nn.RoPE
	Prelude []*TransformerBlock
	Recur   *TransformerBlock
	LTI     *LTIInjection
	Coda    []*TransformerBlock
	Norm    *nn.RMSNorm
}

Mythos is the top-level recurrent-depth transformer model.

Forward path:

tokens          ─────► Embedding ─────► h0
h0      ─────► Prelude (PreludeLayers blocks) ─────► h_in
h_in    ─────► Recurrent (MaxLoopIters iterations of one shared
                          block, each followed by LTI mixing) ─────► h_out
h_out   ─────► Coda (CodaLayers blocks) ─────► h_final
h_final ─────► RMSNorm ─────► LMHead Linear ─────► logits (vocab)

v1 simplifications relative to the OpenMythos paper, captured in plan 0001 Phase 2 / 5:

- LTI: per-channel diagonal damping (sigmoid-of-logit), not the full matrix-A parameterisation.
- Recurrent block weights are SHARED across the MaxLoopIters iterations. Depth-wise LoRA adapters are deferred, same as the OpenMythos repo's "USE_LORA = False" branch.
- No ACT halting: MaxLoopIters is a fixed Config field and the loop runs that many times. ACT's halting probabilities + cumulative surplus are deferred until v1 demonstrates the recurrent-depth benefit on TinyStories.

The LM head is tied to the input embedding (HF-style), which saves VocabSize × Dim parameters; GPT-2 ties its weights the same way.
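
Weight tying means the output projection reuses the embedding table, so logits are dot products of the final hidden vector with each embedding row. The sketch below shows that on plain slices; tiedLogits, the toy 3-token table, and dim = 256 in the savings calculation are assumptions for illustration (only the GPT-2 vocabulary size comes from the text).

```go
package main

import "fmt"

// tiedLogits scores a hidden vector against every row of the embedding
// table (logits = h · Eᵀ), reusing E instead of a separate head matrix.
func tiedLogits(embed [][]float32, h []float32) []float32 {
	logits := make([]float32, len(embed))
	for tok, row := range embed {
		for i, v := range row {
			logits[tok] += v * h[i]
		}
	}
	return logits
}

func main() {
	// Toy table: 3 tokens, 2 dimensions.
	embed := [][]float32{{1, 0}, {0, 1}, {1, 1}}
	h := embed[2] // pretend the final hidden state equals token 2's embedding
	fmt.Println(tiedLogits(embed, h)) // prints [1 1 2]

	vocab, dim := 50257, 256 // GPT-2 vocab from the text; dim is illustrative
	fmt.Println("parameters saved by tying:", vocab*dim)
}
```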

func New ¶

func New(cfg Config) *Mythos

New builds a Mythos model from cfg. The shared RoPE table is built once and threaded into every attention sublayer so the same precomputed cos/sin pairs are used end-to-end.

func (*Mythos) Forward ¶

func (m *Mythos) Forward(tokens []int, startPos, loopIters int) *g.Tensor

Forward runs the full model on a flat token-id slice. Returns logits of shape (seq, VocabSize). startPos is the absolute position of tokens[0] (relevant for KV-cached decoding; pass 0 for full-sequence training forward).

loopIters can override the config's MaxLoopIters at inference time (the recurrent-depth-ablation core test). Pass -1 to use cfg.MaxLoopIters.
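
The override rule can be sketched as below. resolveLoopIters and runLoop are hypothetical helpers, the halving step stands in for "shared block followed by LTI mixing", and how the package treats values other than -1 is not documented, so this sketch passes them through unchanged.

```go
package main

import "fmt"

// resolveLoopIters is a hypothetical helper: -1 selects the configured
// default, any other value is used as given (an assumption of this sketch).
func resolveLoopIters(loopIters, maxLoopIters int) int {
	if loopIters == -1 {
		return maxLoopIters
	}
	return loopIters
}

// runLoop applies the same step function n times, mirroring a recurrent
// loop whose block weights are shared across iterations.
func runLoop(h float64, n int, step func(float64) float64) float64 {
	for t := 0; t < n; t++ {
		h = step(h)
	}
	return h
}

func main() {
	const maxLoopIters = 4 // illustrative value, not taken from TinyConfig
	halve := func(h float64) float64 { return h / 2 } // stand-in for block + LTI mix
	for _, req := range []int{-1, 1, 8} {
		n := resolveLoopIters(req, maxLoopIters)
		fmt.Printf("requested=%d effective=%d h=%v\n", req, n, runLoop(256, n, halve))
	}
}
```

Varying the requested iteration count on a trained model while keeping the weights fixed is the depth ablation the paragraph above refers to.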

func (*Mythos) Parameters ¶

func (m *Mythos) Parameters() []*g.Tensor

Parameters returns every learnable tensor in the model.

type TransformerBlock ¶

type TransformerBlock struct {
	NormAttn *nn.RMSNorm
	Attn     *nn.GQA
	NormFFN  *nn.RMSNorm
	FFN      *nn.MoE
}

TransformerBlock is the standard pre-norm OpenMythos block:

h = h + Attention(RMSNorm(h), startPos)
h = h + MoE(RMSNorm(h))

Uses GQA today; the UseMLA path is an open item in plan 0001 (MLA autograd completion) and is pinned in the package tests. Both sublayers add residual connections in the standard pre-norm pattern. RoPE is applied inside the attention module to Q and K.

One block has:

- 2 RMSNorm layers (gamma per layer)
- 1 GQA attention with Wq, Wk, Wv, Wo Linear projections
- 1 MoE FFN with router + N experts (each 3 Linear projections)
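
For a rough sense of scale, the inventory above can be turned into a parameter count. The sketch assumes bias-free Linear layers, K/V projections of width NumKVHeads × HeadDim, a router that is a single Dim × NumExperts Linear, and experts made of three Dim × ExpertDim-sized Linears. These layout choices and the sizes used are assumptions, not values read from the package or from TinyConfig.

```go
package main

import "fmt"

// blockShape holds hypothetical sizes for one transformer block.
type blockShape struct {
	Dim, NumHeads, NumKVHeads, NumExperts, ExpertDim int
}

// blockParams counts weights under the assumptions listed in the lead-in:
// no biases, grouped K/V projections, one router Linear, three Linears per expert.
func blockParams(s blockShape) int {
	headDim := s.Dim / s.NumHeads
	kvDim := s.NumKVHeads * headDim

	norms := 2 * s.Dim                                // two RMSNorm gammas
	attn := 2*s.Dim*s.Dim + 2*s.Dim*kvDim             // Wq and Wo, plus Wk and Wv
	router := s.Dim * s.NumExperts                    // routing logits per expert
	experts := s.NumExperts * 3 * s.Dim * s.ExpertDim // three projections per expert
	return norms + attn + router + experts
}

func main() {
	s := blockShape{Dim: 256, NumHeads: 8, NumKVHeads: 2, NumExperts: 4, ExpertDim: 512}
	fmt.Println("approx. parameters per block:", blockParams(s))
}
```

Under these assumptions the expert weights dominate the count, which is typical of MoE feed-forward layers.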

Plan 0001 Phase 2 deliverable.

func NewTransformerBlock ¶

func NewTransformerBlock(cfg Config, rope *nn.RoPE) *TransformerBlock

NewTransformerBlock builds a block sized to cfg.

func (*TransformerBlock) Forward ¶

func (b *TransformerBlock) Forward(x *g.Tensor, startPos int) *g.Tensor

Forward runs the block with residuals. x is (seq, dim); startPos is the absolute position of x[0] for RoPE. Output: (seq, dim).

func (*TransformerBlock) Parameters ¶

func (b *TransformerBlock) Parameters() []*g.Tensor

Parameters returns every learnable tensor in the block.
