mtop

command module

v1.3.0 (latest) | Published: Jun 18, 2026 | License: MIT | Imports: 16 | Imported by: 0

Repository

github.com/eladser/mtop

README

One terminal window for whatever you're running locally (Ollama, llama.cpp, LM Studio, vLLM). It shows the loaded models and how much VRAM they're sitting on, the GPU, and the requests going through with their tok/s. Hit c to flip the middle pane to per-model stats.

It'll also kick out models that won't leave. Pick one, press u, gone. Ollama is supposed to unload idle models on its own and usually does, but every so often ollama ps shows something that expired ten minutes ago still parked on 8 gigs. mtop flags those as overdue. -idle-unload 15m clears them for you.
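To make "overdue" concrete, here is a minimal Go sketch. It is not mtop's code. It assumes the JSON shape of Ollama's GET /api/ps reply (name, size_vram, expires_at) and calls a model overdue once its expires_at is more than a grace period in the past. The names overdue, sampleBody, sampleNow and sampleGrace are made up for the example, and the one-minute grace is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// loadedModel holds the fields of one /api/ps entry that this sketch needs.
type loadedModel struct {
	Name      string    `json:"name"`
	SizeVRAM  int64     `json:"size_vram"`
	ExpiresAt time.Time `json:"expires_at"`
}

// psReply mirrors the top level of an /api/ps response.
type psReply struct {
	Models []loadedModel `json:"models"`
}

// Hypothetical sample data, not captured from a real server.
var (
	sampleBody = []byte(`{"models":[
		{"name":"a:latest","size_vram":8000000000,"expires_at":"2026-06-18T10:00:00Z"},
		{"name":"b:latest","size_vram":1000000000,"expires_at":"2026-06-18T10:30:00Z"}]}`)
	sampleNow   = time.Date(2026, 6, 18, 10, 10, 0, 0, time.UTC)
	sampleGrace = time.Minute // assumed slack before a model counts as overdue
)

// overdue returns the models whose expiry passed more than grace ago.
func overdue(body []byte, now time.Time, grace time.Duration) ([]loadedModel, error) {
	var ps psReply
	if err := json.Unmarshal(body, &ps); err != nil {
		return nil, err
	}
	var late []loadedModel
	for _, m := range ps.Models {
		if now.Sub(m.ExpiresAt) > grace {
			late = append(late, m)
		}
	}
	return late, nil
}

func main() {
	late, err := overdue(sampleBody, sampleNow, sampleGrace)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, m := range late {
		fmt.Printf("%s is overdue by %s and holds %d bytes of VRAM\n",
			m.Name, sampleNow.Sub(m.ExpiresAt), m.SizeVRAM)
	}
}
```

Ollama's API unloads a model when a generate request for it carries keep_alive set to 0, so a check like this pairs naturally with that call.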

Install

macOS or Linux:

brew install eladser/tap/mtop

Windows:

scoop bucket add eladser https://github.com/eladser/scoop-bucket
scoop install mtop

Or grab a binary from releases, or build it with go install github.com/eladser/mtop@latest.

Then run mtop. It looks for the usual servers on their usual ports, so there's no config file to write.

Seeing requests

Most local servers don't expose per-request numbers anywhere. The data exists only in the response, while it streams. So mtop runs a little pass-through proxy on 127.0.0.1:4321 and reads the numbers off the wire. Point your client at it:

OLLAMA_HOST=127.0.0.1:4321              # ollama clients
base_url = "http://127.0.0.1:4321/v1"   # openai-style clients

Your client gets the exact same bytes back. By default the proxy forwards to ollama; -target aims it at llama.cpp or LM Studio instead. The models and GPU panes don't need any of this.
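The pass-through idea can be sketched in Go with the standard library's reverse proxy. This is an illustration of the technique, not mtop's implementation: it only counts response bytes as they go by, where mtop reads token counts and timings. newTap, countingBody, demo and sampleJSON are hypothetical names, and the upstream here is a throwaway test server standing in for ollama.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// countingBody hands reads through unchanged and tallies the bytes seen.
type countingBody struct {
	io.ReadCloser
	seen *atomic.Int64
}

func (c countingBody) Read(p []byte) (int, error) {
	n, err := c.ReadCloser.Read(p)
	c.seen.Add(int64(n))
	return n, err
}

// newTap builds a reverse proxy to upstream that counts response bytes.
func newTap(upstream string, seen *atomic.Int64) (*httputil.ReverseProxy, error) {
	u, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	p := httputil.NewSingleHostReverseProxy(u)
	p.FlushInterval = -1 // flush after every write so streamed replies stay streamed
	p.ModifyResponse = func(r *http.Response) error {
		r.Body = countingBody{ReadCloser: r.Body, seen: seen}
		return nil
	}
	return p, nil
}

// sampleJSON is a made-up reply, not a real server response.
const sampleJSON = `{"response":"hi","done":true}`

// demo wires a fake upstream behind the tap and sends one request through it.
func demo() (string, int64, error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, sampleJSON)
	}))
	defer upstream.Close()

	var seen atomic.Int64
	tap, err := newTap(upstream.URL, &seen)
	if err != nil {
		return "", 0, err
	}
	front := httptest.NewServer(tap)
	defer front.Close()

	resp, err := http.Get(front.URL + "/api/generate")
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), seen.Load(), err
}

func main() {
	body, n, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("client received %q, tap counted %d bytes\n", body, n)
}
```

The client sees the upstream's bytes untouched because the wrapper only observes each Read; it never rewrites the buffer.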

The same port answers /metrics in prometheus format if you'd rather watch it from grafana.
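If you'd rather read /metrics from your own code than from grafana, a rough Go sketch follows. The metric names in sampleExposition are invented for the example, since this README doesn't list the ones mtop exports. The parser is deliberately naive: it handles plain "name value" and "name{labels} value" lines and ignores everything else in the Prometheus text format.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// sampleExposition uses hypothetical metric names, not mtop's real ones.
const sampleExposition = `# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total 3
example_tokens_per_second{model="qwen3:0.6b"} 798.1
`

// parseMetrics maps each series (name plus label set) to its value.
func parseMetrics(text string) map[string]float64 {
	out := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks, HELP and TYPE lines
		}
		// Split after the label set if there is one, else at the first space.
		cut := strings.LastIndexByte(line, '}') + 1
		if cut == 0 {
			cut = strings.IndexByte(line, ' ')
			if cut < 0 {
				continue
			}
		}
		fields := strings.Fields(line[cut:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		out[line[:cut]] = v
	}
	return out
}

// scrape fetches a metrics endpoint and parses the body.
func scrape(url string) (map[string]float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseMetrics(string(body)), nil
}

func main() {
	metrics := parseMetrics(sampleExposition)
	if len(os.Args) > 1 { // e.g. http://127.0.0.1:4321/metrics
		m, err := scrape(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		metrics = m
	}
	for name, v := range metrics {
		fmt.Printf("%s = %g\n", name, v)
	}
}
```

Run with no arguments it parses the built-in sample; pass the proxy's /metrics URL to read the live numbers.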

Keys

key	does
`↑`/`↓`, `k`/`j`	move the selection
`u`	unload the selected model
`c`	swap recent requests for per-model stats
`i`	inspector: the last request's prompt, completion, and timing breakdown (needs `-inspect`)
`q`	quit

Comparing models

mtop compare runs the same prompt past a few ollama models, one at a time so they're not fighting for the GPU, and prints how each did:

$ mtop compare "explain tcp in one sentence" qwen3:0.6b smollm2:135m
MODEL                         TOK/S      OUT      TOTAL
qwen3:0.6b                    798.1      175      446ms
smollm2:135m                  795.3       29     10.72s

tok/s is decode speed; TOTAL is wall-clock and includes the model load on the first call. Point it at another box with -ollama, or compare models on an OpenAI-style server with -openai http://127.0.0.1:8080/v1.

Flags

-ollama       ollama base url, or a comma list to watch a few boxes at once
-llamacpp     llama.cpp server url        (default http://127.0.0.1:8080, empty to skip)
-lmstudio     lm studio url               (default http://127.0.0.1:1234, empty to skip)
-vllm         vllm url                    (default http://127.0.0.1:8000, empty to skip)
-listen       proxy listen address        (default 127.0.0.1:4321)
-target       proxy upstream              (defaults to the ollama url)
-idle-unload  unload models idle this long, e.g. 15m (default off)
-notify       desktop notification when a gpu hits the alert line
-history      keep recent requests across restarts (~/.mtop/history.jsonl)
-mem-alert    gpu memory percent for the alert line (default 93)
-temp-alert   gpu temperature for the alert line (default 87)
-inspect      capture prompt and completion text for the inspector (i)
-no-proxy     don't run the proxy

The TOK/S line also tracks session energy: watt-hours used and tokens per watt-hour, from whole-GPU power. It's an estimate (the whole card, not just inference), so read it as a rough efficiency number, not a meter.
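The arithmetic behind a tokens-per-watt-hour figure is short enough to sketch in Go. This assumes power is sampled at a fixed interval and each sample stands for the whole interval; how mtop actually samples and integrates is not stated here, and the function names and sample values are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Invented readings: whole-GPU power in watts, one sample per interval.
var (
	sampleWatts    = []float64{300, 200, 100}
	sampleInterval = time.Minute
)

// wattHours sums power times elapsed hours across the samples.
func wattHours(watts []float64, interval time.Duration) float64 {
	var wh float64
	for _, w := range watts {
		wh += w * interval.Hours()
	}
	return wh
}

// tokensPerWattHour is tokens generated divided by energy used.
func tokensPerWattHour(tokens int, wh float64) float64 {
	if wh <= 0 {
		return 0
	}
	return float64(tokens) / wh
}

func main() {
	wh := wattHours(sampleWatts, sampleInterval)
	fmt.Printf("energy: %.2f Wh\n", wh)
	fmt.Printf("efficiency: %.0f tokens/Wh\n", tokensPerWattHour(5000, wh))
}
```

Three one-minute samples at 300, 200 and 100 W add up to 10 Wh, so 5000 tokens in that window works out to 500 tokens per watt-hour. Because the input is whole-card power, anything else using the GPU inflates the energy side.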

Every flag has an MTOP_* env var too, and ~/.mtop.conf can hold them so you're not retyping a homelab box:

MTOP_OLLAMA=http://homelab:11434
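One common way to layer a flag over an env var in Go is to feed the env value in as the flag's default. The sketch below does that for -ollama and MTOP_OLLAMA. The precedence (flag beats env beats built-in default) and the fallback URL are assumptions for the example, not something this README states, and ~/.mtop.conf is left out entirely. resolveOllama and assumedDefault are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// assumedDefault is Ollama's usual local address, assumed here as the fallback.
const assumedDefault = "http://127.0.0.1:11434"

// resolveOllama picks the ollama url: flag first, then env, then the default.
// lookup has the same shape as os.LookupEnv so tests can fake the environment.
func resolveOllama(args []string, lookup func(string) (string, bool)) (string, error) {
	def := assumedDefault
	if v, ok := lookup("MTOP_OLLAMA"); ok && v != "" {
		def = v
	}
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	ollama := fs.String("ollama", def, "ollama base url")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *ollama, nil
}

func main() {
	url, err := resolveOllama(os.Args[1:], os.LookupEnv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("ollama url:", url)
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the resolution logic testable without touching the real environment.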

FAQ

Does the proxy add latency? No. Bytes go straight through and the counters get read on the way past. Same response, same speed.

Requests pane says "none yet". Your client is talking to the server directly. Send it through the proxy and they'll show up. The models and GPU panes work regardless.

What's "overdue"? Ollama said it'd unload a model by a certain time and didn't. Press u, or set -idle-unload and stop thinking about it.

llama.cpp shows up thinner than ollama. Start it with --metrics (and --slots) for the kv-cache numbers. Without those flags it only hands out the model name.

AMD? Mac? AMD works if rocm-smi is installed. Apple Silicon gives you the unified-memory figure on its own. GPU utilization there comes from powermetrics, which needs root, so run mtop with sudo on a Mac if you want that number too.

tok/s looks different between ollama and openai-style requests. Ollama reports its own generation timings, so that number is real decode speed. OpenAI-style responses carry no timings, so mtop divides tokens by wall-clock time, which folds in prompt processing. Close, not identical.

Does it phone home? Never. No accounts, no telemetry, it only talks to your own servers. The proxy also turns away cross-origin and non-loopback callers, so a browser tab can't reach through it to your ollama.

MIT. If it's useful to you, ko-fi.com/eladser.

Directories

Path	Synopsis
internal/compare	Package compare runs the same prompt against a few models and reports how each one did.
internal/gpu	Package gpu reads NVIDIA stats by polling nvidia-smi's query interface.
internal/notify	Package notify pops a desktop notification, best effort, no dependencies.
internal/ollama
internal/proxy	Package proxy sits between any client and the inference server.
internal/sources	Package sources finds models on whatever local AI servers happen to be running.
internal/ui
