
Giant Swarm's observability-platform MCP server. Exposes Grafana (plus its
Mimir / Loki / Tempo / Alertmanager datasources) to MCP clients, with
per-caller tenant and role scoping derived from GrafanaOrganization CRs.
One MCP per Grafana instance. Authentication via MCP OAuth (Dex as IdP).
Authorization is resolved by Grafana from its org_mapping (which
observability-operator derives from
GrafanaOrganization.spec.rbac.{admins,editors,viewers}); the MCP
asks Grafana per caller. Role and org-membership changes propagate
within ~30s (the per-caller cache TTL).
Roadmap
See docs/roadmap.md for the productionization
plan. docs/ARCHITECTURE.md is the
orientation doc: request flow, package layout, threat model, and
where to add a new tool.
MCP surface
All tools require role: viewer on the target org (Grafana evaluates
org_mapping against the caller's OIDC groups; the operator derives
org_mapping from GrafanaOrganization.spec.rbac). Write operations
are intentionally out of scope for this MCP.
Most tool handlers delegate to upstream
grafana/mcp-grafana — we add a
synthetic org argument and gfBinder resolves it to the org's
OrgID + datasource UID before delegating. Categories without a usable
upstream equivalent (Tempo, Alertmanager v2 alerts, list_orgs) stay
local. See internal/tools/doc.go for the per-category rationale.
Orgs & datasources
| Tool |
Backend |
Notes |
list_orgs |
(local CRs) |
Minimal projection (name / displayName / orgId / role / tenantTypes) |
list_datasources |
Grafana API |
/api/datasources; projected to id / uid / name / type |
get_datasource |
Grafana API |
/api/datasources/uid/{uid} full detail |
Dashboards
| Tool |
Backend |
Notes |
search_dashboards |
Grafana API |
/api/search, grouped by folder, page/pageSize over folders |
search_folders |
Grafana API |
/api/search?type=dash-folder |
get_dashboard_by_uid |
Grafana API |
Full dashboard JSON (usually 100s of KB — prefer the summary) |
get_dashboard_summary |
Grafana API |
Title / tags / vars / row+panel tree (NO queries) |
get_dashboard_panel_queries |
Grafana API |
Queries for one panel (by id or title substring) or all |
get_dashboard_property |
Grafana API |
Sub-tree of the dashboard JSON by RFC 6901 JSON Pointer |
generate_deeplink |
Grafana URL |
Builds /d/{uid}?orgId=…&from=…&to=…&viewPanel=…&var-… |
Metrics (Mimir)
| Tool |
DS proxy path |
query_prometheus |
api/v1/query[_range] |
query_prometheus_histogram |
histogram_quantile(...) wrapper around query_range |
list_prometheus_metric_names |
api/v1/label/__name__/values |
list_prometheus_label_names |
api/v1/labels |
list_prometheus_label_values |
api/v1/label/{label}/values |
list_prometheus_metric_metadata |
api/v1/metadata |
Alert rules (Mimir Ruler)
| Tool |
Backend |
alerting_manage_rules |
Grafana /api/prometheus/{datasourceUID}/api/v1/rules (delegated to upstream mcp-grafana); bound to the Mimir datasource. Useful operation is operation=list — get/versions require Grafana-managed RuleUIDs and don't work for Mimir-side rules. |
Known gaps (tracked in docs/roadmap.md): recording rules are dropped by upstream's projection; Loki rules are not exposed at all. Both Mimir and Loki rulers expose the same Prometheus-shape /prometheus/api/v1/rules endpoint and would unblock once upstream stops filtering recording rules.
Logs (Loki)
| Tool |
DS proxy path |
query_loki_logs |
loki/api/v1/query_range (returns nextStart cursor) |
query_loki_patterns |
loki/api/v1/patterns — log-pattern detection |
query_loki_stats |
loki/api/v1/index/stats |
list_loki_label_names |
loki/api/v1/labels |
list_loki_label_values |
loki/api/v1/label/{label}/values |
Traces (Tempo)
| Tool |
DS proxy path |
query_traces |
api/search |
list_tempo_tag_names |
api/v2/search/tags |
list_tempo_tag_values |
api/v2/search/tag/{tag}/values |
Alerts (Alertmanager)
| Tool |
DS proxy path |
list_alerts |
api/v2/alerts — paged, severity-sorted |
get_alert |
Single alert by fingerprint (derived from list_alerts) |
This MCP exposes only tools; LLM clients invoke them more reliably
than resource URIs or prompts.
Datasource selection is per-org: tools match datasources from
status.dataSources[] by name substring (mimir, loki, tempo,
alertmanager). The tenant header is already baked into the datasource JSON
by observability-operator, so the MCP only picks the right datasource and
lets Grafana apply the header.
Caller identity is propagated to Grafana via X-Grafana-User on every
downstream request so Grafana's audit log attributes to the OIDC subject
rather than the server-admin SA.
Response-size discipline
Tool responses that would exceed TOOL_MAX_RESPONSE_BYTES (default 128 KiB)
return a structured error payload:
{
"error": "response_too_large",
"bytes": 245760,
"limit": 131072,
"message": "response is 245760 bytes, exceeds 131072 byte limit",
"hint": "narrow the query: add label matchers, aggregate with sum/rate/topk, or shorten the time range"
}
LLM clients can react programmatically instead of silently truncating.
For endpoints where pagination is natural (logs, label-values, rule lists,
tag values, dashboards-by-folder, alerts) tools expose page/pageSize or
a nextStart cursor so callers can page forward without re-running the
whole query.
Metrics
Prometheus metrics served at /metrics on the observability port
(METRICS_ADDR, default :9091) — split from the MCP HTTP port so
kubelet probes and Prometheus scrapes keep working through the OAuth
graceful-drain on shutdown:
| Metric |
Type |
Labels |
mcp_tool_call_total |
counter |
tool |
mcp_tool_call_errors_total |
counter |
tool |
mcp_tool_call_duration_seconds |
histogram |
tool |
Per-Grafana-request observation lives on the OTEL span emitted by
internal/grafana.client.fetch — no separate aggregate counter.
Plus default Go and process collectors.
Tool error rate is errors_total / total per tool — the standard two-counter pattern. Error spans are marked Error regardless of whether the handler returned a Go error or an IsError result.
Tracing
OpenTelemetry tracing is wired via the standard OTEL_EXPORTER_OTLP_*
environment variables. When no endpoint is set, spans go to a no-op tracer
and the W3C trace-context propagator is still installed so incoming headers
are respected. Spans are emitted per tool call and per Grafana HTTP request.
Instrument middleware also writes a structured tool_call slog line
to the app logger (caller, tool, error, duration, trace_id, span_id) so
no-OTLP setups still get a queryable record. The cluster log pipeline
ships stderr to Loki; an MCP gateway can correlate via the trace IDs.
Transports
Three MCP transports are wired, selected via MCP_TRANSPORT:
| Transport |
When to use |
OAuth |
Listens on |
streamable-http (default) |
Remote deployment gated by OAuth, the shipping Helm-chart deployment. Works with claude mcp add --transport http …, mcp-inspector, browser clients. |
Required (mcp-oauth + Dex) |
$MCP_ADDR (default :8080), POST /mcp |
sse |
Remote deployment for MCP clients that still prefer SSE (text/event-stream). Identical auth and tool surface as streamable-http. |
Required |
$MCP_ADDR, GET /sse + POST /message |
stdio |
Local-dev and desktop-client integrations (Claude Desktop's command server entry, IDE plugins). No HTTP listener, no OAuth — the client is whoever spawned the process, so authz relies on the caller already having the right Grafana / Kubernetes context. |
None |
stdin/stdout |
OAuth is only meaningful for the network transports; stdio treats the
spawning process as fully trusted (same model as kubectl delegating to the
user's kubeconfig). Configuration env vars are shared across transports —
the Grafana / Dex / OAuth / observability settings apply identically.
Configuration
Env-var driven. Flags override env. See cmd/serve.go.
| Env var |
Required |
Purpose |
GRAFANA_URL |
yes |
Grafana base URL (in-cluster) |
GRAFANA_SA_TOKEN |
one-of |
Grafana server-admin SA token (see below). Production path. |
GRAFANA_BASIC_AUTH |
one-of |
user:password for the built-in admin — dev/bootstrap only when SA promotion is unavailable. Setting both GRAFANA_SA_TOKEN and this var is a startup error. |
OAUTH_DEX_ISSUER_URL |
yes |
Dex issuer (read by oauthconfig.DexFromEnv). |
OAUTH_DEX_CLIENT_ID |
yes |
Dex OAuth client |
OAUTH_DEX_CLIENT_SECRET |
yes |
Dex OAuth client secret. *_FILE variant supported. |
OAUTH_DEX_REDIRECT_URL |
no |
Provider callback URL. Defaults to $OAUTH_ISSUER/oauth/callback; only set if you need a non-canonical path. |
OAUTH_ISSUER |
yes |
Public issuer URL of this MCP |
OAUTH_ALLOW_INSECURE_HTTP |
no |
true to allow plain-HTTP OAuth flows (local dev only). Loopback issuers (http://localhost, http://127.0.0.1, http://[::1]) are accepted without this flag per RFC 8252. |
OAUTH_ALLOW_PUBLIC_CLIENT_REGISTRATION |
no |
true to open /oauth/register to unauthenticated callers (default false). Required for MCP CLI clients (Claude Code, mcp-inspector) that self-register at runtime. |
OAUTH_ENCRYPTION_KEY |
no |
AES-256 key for token-at-rest encryption; 44-char base64 (openssl rand -base64 32) or 64-char hex (openssl rand -hex 32). *_FILE variant supported. |
OAUTH_TRUSTED_AUDIENCES |
no |
CSV of OAuth client IDs whose tokens are accepted as if minted for this server — enables SSO token forwarding from muster or sibling MCPs. Tokens must still be signed by OAUTH_DEX_ISSUER_URL. Empty = own-tokens-only. |
OAUTH_TRUSTED_REDIRECT_SCHEMES |
no |
CSV of custom URI schemes accepted during public client registration (e.g. cursor,vscode). Loopback HTTPS is always allowed; javascript/data/file/ftp are rejected regardless. |
OAUTH_STORAGE_BACKEND |
no |
memory (default) or valkey (read by oauthconfig.StorageFromEnvWithPrefix("OAUTH_")) |
OAUTH_VALKEY_ADDR / _PASSWORD / _TLS |
no |
Required when OAUTH_STORAGE_BACKEND=valkey. OAUTH_VALKEY_PASSWORD accepts the *_FILE variant. |
MCP_TRANSPORT |
no |
streamable-http (default), sse, or stdio. Stdio has no HTTP surface and bypasses OAuth — developer-loop only. |
MCP_ADDR |
no |
Listen address for the MCP HTTP surface — /mcp, /sse, /message, /oauth/*, plus the OAuth discovery routes (default :8080). Ignored when MCP_TRANSPORT=stdio. |
METRICS_ADDR |
no |
Listen address for the observability surface — /metrics, /healthz, /readyz (default :9091). Split from MCP_ADDR so kubelet probes and Prometheus scrapes keep working through the OAuth graceful-drain on shutdown. |
TOOL_MAX_RESPONSE_BYTES |
no |
Cap on tool response body (default 131072; 0 = disabled) |
TOOL_TIMEOUT |
no |
Per-tool-call deadline (default 30s; 0 = disabled). Go duration syntax (500ms, 2m). A tool exceeding the deadline returns an IsError result with timeout text. |
OTEL_EXPORTER_OTLP_ENDPOINT |
no |
OTLP endpoint for span export; spans are no-op when unset |
OTEL_EXPORTER_OTLP_PROTOCOL |
no |
http/protobuf (default) or grpc |
POD_NAME / POD_NAMESPACE / NODE_NAME |
no |
Downward-API attributes added to OTEL resource when set |
DEBUG |
no |
true to enable debug logging |
LOG_FORMAT |
no |
json or text. Defaults to json when KUBERNETES_SERVICE_HOST is set, else text. |
MCP_ALLOW_STDIO_IN_CLUSTER |
no |
true to permit MCP_TRANSPORT=stdio inside Kubernetes (in-cluster integration tests only — stdio bypasses OAuth). |
⚠️ Tool-call arguments are logged verbatim in the tool_call audit slog
line and OTEL span attributes for forensics. Do not register tools that
take secrets as arguments — look credentials up server-side from the
caller identity instead.
Grafana service-account token
Phase 1 uses a single server-admin Grafana service account. In Grafana:
- Log in as a server admin.
- Go to Administration → General → Service accounts → Add.
- Assign the Grafana Admin server role. (Not the org-level Admin — that
would make the SA org-scoped and
X-Grafana-Org-Id would be ignored.)
- Generate a token.
- Put it in a Kubernetes Secret under the key
serviceAccountToken and
reference that secret via grafana.existingSecret in values.yaml.
The MCP performs a startup self-check calling GET /api/orgs; if the token
is not server-admin it fails to start.
Known phase-1 blast-radius limitation: one compromised MCP pod exposes
every Grafana org. Phase 2 narrows this by switching to per-org SA tokens
provisioned by the observability-operator (tracked in the plan).
Install
Create the required secrets first. The serviceAccountToken key holds the
Grafana server-admin service-account token (not a Kubernetes one — the
chart creates the K8s ServiceAccount itself via templates/serviceaccount.yaml).
kubectl -n observability create secret generic mcp-observability-platform-grafana \
--from-literal=serviceAccountToken=<grafana-sa-token>
kubectl -n observability create secret generic mcp-observability-platform-oauth \
--from-literal=clientSecret=<dex-client-secret>
Then install the chart:
helm install mcp-observability-platform ./helm/mcp-observability-platform \
--namespace observability --create-namespace \
--set grafana.url=https://grafana.example.com \
--set oauth.issuer=https://mcp-observability-platform.example.com \
--set oauth.dex.issuerUrl=https://dex.example.com \
--set oauth.dex.clientId=mcp-observability-platform
Example overlays live under helm/mcp-observability-platform/values-*.yaml:
| Values file |
Scenario |
values-memory.yaml |
Dev/test: memory-backed OAuth store, debug logging. |
values-valkey.yaml |
Prod: Valkey-backed OAuth store (durable, shared across replicas). |
values-rbac-minimal.yaml |
Externally-managed ServiceAccount + ClusterRoleBinding. |
values-autoscaling.yaml |
HPA + VPA (Initial) + PDB + NetworkPolicy (ingress + egress). |
Runtime tunables (tool timeout, response cap, OAuth trust config) live
under runtime: / oauth: in values.yaml and are delivered through a
ConfigMap mounted via envFrom. A checksum/config annotation on the
pod template rolls the deployment whenever those change.
Run locally (without Kubernetes)
go mod tidy
make build
./mcp-observability-platform serve
Layout
See docs/ARCHITECTURE.md for the package
layout table and what each package is responsible for.
giantswarm/mcp-oauth — OAuth resource-server library used here
giantswarm/observability-operator — owns the GrafanaOrganization CRD
giantswarm/mcp-prometheus, giantswarm/mcp-kubernetes — sibling MCPs
that set the conventions followed here