mcp-observability-platform

command module

v0.1.0 · Published: Apr 29, 2026 · License: Apache-2.0

Giant Swarm's observability-platform MCP server. Exposes Grafana (plus its Mimir / Loki / Tempo / Alertmanager datasources) to MCP clients, with per-caller tenant and role scoping derived from GrafanaOrganization CRs.

One MCP per Grafana instance. Authentication via MCP OAuth (Dex as IdP). Authorization is resolved by Grafana from its org_mapping (which observability-operator derives from GrafanaOrganization.spec.rbac.{admins,editors,viewers}); the MCP asks Grafana per caller. Role and org-membership changes propagate within ~30s (the per-caller cache TTL).

Roadmap

See docs/roadmap.md for the productionization plan. docs/ARCHITECTURE.md is the orientation doc: request flow, package layout, threat model, and where to add a new tool.

MCP surface

Tools

All tools require role: viewer on the target org (Grafana evaluates org_mapping against the caller's OIDC groups; the operator derives org_mapping from GrafanaOrganization.spec.rbac). Write operations are intentionally out of scope for this MCP.

Most tool handlers delegate to upstream grafana/mcp-grafana — we add a synthetic org argument and gfBinder resolves it to the org's OrgID + datasource UID before delegating. Categories without a usable upstream equivalent (Tempo, Alertmanager v2 alerts, list_orgs) stay local. See internal/tools/doc.go for the per-category rationale.

Orgs & datasources

Tool	Backend	Notes
`list_orgs`	(local CRs)	Minimal projection (name / displayName / orgId / role / tenantTypes)
`list_datasources`	Grafana API	`/api/datasources`; projected to id / uid / name / type
`get_datasource`	Grafana API	`/api/datasources/uid/{uid}` full detail

Dashboards

Tool	Backend	Notes
`search_dashboards`	Grafana API	`/api/search`, grouped by folder, page/pageSize over folders
`search_folders`	Grafana API	`/api/search?type=dash-folder`
`get_dashboard_by_uid`	Grafana API	Full dashboard JSON (usually 100s of KB — prefer the summary)
`get_dashboard_summary`	Grafana API	Title / tags / vars / row+panel tree (NO queries)
`get_dashboard_panel_queries`	Grafana API	Queries for one panel (by id or title substring) or all
`get_dashboard_property`	Grafana API	Sub-tree of the dashboard JSON by RFC 6901 JSON Pointer
`generate_deeplink`	Grafana URL	Builds `/d/{uid}?orgId=…&from=…&to=…&viewPanel=…&var-…`
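As a rough illustration of the URL shape `generate_deeplink` produces, the following sketch assembles such a link with the standard library. The `deeplinkParams` struct and `buildDeeplink` function are hypothetical; only the path and query-parameter names come from the table above. `url.Values.Encode` sorts keys, so the parameter order differs from the one listed, which does not change the meaning of the URL.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// deeplinkParams is an assumed input shape for illustration only.
type deeplinkParams struct {
	UID       string
	OrgID     int64
	From, To  string            // Grafana time expressions such as "now-1h"
	ViewPanel int               // 0 means no panel focus
	Vars      map[string]string // dashboard template variables
}

// buildDeeplink returns baseURL + /d/{uid} with the query parameters set.
func buildDeeplink(baseURL string, p deeplinkParams) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/d/" + p.UID
	q := url.Values{}
	q.Set("orgId", strconv.FormatInt(p.OrgID, 10))
	if p.From != "" {
		q.Set("from", p.From)
	}
	if p.To != "" {
		q.Set("to", p.To)
	}
	if p.ViewPanel > 0 {
		q.Set("viewPanel", strconv.Itoa(p.ViewPanel))
	}
	for name, value := range p.Vars {
		q.Set("var-"+name, value)
	}
	u.RawQuery = q.Encode() // keys are emitted in sorted order
	return u.String(), nil
}

func main() {
	link, err := buildDeeplink("https://grafana.example.com", deeplinkParams{
		UID: "abc123", OrgID: 2, From: "now-1h", To: "now",
		ViewPanel: 4, Vars: map[string]string{"cluster": "prod"},
	})
	fmt.Println(link, err)
}
```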

Metrics (Mimir)

Tool	DS proxy path
`query_prometheus`	`api/v1/query[_range]`
`query_prometheus_histogram`	`histogram_quantile(...)` wrapper around `query_range`
`list_prometheus_metric_names`	`api/v1/label/__name__/values`
`list_prometheus_label_names`	`api/v1/labels`
`list_prometheus_label_values`	`api/v1/label/{label}/values`
`list_prometheus_metric_metadata`	`api/v1/metadata`

Alert rules (Mimir Ruler)

Tool	Backend
`alerting_manage_rules`	Grafana `/api/prometheus/{datasourceUID}/api/v1/rules` (delegated to upstream `mcp-grafana`); bound to the Mimir datasource. Useful operation is `operation=list` — `get`/`versions` require Grafana-managed RuleUIDs and don't work for Mimir-side rules.

Known gaps (tracked in docs/roadmap.md): recording rules are dropped by upstream's projection, and Loki rules are not exposed at all. Both the Mimir and Loki rulers expose the same Prometheus-shaped /prometheus/api/v1/rules endpoint, so both gaps would be unblocked once upstream stops filtering recording rules.

Logs (Loki)

Tool	DS proxy path
`query_loki_logs`	`loki/api/v1/query_range` (returns `nextStart` cursor)
`query_loki_patterns`	`loki/api/v1/patterns` — log-pattern detection
`query_loki_stats`	`loki/api/v1/index/stats`
`list_loki_label_names`	`loki/api/v1/labels`
`list_loki_label_values`	`loki/api/v1/label/{label}/values`

Traces (Tempo)

Tool	DS proxy path
`query_traces`	`api/search`
`list_tempo_tag_names`	`api/v2/search/tags`
`list_tempo_tag_values`	`api/v2/search/tag/{tag}/values`

Alerts (Alertmanager)

Tool	DS proxy path
`list_alerts`	`api/v2/alerts` — paged, severity-sorted
`get_alert`	Single alert by fingerprint (derived from `list_alerts`)

This MCP exposes only tools; LLM clients invoke them more reliably than resource URIs or prompts.

Datasource selection is per-org: tools match datasources from status.dataSources[] by name substring (mimir, loki, tempo, alertmanager). The tenant header is already baked into the datasource JSON by observability-operator, so the MCP only picks the right datasource and lets Grafana apply the header.
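The name-substring match can be pictured as below. This is a sketch under assumptions: the `dataSource` struct mirrors one `status.dataSources[]` entry with guessed field names, and the case-insensitive comparison and first-match-wins behaviour are choices made for the example rather than documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// dataSource mirrors one status.dataSources[] entry; field names are assumed.
type dataSource struct {
	ID   int64
	Name string
}

// pickDatasource returns the first datasource whose name contains kind
// ("mimir", "loki", "tempo" or "alertmanager"), ignoring case.
func pickDatasource(sources []dataSource, kind string) (dataSource, bool) {
	kind = strings.ToLower(kind)
	for _, ds := range sources {
		if strings.Contains(strings.ToLower(ds.Name), kind) {
			return ds, true
		}
	}
	return dataSource{}, false
}

func main() {
	sources := []dataSource{
		{ID: 1, Name: "GS Mimir (org-a)"},
		{ID: 2, Name: "GS Loki (org-a)"},
	}
	ds, ok := pickDatasource(sources, "loki")
	fmt.Println(ds.Name, ok)
}
```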

Caller identity is propagated to Grafana via X-Grafana-User on every downstream request, so Grafana's audit log attributes each request to the OIDC subject rather than to the server-admin SA.

Response-size discipline

Tool responses that would exceed TOOL_MAX_RESPONSE_BYTES (default 128 KiB) return a structured error payload:

{
  "error": "response_too_large",
  "bytes": 245760,
  "limit": 131072,
  "message": "response is 245760 bytes, exceeds 131072 byte limit",
  "hint": "narrow the query: add label matchers, aggregate with sum/rate/topk, or shorten the time range"
}

This lets LLM clients react programmatically instead of receiving a silently truncated response.

For endpoints where pagination is natural (logs, label-values, rule lists, tag values, dashboards-by-folder, alerts) tools expose page/pageSize or a nextStart cursor so callers can page forward without re-running the whole query.

Metrics

Prometheus metrics served at /metrics on the observability port (METRICS_ADDR, default :9091) — split from the MCP HTTP port so kubelet probes and Prometheus scrapes keep working through the OAuth graceful-drain on shutdown:

Metric	Type	Labels
`mcp_tool_call_total`	counter	`tool`
`mcp_tool_call_errors_total`	counter	`tool`
`mcp_tool_call_duration_seconds`	histogram	`tool`

Per-Grafana-request observation lives on the OTEL span emitted by internal/grafana.client.fetch — no separate aggregate counter.

Plus default Go and process collectors.

Tool error rate is mcp_tool_call_errors_total / mcp_tool_call_total per tool, the standard two-counter pattern. Spans for failed calls are marked Error whether the handler returned a Go error or an IsError result.

Tracing

OpenTelemetry tracing is wired via the standard OTEL_EXPORTER_OTLP_* environment variables. When no endpoint is set, spans go to a no-op tracer, but the W3C trace-context propagator is still installed so incoming headers are respected. Spans are emitted per tool call and per Grafana HTTP request.

Instrument middleware also writes a structured tool_call slog line to the app logger (caller, tool, error, duration, trace_id, span_id) so no-OTLP setups still get a queryable record. The cluster log pipeline ships stderr to Loki; an MCP gateway can correlate via the trace IDs.

Transports

Three MCP transports are wired, selected via MCP_TRANSPORT:

Transport	When to use	OAuth	Listens on
`streamable-http` (default)	Remote deployment gated by OAuth, the shipping Helm-chart deployment. Works with `claude mcp add --transport http …`, mcp-inspector, browser clients.	Required (mcp-oauth + Dex)	`$MCP_ADDR` (default `:8080`), `POST /mcp`
`sse`	Remote deployment for MCP clients that still prefer SSE (`text/event-stream`). Same auth and tool surface as streamable-http.	Required	`$MCP_ADDR`, `GET /sse` + `POST /message`
`stdio`	Local-dev and desktop-client integrations (Claude Desktop's `command` server entry, IDE plugins). No HTTP listener, no OAuth — the client is whoever spawned the process, so authz relies on the caller already having the right Grafana / Kubernetes context.	None	stdin/stdout

OAuth is only meaningful for the network transports; stdio treats the spawning process as fully trusted (same model as kubectl delegating to the user's kubeconfig). Configuration env vars are shared across transports — the Grafana / Dex / OAuth / observability settings apply identically.
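The selection logic described by the table can be summarised in a few lines. This is an illustrative sketch only: `parseTransport`, `requiresOAuth`, and the constant names are hypothetical, while the accepted values and the default come from the table above.

```go
package main

import (
	"fmt"
	"os"
)

type transport string

const (
	transportStreamableHTTP transport = "streamable-http"
	transportSSE            transport = "sse"
	transportStdio          transport = "stdio"
)

// parseTransport validates an MCP_TRANSPORT value; empty selects the default.
func parseTransport(v string) (transport, error) {
	switch t := transport(v); t {
	case "":
		return transportStreamableHTTP, nil
	case transportStreamableHTTP, transportSSE, transportStdio:
		return t, nil
	default:
		return "", fmt.Errorf("unknown MCP_TRANSPORT %q", v)
	}
}

// requiresOAuth reports whether the transport is network-facing.
func requiresOAuth(t transport) bool { return t != transportStdio }

func main() {
	t, err := parseTransport(os.Getenv("MCP_TRANSPORT"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("transport=%s oauth=%v\n", t, requiresOAuth(t))
}
```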

Configuration

Env-var driven. Flags override env. See cmd/serve.go.

Env var	Required	Purpose
`GRAFANA_URL`	yes	Grafana base URL (in-cluster)
`GRAFANA_SA_TOKEN`	one-of	Grafana server-admin SA token (see below). Production path.
`GRAFANA_BASIC_AUTH`	one-of	`user:password` for the built-in admin — dev/bootstrap only when SA promotion is unavailable. Setting both `GRAFANA_SA_TOKEN` and this var is a startup error.
`OAUTH_DEX_ISSUER_URL`	yes	Dex issuer (read by `oauthconfig.DexFromEnv`).
`OAUTH_DEX_CLIENT_ID`	yes	Dex OAuth client
`OAUTH_DEX_CLIENT_SECRET`	yes	Dex OAuth client secret. `*_FILE` variant supported.
`OAUTH_DEX_REDIRECT_URL`	no	Provider callback URL. Defaults to `$OAUTH_ISSUER/oauth/callback`; only set if you need a non-canonical path.
`OAUTH_ISSUER`	yes	Public issuer URL of this MCP
`OAUTH_ALLOW_INSECURE_HTTP`	no	`true` to allow plain-HTTP OAuth flows (local dev only). Loopback issuers (`http://localhost`, `http://127.0.0.1`, `http://[::1]`) are accepted without this flag per RFC 8252.
`OAUTH_ALLOW_PUBLIC_CLIENT_REGISTRATION`	no	`true` to open `/oauth/register` to unauthenticated callers (default `false`). Required for MCP CLI clients (Claude Code, mcp-inspector) that self-register at runtime.
`OAUTH_ENCRYPTION_KEY`	no	AES-256 key for token-at-rest encryption; 44-char base64 (`openssl rand -base64 32`) or 64-char hex (`openssl rand -hex 32`). `*_FILE` variant supported.
`OAUTH_TRUSTED_AUDIENCES`	no	CSV of OAuth client IDs whose tokens are accepted as if minted for this server — enables SSO token forwarding from muster or sibling MCPs. Tokens must still be signed by `OAUTH_DEX_ISSUER_URL`. Empty = own-tokens-only.
`OAUTH_TRUSTED_REDIRECT_SCHEMES`	no	CSV of custom URI schemes accepted during public client registration (e.g. `cursor,vscode`). Loopback HTTPS is always allowed; `javascript`/`data`/`file`/`ftp` are rejected regardless.
`OAUTH_STORAGE_BACKEND`	no	`memory` (default) or `valkey` (read by `oauthconfig.StorageFromEnvWithPrefix("OAUTH_")`)
`OAUTH_VALKEY_ADDR` / `_PASSWORD` / `_TLS`	no	Required when `OAUTH_STORAGE_BACKEND=valkey`. `OAUTH_VALKEY_PASSWORD` accepts the `*_FILE` variant.
`MCP_TRANSPORT`	no	`streamable-http` (default), `sse`, or `stdio`. Stdio has no HTTP surface and bypasses OAuth — developer-loop only.
`MCP_ADDR`	no	Listen address for the MCP HTTP surface — `/mcp`, `/sse`, `/message`, `/oauth/*`, plus the OAuth discovery routes (default `:8080`). Ignored when `MCP_TRANSPORT=stdio`.
`METRICS_ADDR`	no	Listen address for the observability surface — `/metrics`, `/healthz`, `/readyz` (default `:9091`). Split from `MCP_ADDR` so kubelet probes and Prometheus scrapes keep working through the OAuth graceful-drain on shutdown.
`TOOL_MAX_RESPONSE_BYTES`	no	Cap on tool response body (default 131072; 0 = disabled)
`TOOL_TIMEOUT`	no	Per-tool-call deadline (default `30s`; `0` = disabled). Go duration syntax (`500ms`, `2m`). A tool exceeding the deadline returns an IsError result with timeout text.
`OTEL_EXPORTER_OTLP_ENDPOINT`	no	OTLP endpoint for span export; spans are no-op when unset
`OTEL_EXPORTER_OTLP_PROTOCOL`	no	`http/protobuf` (default) or `grpc`
`POD_NAME` / `POD_NAMESPACE` / `NODE_NAME`	no	Downward-API attributes added to OTEL resource when set
`DEBUG`	no	`true` to enable debug logging
`LOG_FORMAT`	no	`json` or `text`. Defaults to `json` when `KUBERNETES_SERVICE_HOST` is set, else `text`.
`MCP_ALLOW_STDIO_IN_CLUSTER`	no	`true` to permit `MCP_TRANSPORT=stdio` inside Kubernetes (in-cluster integration tests only — stdio bypasses OAuth).

⚠️ Tool-call arguments are logged verbatim in the tool_call audit slog line and OTEL span attributes for forensics. Do not register tools that take secrets as arguments — look credentials up server-side from the caller identity instead.

Grafana service-account token

Phase 1 uses a single server-admin Grafana service account. In Grafana:

1. Log in as a server admin.
2. Go to Administration → General → Service accounts → Add.
3. Assign the Grafana Admin server role. (Not the org-level Admin: that would make the SA org-scoped, and X-Grafana-Org-Id would be ignored.)
4. Generate a token.
5. Put it in a Kubernetes Secret under the key serviceAccountToken and reference that secret via grafana.existingSecret in values.yaml.

The MCP performs a startup self-check calling GET /api/orgs; if the token is not server-admin it fails to start.

Known phase-1 blast-radius limitation: one compromised MCP pod exposes every Grafana org. Phase 2 narrows this by switching to per-org SA tokens provisioned by the observability-operator (tracked in docs/roadmap.md).

Install

Create the required secrets first. The serviceAccountToken key holds the Grafana server-admin service-account token (not a Kubernetes one — the chart creates the K8s ServiceAccount itself via templates/serviceaccount.yaml).

kubectl -n observability create secret generic mcp-observability-platform-grafana \
  --from-literal=serviceAccountToken=<grafana-sa-token>

kubectl -n observability create secret generic mcp-observability-platform-oauth \
  --from-literal=clientSecret=<dex-client-secret>

Then install the chart:

helm install mcp-observability-platform ./helm/mcp-observability-platform \
  --namespace observability --create-namespace \
  --set grafana.url=https://grafana.example.com \
  --set oauth.issuer=https://mcp-observability-platform.example.com \
  --set oauth.dex.issuerUrl=https://dex.example.com \
  --set oauth.dex.clientId=mcp-observability-platform

Example overlays live under helm/mcp-observability-platform/values-*.yaml:

Values file	Scenario
`values-memory.yaml`	Dev/test: memory-backed OAuth store, debug logging.
`values-valkey.yaml`	Prod: Valkey-backed OAuth store (durable, shared across replicas).
`values-rbac-minimal.yaml`	Externally-managed ServiceAccount + ClusterRoleBinding.
`values-autoscaling.yaml`	HPA + VPA (Initial) + PDB + NetworkPolicy (ingress + egress).

Runtime tunables (tool timeout, response cap, OAuth trust config) live under runtime: / oauth: in values.yaml and are delivered through a ConfigMap mounted via envFrom. A checksum/config annotation on the pod template rolls the deployment whenever those change.

Run locally (without Kubernetes)

go mod tidy
make build
./mcp-observability-platform serve

Layout

See docs/ARCHITECTURE.md for the package layout table and what each package is responsible for.

Related

giantswarm/mcp-oauth — OAuth resource-server library used here
giantswarm/observability-operator — owns the GrafanaOrganization CRD
giantswarm/mcp-prometheus, giantswarm/mcp-kubernetes — sibling MCPs that set the conventions followed here

Directories

Path	Synopsis
cmd
internal
authz	Package authz decides which Grafana orgs a caller may act on, and with what Role.
authz/authztest	Package authztest provides a single Authorizer fake shared by the server, tools, and binder tests.
grafana	Package grafana is the Grafana adapter for this MCP.
observability	Package observability owns the Prometheus metrics + OTLP tracing init for this MCP.
server	Package server constructs the MCP server (tools-only surface), composes the tool-handler middleware stack, and provides streamable-HTTP / SSE transport wrappers and readiness probes.
server/middleware	Package middleware holds the tool-handler cross-cuts (instrumentation, auth gate, response cap, per-call deadline) wired through mcp-go's server.ToolHandlerMiddleware so they run on every call without per-handler boilerplate.
tools	alerting.go - RoleViewer, DSKindMimir.
