Documentation ¶
Overview ¶
Package monitoring — alert rule and Alertmanager config generators.
This file owns the generation of:
- otel-alerts.yml — Prometheus alert rules for OTEL Collector + Tempo
- alertmanager.yml — Alertmanager routing and receiver configuration
Both files are written into the target project's monitoring/ directory by nself build when MONITORING_ENABLED=true and MONITORING_TRACING_ENABLED=true.
On-call integration is controlled by NSELF_ONCALL_PROVIDER (pagerduty, opsgenie, or email). Set the corresponding key env var:
    NSELF_ONCALL_PROVIDER=pagerduty  ALERTMANAGER_PD_ROUTING_KEY=<key>
    NSELF_ONCALL_PROVIDER=opsgenie   ALERTMANAGER_OG_API_KEY=<key>
    NSELF_ONCALL_PROVIDER=email      ALERTMANAGER_ONCALL_EMAIL=<addr>
When NSELF_ONCALL_PROVIDER is unset or empty the receiver falls back to email using ALERTMANAGER_ONCALL_EMAIL (default: support@nself.org).
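The selection and fallback rules above can be sketched as a small lookup. This is a minimal illustration, not the package's actual code: resolveOncall is a hypothetical helper, the types are redeclared locally so the snippet compiles on its own, and the trimming and lower-casing of the provider value is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// OncallProvider mirrors the documented string type; it is redeclared here so
// the sketch is self-contained.
type OncallProvider string

const (
	OncallPagerDuty OncallProvider = "pagerduty"
	OncallOpsgenie  OncallProvider = "opsgenie"
	OncallEmail     OncallProvider = "email"
)

// defaultOncallEmail is the documented fallback address.
const defaultOncallEmail = "support@nself.org"

// resolveOncall is a hypothetical helper (not part of the package API). It
// returns the provider and the credential or address that goes with it, using
// getenv as the environment lookup.
func resolveOncall(getenv func(string) string) (OncallProvider, string) {
	// Assumption: the value is normalised before matching.
	raw := strings.ToLower(strings.TrimSpace(getenv("NSELF_ONCALL_PROVIDER")))
	switch OncallProvider(raw) {
	case OncallPagerDuty:
		return OncallPagerDuty, getenv("ALERTMANAGER_PD_ROUTING_KEY")
	case OncallOpsgenie:
		return OncallOpsgenie, getenv("ALERTMANAGER_OG_API_KEY")
	}
	// Unset, empty, or unrecognised values fall back to email.
	addr := getenv("ALERTMANAGER_ONCALL_EMAIL")
	if addr == "" {
		addr = defaultOncallEmail
	}
	return OncallEmail, addr
}

func main() {
	env := map[string]string{
		"NSELF_ONCALL_PROVIDER":   "opsgenie",
		"ALERTMANAGER_OG_API_KEY": "og-key",
	}
	provider, secret := resolveOncall(func(k string) string { return env[k] })
	fmt.Println(provider, secret)
}
```

Passing the lookup as a function (instead of calling os.Getenv directly) keeps the sketch easy to exercise with a plain map.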
Package monitoring generates monitoring-stack configuration files for the nSelf CLI. Prometheus, Loki, Promtail, and Alertmanager configs are all built here and written into the target project's `monitoring/` directory by `nself build` when MONITORING_ENABLED=true.
This package owns *config generation*, not the docker-compose service definitions (those live in docker-compose.monitoring.yml — or, soon, the nself-monitoring free plugin).
Index ¶
- func OTELCollectorComposeService(cfg *OTELCollectorConfig) map[string]interface{}
- func RenderAlertmanagerYAML(cfg *AlertmanagerConfig) ([]byte, error)
- func RenderLokiYAML(cfg *LokiConfig) ([]byte, error)
- func RenderOTELAlertsYAML(cfg *OTELAlertRulesConfig) ([]byte, error)
- func RenderOTELCollectorYAML(cfg *OTELCollectorConfig) ([]byte, error)
- func RenderPrometheusYAML(cfg *PrometheusConfig) ([]byte, error)
- func RenderPromtailYAML(cfg *PromtailConfig) ([]byte, error)
- func RenderTempoYAML(cfg *TempoConfig) ([]byte, error)
- func TempoComposeService(cfg *TempoConfig) map[string]interface{}
- type AlertReceiver
- type AlertmanagerConfig
- type LokiConfig
- type OTELAlertRulesConfig
- type OTELCollectorConfig
- type OncallProvider
- type PrometheusConfig
- type PromtailConfig
- type ScrapeTarget
- type TempoConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func OTELCollectorComposeService ¶ added in v1.0.12
func OTELCollectorComposeService(cfg *OTELCollectorConfig) map[string]interface{}
OTELCollectorComposeService returns the docker-compose service definition for the OTEL Collector.
The service block:
- Uses otel/opentelemetry-collector-contrib:latest
- Binds the config file from the monitoring/ directory
- Exposes OTLP gRPC (:4317) and OTLP HTTP (:4318) for apps to push spans
- Depends on Tempo being up before starting
func RenderAlertmanagerYAML ¶ added in v1.0.13
func RenderAlertmanagerYAML(cfg *AlertmanagerConfig) ([]byte, error)
RenderAlertmanagerYAML returns the rendered alertmanager.yml bytes for cfg.
func RenderLokiYAML ¶
func RenderLokiYAML(cfg *LokiConfig) ([]byte, error)
RenderLokiYAML returns loki.yml bytes for cfg.
func RenderOTELAlertsYAML ¶ added in v1.0.13
func RenderOTELAlertsYAML(cfg *OTELAlertRulesConfig) ([]byte, error)
RenderOTELAlertsYAML returns the rendered otel-alerts.yml bytes for cfg.
func RenderOTELCollectorYAML ¶ added in v1.0.12
func RenderOTELCollectorYAML(cfg *OTELCollectorConfig) ([]byte, error)
RenderOTELCollectorYAML returns the rendered otel-collector.yml bytes.
func RenderPrometheusYAML ¶
func RenderPrometheusYAML(cfg *PrometheusConfig) ([]byte, error)
RenderPrometheusYAML returns the rendered prometheus.yml bytes for cfg. Targets are emitted in stable order (alphabetical by job name) so repeated builds produce byte-identical output — critical for snapshot tests and for avoiding noisy diffs in generated config.
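One way to get the stable ordering described above is to sort a copy of the targets by job name before emitting them. The following is a sketch under that assumption; sortedByJob is a hypothetical helper and the ScrapeTarget type is trimmed to the one field the example needs.

```go
package main

import (
	"fmt"
	"sort"
)

// ScrapeTarget is a trimmed-down local copy with only the field used here.
type ScrapeTarget struct {
	JobName string
}

// sortedByJob returns a copy of targets ordered alphabetically by JobName.
// Sorting a copy leaves the caller's slice untouched.
func sortedByJob(targets []ScrapeTarget) []ScrapeTarget {
	out := make([]ScrapeTarget, len(targets))
	copy(out, targets)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].JobName < out[j].JobName
	})
	return out
}

func main() {
	targets := []ScrapeTarget{{JobName: "postgres"}, {JobName: "ai"}, {JobName: "nginx"}}
	for _, t := range sortedByJob(targets) {
		fmt.Println(t.JobName)
	}
}
```

Because the order depends only on the job names, two builds with the same set of targets emit them in the same sequence regardless of how the caller assembled the slice.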
func RenderPromtailYAML ¶
func RenderPromtailYAML(cfg *PromtailConfig) ([]byte, error)
RenderPromtailYAML returns promtail.yml bytes for cfg.
func RenderTempoYAML ¶ added in v1.0.12
func RenderTempoYAML(cfg *TempoConfig) ([]byte, error)
RenderTempoYAML returns the rendered tempo.yml bytes.
func TempoComposeService ¶ added in v1.0.12
func TempoComposeService(cfg *TempoConfig) map[string]interface{}
TempoComposeService returns the docker-compose service definition for Tempo as a map suitable for YAML marshalling into the monitoring compose fragment.
The service block:
- Uses grafana/tempo:latest
- Binds the config file from the monitoring/ directory generated by nself build
- Exposes HTTP (:3200) and OTLP gRPC (:4317)
- Persists trace data to a named volume
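For orientation, a service map with the properties listed above might look roughly like the sketch below. tempoService is a hypothetical stand-in for TempoComposeService; the config path inside the container, the volume name tempo_data, and the host-side bind path are illustrative assumptions, not values taken from the package.

```go
package main

import "fmt"

// TempoConfig is a local copy of the documented fields the sketch uses.
type TempoConfig struct {
	StoragePath  string
	HTTPPort     int
	OTLPGRPCPort int
}

// tempoService builds a docker-compose service definition as a plain map.
// Paths and the volume name are assumptions made for illustration.
func tempoService(cfg *TempoConfig) map[string]interface{} {
	return map[string]interface{}{
		"image":   "grafana/tempo:latest",
		"command": []string{"-config.file=/etc/tempo/tempo.yml"},
		"volumes": []string{
			// Bind-mount the generated config read-only.
			"./monitoring/tempo.yml:/etc/tempo/tempo.yml:ro",
			// Named volume that persists trace data.
			"tempo_data:" + cfg.StoragePath,
		},
		"ports": []string{
			fmt.Sprintf("%d:%d", cfg.HTTPPort, cfg.HTTPPort),
			fmt.Sprintf("%d:%d", cfg.OTLPGRPCPort, cfg.OTLPGRPCPort),
		},
	}
}

func main() {
	svc := tempoService(&TempoConfig{StoragePath: "/var/tempo", HTTPPort: 3200, OTLPGRPCPort: 4317})
	fmt.Println(svc["image"], svc["ports"])
}
```

Returning map[string]interface{} lets the caller merge the block into a larger compose document before marshalling it to YAML.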
Types ¶
type AlertReceiver ¶ added in v1.0.13
type AlertReceiver struct {
// Name is the receiver identifier referenced by routing rules.
Name string
// Provider selects the integration backend for this receiver.
// One of OncallPagerDuty, OncallOpsgenie, OncallEmail.
Provider OncallProvider
// EmailTo is the address used when Provider is OncallEmail.
// Reads ALERTMANAGER_ONCALL_EMAIL at runtime when empty.
EmailTo string
// PDRoutingKey is the PagerDuty Events API v2 routing key used when
// Provider is OncallPagerDuty. Reads ALERTMANAGER_PD_ROUTING_KEY.
PDRoutingKey string
// OGAPIKey is the Opsgenie API key used when Provider is OncallOpsgenie.
// Reads ALERTMANAGER_OG_API_KEY.
OGAPIKey string
}
AlertReceiver defines an Alertmanager notification receiver.
type AlertmanagerConfig ¶ added in v1.0.13
type AlertmanagerConfig struct {
// SMTPHost is the SMTP relay host:port for email notifications.
SMTPHost string
// SMTPFrom is the envelope-from address.
SMTPFrom string
// Receivers is the list of notification receivers.
Receivers []AlertReceiver
// GroupWait is how long Alertmanager buffers alerts before sending the
// first notification. Default: 30s.
GroupWait string
// GroupInterval is the interval between notifications for the same group.
// Default: 5m.
GroupInterval string
// RepeatInterval is how long Alertmanager waits before re-sending a
// notification for an alert that is still firing. Default: 4h.
RepeatInterval string
// MaintenanceWindowStart is the start time of the weekly maintenance window
// in HH:MM format (24h, UTC). Used to silence non-critical alerts during
// planned maintenance. Default: "02:00" (2 AM UTC Saturday).
MaintenanceWindowStart string
// MaintenanceWindowEnd is the end time of the maintenance window.
// Default: "04:00" (4 AM UTC Saturday).
MaintenanceWindowEnd string
// MaintenanceWindowDay is the day of week for the maintenance window.
// Default: "saturday".
MaintenanceWindowDay string
}
AlertmanagerConfig captures everything needed to render alertmanager.yml.
func DefaultAlertmanagerConfig ¶ added in v1.0.13
func DefaultAlertmanagerConfig() *AlertmanagerConfig
DefaultAlertmanagerConfig returns nSelf's out-of-the-box Alertmanager settings. The on-call receiver is selected from NSELF_ONCALL_PROVIDER:
- "pagerduty" — PagerDuty Events API v2 via ALERTMANAGER_PD_ROUTING_KEY
- "opsgenie" — Opsgenie Alerts API via ALERTMANAGER_OG_API_KEY
- "email" / "" (default) — SMTP via ALERTMANAGER_ONCALL_EMAIL
Warning and info severity routes go to the null receiver (logged, no notification). Maintenance window: Saturday 02:00–04:00 UTC.
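The maintenance window maps naturally onto an Alertmanager time interval. The sketch below shows how the three window fields could be rendered into a time_intervals fragment; it is not the package's renderer. withWindowDefaults and maintenanceInterval are hypothetical helpers, and the interval name "maintenance" is an assumption.

```go
package main

import "fmt"

// AlertmanagerConfig is a local copy of the three documented window fields.
type AlertmanagerConfig struct {
	MaintenanceWindowStart string
	MaintenanceWindowEnd   string
	MaintenanceWindowDay   string
}

// withWindowDefaults fills empty fields with the documented defaults
// (Saturday 02:00 to 04:00 UTC).
func withWindowDefaults(cfg AlertmanagerConfig) AlertmanagerConfig {
	if cfg.MaintenanceWindowStart == "" {
		cfg.MaintenanceWindowStart = "02:00"
	}
	if cfg.MaintenanceWindowEnd == "" {
		cfg.MaintenanceWindowEnd = "04:00"
	}
	if cfg.MaintenanceWindowDay == "" {
		cfg.MaintenanceWindowDay = "saturday"
	}
	return cfg
}

// maintenanceInterval renders an Alertmanager time_intervals fragment.
// The interval name "maintenance" is an assumption made for this sketch.
func maintenanceInterval(cfg AlertmanagerConfig) string {
	cfg = withWindowDefaults(cfg)
	return fmt.Sprintf(`time_intervals:
  - name: maintenance
    time_intervals:
      - weekdays: ['%s']
        times:
          - start_time: '%s'
            end_time: '%s'
`, cfg.MaintenanceWindowDay, cfg.MaintenanceWindowStart, cfg.MaintenanceWindowEnd)
}

func main() {
	fmt.Print(maintenanceInterval(AlertmanagerConfig{}))
}
```

In Alertmanager, a route can then list the interval name under mute_time_intervals so that matching alerts are not notified during the window.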
type LokiConfig ¶
type LokiConfig struct {
// RetentionPeriod is how long logs are kept. Loki accepts Go-style
// durations (e.g. "720h" = 30d, "2160h" = 90d).
RetentionPeriod string
// MaxChunkAge caps how long a chunk can stay open before forced flush.
MaxChunkAge string
// MultiTenantEnabled, when true, runs Loki in multi-tenant mode with
// per-tenant isolation. Each nSelf project becomes a tenant.
MultiTenantEnabled bool
// RuleEngineEnabled turns on the Loki rule engine for log-based alerts.
RuleEngineEnabled bool
}
LokiConfig captures the fields needed to render loki.yml.
func DefaultLokiConfig ¶
func DefaultLokiConfig() *LokiConfig
DefaultLokiConfig returns the settings nSelf uses out of the box: 30-day retention (720h), single-tenant, rule engine on. Retention was extended from 168h (7d) to 720h (30d) in S224 to satisfy production debugging requirements.
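Since retention is expressed in hours, a quick conversion helps when checking a value against a policy stated in days. The sketch below uses the standard library's time.ParseDuration; retentionDays is a hypothetical helper, not part of this package.

```go
package main

import (
	"fmt"
	"time"
)

// retentionDays parses a Go-style duration such as "720h" and reports it in
// whole days. Malformed input yields an error.
func retentionDays(period string) (int, error) {
	d, err := time.ParseDuration(period)
	if err != nil {
		return 0, fmt.Errorf("invalid retention period %q: %w", period, err)
	}
	return int(d.Hours() / 24), nil
}

func main() {
	for _, p := range []string{"168h", "720h", "2160h"} {
		days, err := retentionDays(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s = %d days\n", p, days)
	}
}
```

Note that time.ParseDuration has no day unit, so a value such as "30d" is rejected by this helper; the hour form is the one to use.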
type OTELAlertRulesConfig ¶ added in v1.0.13
type OTELAlertRulesConfig struct {
// SpanIngestDropRatio is the fraction below which the 5m span ingest rate
// must fall relative to the 30m baseline to trigger OtelSpanIngestDrop.
// Default: 0.5 (50% drop).
SpanIngestDropRatio float64
// DownDuration is the minimum time a target must be absent before the
// OtelCollectorDown / TempoDown alerts fire. Default: 5m.
DownDuration string
// SpanDropDuration is the minimum time the span ingest ratio must be low
// before OtelSpanIngestDrop fires. Default: 10m.
SpanDropDuration string
// ExportErrorDuration is the minimum time export errors must persist before
// TempoScrapeErrors fires. Default: 5m.
ExportErrorDuration string
}
OTELAlertRulesConfig carries parameterised thresholds for the OTEL alert rules file. Callers may override these to tighten or relax alert sensitivity.
func DefaultOTELAlertRulesConfig ¶ added in v1.0.13
func DefaultOTELAlertRulesConfig() *OTELAlertRulesConfig
DefaultOTELAlertRulesConfig returns conservative nSelf defaults: 50% span drop triggers a warning after 10 minutes; down alerts after 5 minutes.
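To make the SpanIngestDropRatio threshold concrete, the comparison can be written as "the 5m rate is below ratio times the 30m rate". The sketch below expresses that both as a numeric check and as a PromQL-style expression string. Both helpers are hypothetical, and the metric name is left to the caller because the rule file's actual metric is not stated here.

```go
package main

import "fmt"

// spanIngestDropped reports whether the short-window rate has fallen below
// ratio times the baseline rate. A zero or negative baseline never triggers,
// since there is nothing meaningful to compare against.
func spanIngestDropped(rate5m, rate30m, ratio float64) bool {
	if rate30m <= 0 {
		return false
	}
	return rate5m < ratio*rate30m
}

// dropExpr builds a PromQL-style comparison for a caller-supplied metric name.
func dropExpr(metric string, ratio float64) string {
	return fmt.Sprintf("rate(%s[5m]) < %g * rate(%s[30m])", metric, ratio, metric)
}

func main() {
	// With the default ratio of 0.5, 40 spans/s against a 100 spans/s
	// baseline counts as a drop; 60 spans/s does not.
	fmt.Println(spanIngestDropped(40, 100, 0.5))
	fmt.Println(spanIngestDropped(60, 100, 0.5))
	fmt.Println(dropExpr("spans_total", 0.5)) // "spans_total" is a placeholder name
}
```

Raising the ratio makes the alert more sensitive (a smaller dip triggers it); lowering it requires a deeper drop. SpanDropDuration then controls how long the condition must hold before the alert fires.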
type OTELCollectorConfig ¶ added in v1.0.12
type OTELCollectorConfig struct {
// TempoEndpoint is the OTLP HTTP endpoint the collector forwards to.
// Default: http://tempo:4318.
TempoEndpoint string
// OTLPGRPCPort is the gRPC port the collector exposes to apps.
// Default: 4317.
OTLPGRPCPort int
// OTLPHTTPPort is the HTTP port the collector exposes to apps.
// Default: 4318.
OTLPHTTPPort int
}
OTELCollectorConfig captures the fields needed to generate the OTEL Collector configuration. The collector receives OTLP from instrumented services and forwards spans to Tempo.
func DefaultOTELCollectorConfig ¶ added in v1.0.12
func DefaultOTELCollectorConfig() *OTELCollectorConfig
DefaultOTELCollectorConfig returns nSelf's out-of-the-box OTEL Collector config.
type OncallProvider ¶ added in v1.1.3
type OncallProvider string
OncallProvider identifies which on-call integration backend to use.
const (
	// OncallPagerDuty routes critical alerts via PagerDuty Events API v2.
	OncallPagerDuty OncallProvider = "pagerduty"
	// OncallOpsgenie routes critical alerts via Opsgenie Alerts API.
	OncallOpsgenie OncallProvider = "opsgenie"
	// OncallEmail routes critical alerts via SMTP. This is the default when
	// NSELF_ONCALL_PROVIDER is unset or not recognised.
	OncallEmail OncallProvider = "email"
)
type PrometheusConfig ¶
type PrometheusConfig struct {
// ScrapeInterval is the global default (e.g. "15s").
ScrapeInterval string
// EvaluationInterval is how often Prometheus evaluates alert rules.
EvaluationInterval string
// ExternalLabels apply to every time series (useful for multi-cluster).
ExternalLabels map[string]string
// AlertmanagerURL is the Alertmanager host:port; empty disables alerting.
AlertmanagerURL string
// RuleFiles are paths (inside the container) to load alert rules from.
RuleFiles []string
// Targets is the full list of scrape targets.
Targets []ScrapeTarget
}
PrometheusConfig captures everything needed to render prometheus.yml.
func Defaults ¶
func Defaults() *PrometheusConfig
Defaults returns a PrometheusConfig with nSelf's out-of-the-box settings. Callers append their own targets via `cfg.Targets = append(...)` before rendering.
type PromtailConfig ¶
type PromtailConfig struct {
// LokiURL is the Loki push endpoint. Inside the docker network this is
// typically "http://loki:3100/loki/api/v1/push".
LokiURL string
// TenantID is the X-Scope-OrgID sent with every push. Empty in
// single-tenant mode; required when Loki is multi-tenant.
TenantID string
// ProjectName is the nSelf project name; attached as a static label on
// every line so operators can distinguish multiple projects on one host.
ProjectName string
}
PromtailConfig captures the fields needed to render promtail.yml. Promtail reads Docker container logs and ships them to Loki with labels that let Grafana filter by plugin, tenant, and service.
func DefaultPromtailConfig ¶
func DefaultPromtailConfig(projectName string) *PromtailConfig
DefaultPromtailConfig returns the settings nSelf uses out of the box.
type ScrapeTarget ¶
type ScrapeTarget struct {
// JobName uniquely identifies the target in Prometheus. Use the plugin
// name (e.g. "ai", "mux") or the service name (e.g. "postgres", "nginx").
JobName string
// ServiceName is the docker-compose service name resolved via the Docker
// network DNS. Prometheus scrapes <ServiceName>:<Port><Path>.
ServiceName string
// Port is the plugin's HTTP port as declared in plugin.json.
Port int
// Path is the metrics endpoint. Defaults to /metrics if empty.
Path string
// Interval overrides the global scrape_interval when set. Empty = global.
Interval string
// Labels are additional static labels attached to every scraped series.
// Common use: {"plugin": "ai", "tier": "pro"}.
Labels map[string]string
// BearerToken, when set, is sent as an Authorization: Bearer header.
// Used by Hasura /v1/metrics which requires HASURA_GRAPHQL_METRICS_SECRET.
BearerToken string
}
ScrapeTarget describes one plugin or service that Prometheus should scrape. Every nSelf plugin exposes /metrics on its HTTP port; admins may add custom services with the same contract via ScrapeTargets in the monitoring config.
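The address rule in the field comments (scrape <ServiceName>:<Port><Path>, with Path defaulting to /metrics) can be sketched as follows. scrapeEndpoint is a hypothetical helper, the struct is a trimmed local copy, and the service names and ports in main are example values only.

```go
package main

import "fmt"

// ScrapeTarget is a trimmed local copy with the fields the sketch needs.
type ScrapeTarget struct {
	ServiceName string
	Port        int
	Path        string
}

// scrapeEndpoint returns the host:port address and the metrics path for a
// target, applying the documented /metrics default when Path is empty.
func scrapeEndpoint(t ScrapeTarget) (address, path string) {
	path = t.Path
	if path == "" {
		path = "/metrics"
	}
	return fmt.Sprintf("%s:%d", t.ServiceName, t.Port), path
}

func main() {
	// Example values; the real ports come from plugin.json or the service config.
	targets := []ScrapeTarget{
		{ServiceName: "ai", Port: 3001},
		{ServiceName: "hasura", Port: 8080, Path: "/v1/metrics"},
	}
	for _, t := range targets {
		addr, path := scrapeEndpoint(t)
		fmt.Println(addr + path)
	}
}
```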
func BuiltinTargets ¶
func BuiltinTargets() []ScrapeTarget
BuiltinTargets returns the monitoring targets that ship with every nSelf stack: Prometheus itself, node-exporter, cadvisor, postgres-exporter, plus Hasura /v1/metrics, MinIO cluster metrics, and Auth /metrics. Plugin targets are appended by the caller.
func TargetFromPlugin ¶
func TargetFromPlugin(name string, port int, tier string) ScrapeTarget
TargetFromPlugin builds a ScrapeTarget for a plugin declared with the given name, port, and tier ("free" or "pro"). It sets sensible labels so the Grafana dashboards can filter by plugin and tier.
func TempoScrapeTarget ¶ added in v1.0.12
func TempoScrapeTarget(cfg *TempoConfig) ScrapeTarget
TempoScrapeTarget returns the Prometheus ScrapeTarget for the Tempo service. Tempo exposes its own metrics at /metrics on the HTTP port.
type TempoConfig ¶ added in v1.0.12
type TempoConfig struct {
// RetentionPeriod is how long traces are kept. Accepts Go-style durations
// (e.g. "168h" = 7d, "720h" = 30d). Default: 168h (7 days).
RetentionPeriod string
// StoragePath is the bind-mount path inside the Tempo container where
// trace data is persisted. Default: /var/tempo.
StoragePath string
// HTTPPort is the Tempo HTTP port used for UI + /api/v1/push ingress.
// Default: 3200.
HTTPPort int
// OTLPGRPCPort is the gRPC port Tempo listens on for OTLP spans from the
// OTEL Collector. Default: 4317.
OTLPGRPCPort int
}
TempoConfig captures the fields needed to render tempo.yml and the Tempo and OTEL Collector docker-compose service blocks.
Tempo is Grafana's distributed tracing backend. It receives spans over OTLP (from the OTEL Collector) and makes them queryable from Grafana via the Tempo data source.
OTEL Collector sits in front of Tempo and acts as a fan-out router: app services (Hasura, plugins, custom services) emit spans to the collector via OTLP gRPC (:4317) or OTLP HTTP (:4318), and the collector forwards them to Tempo's OTLP HTTP ingress (:4318).
func DefaultTempoConfig ¶ added in v1.0.12
func DefaultTempoConfig() *TempoConfig
DefaultTempoConfig returns nSelf's out-of-the-box Tempo settings: 7-day retention, local filesystem storage, standard ports.