Documentation ¶
Overview ¶
Package dr provides disaster recovery operations: drills, standby promotion, rollback, and split-brain fencing.
Index ¶
- Constants
- Variables
- func Fence(ctx context.Context, cfg *config.Config) error
- func FormatDrillResult(result *DrillResult, format string) (string, error)
- func InstallSystemdUnits(opts SystemdInstallOptions) error
- func PromoteStandby(ctx context.Context, cfg *config.Config, opts PromoteOptions) error
- func ReconfigureDNS(ctx context.Context, cfg *config.Config, newIP string) error
- func RenderCloudInit(p CloudInitParams) (string, error)
- func Rollback(ctx context.Context, cfg *config.Config) error
- type CloudInitParams
- type DrillOptions
- type DrillReport
- type DrillResult
- type DrillScenarios
- type PromoteOptions
- type Scenario
- type SystemdInstallOptions
- type SystemdUnitFiles
Constants ¶
const DrillAlertRuleYAML = `` /* 353-byte string literal not displayed */
DrillAlertRuleYAML is the exact Alertmanager/Prometheus rule that fires when a monthly drill produces a non-"pass" result. The content is deployed to `web/backend/nself/monitoring/alerts/dr.rules.yml` and evaluated by the Prometheus instance on nclaw-prod.
const DrillReportTableDDL = `` /* 640-byte string literal not displayed */
DrillReportTableDDL is the idempotent Postgres DDL for the report table. It is applied once on nclaw-prod during cron install.
const DrillResultMetricName = "nself_dr_drill_result"
DrillResultMetricName is the Prometheus metric name emitted by a drill run. Labels: drill_id (unique per run), result ("pass" or "fail"), rto_sec, rpo_sec.
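As a rough illustration of the metric and labels described above, the sketch below formats one sample in the Prometheus text exposition format. The helper name `drillMetricLine` and the sample value (1 for pass, 0 for fail) are assumptions; this documentation specifies only the metric name and its labels.

```go
package main

import (
	"fmt"
	"strconv"
)

// metricName mirrors DrillResultMetricName from this package.
const metricName = "nself_dr_drill_result"

// drillMetricLine formats one sample line in the Prometheus text format.
// Assumption: the sample value is 1 for "pass" and 0 for anything else.
// %q quoting matches Prometheus label escaping for plain ASCII values.
func drillMetricLine(drillID, result string, rtoSec, rpoSec int64) string {
	value := 0
	if result == "pass" {
		value = 1
	}
	return fmt.Sprintf("%s{drill_id=%q,result=%q,rto_sec=%q,rpo_sec=%q} %d",
		metricName, drillID, result,
		strconv.FormatInt(rtoSec, 10), strconv.FormatInt(rpoSec, 10), value)
}

func main() {
	// The drill ID below is a made-up example value.
	fmt.Println(drillMetricLine("drill-2025-01", "pass", 840, 300))
}
```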
Variables ¶
var RequiredPluginChecks = []string{"claw", "ai", "mux"}
RequiredPluginChecks lists plugins whose health MUST be present in every drill report. Additional installed plugins are appended at runtime.
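One way the runtime append could work is sketched below: start from the required list, then add any other installed plugins once each. The helper name `pluginsToCheck`, the duplicate handling, and the plugin name "notify" are assumptions for illustration only.

```go
package main

import "fmt"

// requiredPluginChecks mirrors RequiredPluginChecks from this package.
var requiredPluginChecks = []string{"claw", "ai", "mux"}

// pluginsToCheck returns the required plugins followed by any additional
// installed plugins, skipping names that are already in the list.
func pluginsToCheck(installed []string) []string {
	seen := make(map[string]bool, len(requiredPluginChecks)+len(installed))
	out := make([]string, 0, len(requiredPluginChecks)+len(installed))
	for _, name := range requiredPluginChecks {
		seen[name] = true
		out = append(out, name)
	}
	for _, name := range installed {
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// "notify" is a hypothetical extra plugin.
	fmt.Println(pluginsToCheck([]string{"ai", "notify"})) // [claw ai mux notify]
}
```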
Functions ¶
func FormatDrillResult ¶
func FormatDrillResult(result *DrillResult, format string) (string, error)
FormatDrillResult renders a drill result as JSON or table.
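The two output modes can be sketched with the standard library alone. The example below uses a trimmed stand-in type rather than the real DrillResult, and the accepted format strings ("json", "table") and the table layout are assumptions, not the package's confirmed behaviour.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/tabwriter"
)

// result is a trimmed stand-in for DrillResult, kept small for the sketch.
type result struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}

// format renders r as "json" or "table" and rejects any other format name.
func format(r *result, kind string) (string, error) {
	if r == nil {
		return "", fmt.Errorf("nil result")
	}
	switch kind {
	case "json":
		b, err := json.MarshalIndent(r, "", "  ")
		if err != nil {
			return "", err
		}
		return string(b), nil
	case "table":
		var buf bytes.Buffer
		// Align columns with spaces: minwidth 0, tabwidth 4, padding 2.
		w := tabwriter.NewWriter(&buf, 0, 4, 2, ' ', 0)
		fmt.Fprintln(w, "ID\tSTATUS")
		fmt.Fprintf(w, "%s\t%s\n", r.ID, r.Status)
		if err := w.Flush(); err != nil {
			return "", err
		}
		return buf.String(), nil
	default:
		return "", fmt.Errorf("unsupported format %q", kind)
	}
}

func main() {
	out, err := format(&result{ID: "drill-1", Status: "success"}, "table")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```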
func InstallSystemdUnits ¶
func InstallSystemdUnits(opts SystemdInstallOptions) error
InstallSystemdUnits renders and writes unit files, then runs `systemctl daemon-reload` and enables the drill timer. Requires root.
func PromoteStandby ¶
func PromoteStandby(ctx context.Context, cfg *config.Config, opts PromoteOptions) error
PromoteStandby promotes the warm standby to primary and updates DNS.
func ReconfigureDNS ¶
func ReconfigureDNS(ctx context.Context, cfg *config.Config, newIP string) error
ReconfigureDNS updates Cloudflare A records to point to a new IP.
func RenderCloudInit ¶
func RenderCloudInit(p CloudInitParams) (string, error)
RenderCloudInit returns the cloud-init user-data YAML for a drill VM. The output is deterministic for a given params set so that operators can diff the rendered YAML against the last known good template when a drill fails with a cloud-init error (see the fail-fix playbook).
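The determinism property can be illustrated with `text/template`: if the template reads only from the params struct (no clock, no environment, no random values), equal params give byte-identical output. The template body and the `params` and `render` names below are hypothetical; the real user-data template is not shown in this documentation.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// params is a trimmed stand-in for CloudInitParams.
type params struct {
	DrillID      string
	BackupID     string
	NselfVersion string // empty means latest, as documented for CloudInitParams
}

// userData is a hypothetical template used only for this sketch.
var userData = template.Must(template.New("user-data").Parse(`#cloud-config
write_files:
  - path: /etc/nself/drill.env
    content: |
      DRILL_ID={{.DrillID}}
      BACKUP_ID={{.BackupID}}
      NSELF_VERSION={{if .NselfVersion}}{{.NselfVersion}}{{else}}latest{{end}}
`))

// render executes the template against p and returns the YAML text.
func render(p params) (string, error) {
	var buf bytes.Buffer
	if err := userData.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	p := params{DrillID: "drill-1", BackupID: "bk-42"}
	a, err := render(p)
	if err != nil {
		panic(err)
	}
	b, err := render(p)
	if err != nil {
		panic(err)
	}
	fmt.Println(a == b) // true: same params, same bytes
}
```

Because the output is stable, a plain text diff between two renders shows only the parameters that changed.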
Types ¶
type CloudInitParams ¶
type CloudInitParams struct {
DrillID string
BackupID string
B2Bucket string
B2KeyID string
B2AppKey string
AgeKeyMaterial string // contents of age-key.txt; embedded, never logged
SSHPublicKey string
ReporterURL string // nclaw-prod API endpoint that receives the report
ReporterToken string
NselfVersion string // e.g. v1.0.3; empty means latest
}
CloudInitParams feeds the drill VM user-data template. The resulting cloud-init YAML installs Docker and the nSelf CLI, then runs the drill entrypoint which restores the latest backup and executes the smoke suite.
type DrillOptions ¶
DrillOptions holds flags for `nself dr drill`.
type DrillReport ¶
type DrillReport struct {
DrillID string `json:"drill_id"`
StartedAt time.Time `json:"started_at"`
FinishedAt time.Time `json:"finished_at"`
VMID string `json:"vm_id"`
BackupID string `json:"backup_id"`
RTOActualSec int64 `json:"rto_actual_sec"`
RPOActualSec int64 `json:"rpo_actual_sec"`
Result string `json:"result"` // pass | fail
Scenarios DrillScenarios `json:"scenarios"`
CostEUR float64 `json:"cost_eur"`
}
DrillReport is the persisted JSON schema for a monthly DR drill run.
It is inserted into the `dr_drill_report` PG table on nclaw-prod via an authenticated API call, and is also the payload posted to the #ops-dr Telegram channel. Field names match the spec in p88-block-g section 4.4.
func NewDrillReport ¶
func NewDrillReport(drillID, vmID, backupID string, startedAt time.Time) *DrillReport
NewDrillReport builds a new DrillReport with zero values for all scenario checks and a pre-populated PluginHealth map seeded with required plugins.
func (*DrillReport) Finalize ¶
func (r *DrillReport) Finalize(finishedAt time.Time)
Finalize stamps the finish time and computes pass/fail across all scenarios. A report passes if and only if every scenario check and every plugin health check is true, and both the RTO and RPO values were recorded (each greater than zero).
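The pass/fail rule stated above can be written out as a small predicate. This is a sketch of the documented rule, not the package's actual code; `scenarios` is a local copy of the DrillScenarios fields and `passes` is a hypothetical helper name.

```go
package main

import "fmt"

// scenarios mirrors the DrillScenarios fields documented in this package.
type scenarios struct {
	PGRestore     bool
	HasuraUp      bool
	MinIOMetadata bool
	PluginHealth  map[string]bool
}

// passes applies the documented rule: every scenario check and every plugin
// health check is true, and both RTO and RPO are greater than zero.
func passes(s scenarios, rtoSec, rpoSec int64) bool {
	if !s.PGRestore || !s.HasuraUp || !s.MinIOMetadata {
		return false
	}
	for _, healthy := range s.PluginHealth {
		if !healthy {
			return false
		}
	}
	return rtoSec > 0 && rpoSec > 0
}

func main() {
	s := scenarios{
		PGRestore:     true,
		HasuraUp:      true,
		MinIOMetadata: true,
		PluginHealth:  map[string]bool{"claw": true, "ai": true, "mux": true},
	}
	fmt.Println(passes(s, 840, 300)) // true
	fmt.Println(passes(s, 0, 300))   // false: RTO was never recorded
}
```

Note that an empty PluginHealth map would pass the loop vacuously in this sketch. The documentation says NewDrillReport seeds the map with the required plugins, which starts those entries at their zero value, false.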
func (*DrillReport) Marshal ¶
func (r *DrillReport) Marshal() ([]byte, error)
Marshal renders the report in the exact field order defined by the spec.
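A fixed field order is straightforward to get from `encoding/json`, which emits struct fields in declaration order. The sketch below shows that behaviour on a trimmed stand-in for DrillReport; whether the method above is implemented this way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report is a trimmed stand-in for DrillReport with three of its fields.
type report struct {
	DrillID string  `json:"drill_id"`
	Result  string  `json:"result"`
	CostEUR float64 `json:"cost_eur"`
}

// encode marshals r; the keys appear in struct declaration order.
func encode(r report) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encode(report{DrillID: "drill-1", Result: "pass", CostEUR: 0.04})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"drill_id":"drill-1","result":"pass","cost_eur":0.04}
}
```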
func (*DrillReport) Validate ¶
func (r *DrillReport) Validate() error
Validate ensures the report has every required field populated. It is used by the storage layer to reject malformed reports before they hit PG.
type DrillResult ¶
type DrillResult struct {
ID string `json:"id"`
Scenario Scenario `json:"scenario"`
StartedAt time.Time `json:"started_at"`
FinishedAt time.Time `json:"finished_at"`
Status string `json:"status"` // success, failed
RowCountDelta map[string]int64 `json:"row_count_delta"`
Details map[string]interface{} `json:"details"`
}
DrillResult holds the outcome of a DR drill.
func Drill ¶
func Drill(ctx context.Context, cfg *config.Config, opts DrillOptions) (*DrillResult, error)
Drill executes a disaster recovery drill by provisioning a fresh VM, restoring from backup, and verifying data integrity.
type DrillScenarios ¶
type DrillScenarios struct {
PGRestore bool `json:"pg_restore"`
HasuraUp bool `json:"hasura_up"`
MinIOMetadata bool `json:"minio_metadata"`
PluginHealth map[string]bool `json:"plugin_health"`
}
DrillScenarios captures per-check pass/fail across the smoke suite. All boolean fields are true on success. PluginHealth maps plugin name to health status and MUST include claw, ai, and mux at minimum.
type PromoteOptions ¶
PromoteOptions holds flags for `nself dr promote-standby`.
type SystemdInstallOptions ¶
type SystemdInstallOptions struct {
// Schedule is the desired cadence. Only "monthly" is supported today.
// Empty means monthly (OnCalendar=*-*-01 05:00:00).
Schedule string
// HetznerProject identifies the Hetzner project that owns the drill VM.
// Used to select the correct API token from the env file (for example
// "camarata" selects HETZNER_CAMARATA_TOKEN).
HetznerProject string
// VMType is the Hetzner server type used for the drill VM. Defaults to
// "cx22" (smallest shared-CPU tier in fsn1) to keep drill cost < €0.05.
VMType string
// SSHKey is the absolute path to the public SSH key injected into the
// drill VM via cloud-init. Defaults to /root/.config/nself/dr-key.pub.
SSHKey string
// Region is the Hetzner location for provisioning. Defaults to fsn1.
Region string
UnitDir string // default /etc/systemd/system
EnvFile string // default /etc/nself/dr.env
BinaryPath string // default /usr/local/bin/nself
ProjectDir string // default /opt/nself
DryRun bool
}
SystemdInstallOptions controls install of the monthly DR drill timer.
The unit runs `nself dr drill --now` on the first of each month at 05:00 UTC, provisioning a throwaway Hetzner VM, restoring the latest backup, running a smoke suite, recording timings to the dr_drill_report PG table, and destroying the VM.
type SystemdUnitFiles ¶
SystemdUnitFiles maps filename -> unit contents.
func RenderSystemdUnits ¶
func RenderSystemdUnits(opts SystemdInstallOptions) (SystemdUnitFiles, error)
RenderSystemdUnits produces the unit files for the monthly DR drill without writing to disk. Returns a map suitable for writing into /etc/systemd/system.
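A minimal sketch of render-without-writing follows, using the defaults documented for SystemdInstallOptions. The unit file names ("nself-dr-drill.service", "nself-dr-drill.timer") and the unit bodies are hypothetical; only the default paths, the `nself dr drill --now` command, and the OnCalendar value come from this documentation.

```go
package main

import "fmt"

// options is a trimmed stand-in for SystemdInstallOptions.
type options struct {
	Schedule   string
	BinaryPath string
	EnvFile    string
}

// renderUnits returns filename -> unit contents without touching the disk.
func renderUnits(o options) (map[string]string, error) {
	// Apply the documented defaults for empty fields.
	if o.Schedule == "" {
		o.Schedule = "monthly"
	}
	if o.Schedule != "monthly" {
		return nil, fmt.Errorf("unsupported schedule %q", o.Schedule)
	}
	if o.BinaryPath == "" {
		o.BinaryPath = "/usr/local/bin/nself"
	}
	if o.EnvFile == "" {
		o.EnvFile = "/etc/nself/dr.env"
	}
	service := fmt.Sprintf(
		"[Unit]\nDescription=nSelf DR drill\n\n"+
			"[Service]\nType=oneshot\nEnvironmentFile=%s\nExecStart=%s dr drill --now\n",
		o.EnvFile, o.BinaryPath)
	timer := "[Unit]\nDescription=Monthly nSelf DR drill\n\n" +
		"[Timer]\nOnCalendar=*-*-01 05:00:00\n\n" +
		"[Install]\nWantedBy=timers.target\n"
	return map[string]string{
		"nself-dr-drill.service": service, // hypothetical file name
		"nself-dr-drill.timer":   timer,   // hypothetical file name
	}, nil
}

func main() {
	units, err := renderUnits(options{})
	if err != nil {
		panic(err)
	}
	fmt.Print(units["nself-dr-drill.timer"])
}
```

Keeping rendering separate from installation lets a dry run or a test inspect the unit text without root. systemd evaluates OnCalendar in the host's local timezone unless the expression carries a timezone suffix, so the 05:00 UTC schedule holds as written on hosts whose clock is set to UTC.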