switchover

v0.28.0
Published: Aug 30, 2021 License: Apache-2.0 Imports: 19 Imported by: 0

README

DB Switchover

Switchover functionality is intended to replicate data to a new, empty DB and coordinate all GoAlert instances to switch their active DB without downtime or loss of data, and with minimal latency impact.

High-Level Flow

  1. GoAlert instances start in "switchover mode", and know about both DBs
  2. A control shell validates config & state
  3. New DB is migrated to be structurally identical to the old one
  4. Old DB is instrumented with a changelog
  5. An initial sync is performed of all data old -> new
  6. Timetable is broadcast to begin switchover
  7. Pause all DB queries
  8. Replicate changes since last sync
  9. Unpause and use new DB

If at any time a node is introduced, the config changes, or a deadline is exceeded, nodes will broadcast an abort event and resume normal operation. All operations end by the deadline, after which every node uses either the old DB or the new one (with all changes included).

Switchover Mode

When in switchover mode, GoAlert instances will operate with a wrapped DB driver that will determine which DB to use for each connection that enters the pool.

GoAlert starts in "Switchover Mode" when --db-url-next is set.

New Connections
  1. Connect to the old DB
  2. Acquire a shared advisory lock
  3. Check switchover_state for use_next_db
  4. If set, close the connection and return a connection to the new DB instead
  5. If not set, return current connection

When the final sync is performed, an exclusive advisory lock is acquired within its transaction. Since this conflicts with the shared lock, it ensures the final sync runs without any queries in flight on the old DB. When the transaction ends, switchover_state is checked and either the old or the new DB will be used for all connections, depending on the success of the final sync.
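
The five "New Connections" steps can be sketched as follows. This is an illustration under stated assumptions, not the package's driver code: `conn`, `dialer`, `fakeConn`, and `fakeDialer` are hypothetical, and the lock and state check are injected as functions so the example runs without a database. In a real setup the shared lock would presumably be taken with something like Postgres's `pg_advisory_lock_shared`.

```go
package main

import "fmt"

// conn is a hypothetical stand-in for a database connection.
type conn interface {
	Close() error
}

// dialer groups the operations the wrapped driver is assumed to need.
type dialer struct {
	dialOld    func() (conn, error)
	dialNew    func() (conn, error)
	lockShared func(conn) error         // e.g. via Postgres pg_advisory_lock_shared
	useNextDB  func(conn) (bool, error) // reads switchover_state on the old DB
}

// connect follows the five "New Connections" steps.
func (d dialer) connect() (conn, error) {
	old, err := d.dialOld() // 1. connect to the old DB
	if err != nil {
		return nil, err
	}
	if err := d.lockShared(old); err != nil { // 2. acquire the shared advisory lock
		old.Close()
		return nil, err
	}
	next, err := d.useNextDB(old) // 3. check switchover_state for use_next_db
	if err != nil {
		old.Close()
		return nil, err
	}
	if !next {
		return old, nil // 5. not set: keep the old-DB connection
	}
	if err := old.Close(); err != nil { // 4. set: close, then dial the new DB
		return nil, err
	}
	return d.dialNew()
}

// fakeConn is an in-memory connection used only for this demonstration.
type fakeConn struct {
	name   string
	closed bool
}

func (c *fakeConn) Close() error { c.closed = true; return nil }

// fakeDialer returns a dialer backed by fakes, plus the old connection for inspection.
func fakeDialer(useNext bool) (dialer, *fakeConn) {
	old := &fakeConn{name: "old"}
	return dialer{
		dialOld:    func() (conn, error) { return old, nil },
		dialNew:    func() (conn, error) { return &fakeConn{name: "new"}, nil },
		lockShared: func(conn) error { return nil },
		useNextDB:  func(conn) (bool, error) { return useNext, nil },
	}, old
}

func main() {
	for _, useNext := range []bool{false, true} {
		d, old := fakeDialer(useNext)
		c, err := d.connect()
		if err != nil {
			panic(err)
		}
		fmt.Println("use_next_db:", useNext, "conn:", c.(*fakeConn).name, "old closed:", old.closed)
	}
}
```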

Node/Instance States

These states can be viewed in the logs or with the status command in the switchover-shell.

  State           Description
  starting        Node has reset or is still starting up.
  ready           Node is idle and ready for instructions.
  armed           Node has received the switchover timetable and is waiting for confirmation from other nodes.
  armed-waiting   Node is waiting for the pause phase (all known nodes confirmed).
  pausing         Node is waiting for the engine to finish pausing.
  paused-waiting  Node is paused. Engine will not run, and idle connections are disabled.
  complete        Normal operation resumed, next DB is in use (use_next_db is set).
  aborted         Something has triggered an abort. Node has resumed normal operation.

Performing a Switchover

To perform a switchover:

  1. Set/configure --db-url-next for all GoAlert instances
  2. Run goalert switchover-shell with --db-url and --db-url-next set

From the switchover shell:

  1. Run reset and wait for all nodes to be ready (use status or status -w)
  2. Use status to validate that there are no problems. You should see "No Problems Found" printed at the bottom, or a list of problems with possible remediations.
  3. Enable change tracking (for logical replication) with enable
  4. Optionally run sync (it will be run as part of execute)
  5. Run execute and confirm to perform the switchover
  6. Configure all GoAlert instances to use the new --db-url and un-set --db-url-next

The execute command will ask for confirmation of the proposed timetable:

Switch-Over Details
  Pause API Requests: no       # Pause API requests for the full duration, instead of just the final sync
  Consensus Timeout : 3s       # Deadline for all nodes to confirm they got the timetable and are ready
  Pause Starts After: 5s       # How long to wait before beginning the pause
  Pause Timeout     : 10s      # Max time to wait for all nodes to pause before aborting
  Absolute Max Pause: 13s      # The maximum possible pause time of the engine (and API requests if set above)
  Avail. Sync Time  : 1s - 11s # Indicates the possible final sync time allotted with this configuration
  Max Alloted Time  : 18s      # Max time from beginning to end of the switchover process

Ready?

   Cancel
 ❯ Go!

Shell commands also support -h for extra information and options. Use CTRL+C to cancel an operation (like status -w or sync).

To completely reset in the event of an issue:

  1. Run disable to remove triggers
  2. Run reset-dest to truncate all tables in --db-url-next
  3. Run reset to reset all nodes

If execute fails (e.g. due to a deadline) it should be safe to retry.

Documentation

Constants

const (
	StateChannel   = "goalert_switchover_state"
	ControlChannel = "goalert_switchover_control"
	DBIDChannel    = "goalert_switchover_db_id"
)

Postgres channel names

const (
	StateStarting  = State("starting")
	StateReady     = State("ready")
	StateArmed     = State("armed")
	StateArmWait   = State("armed-waiting")
	StatePausing   = State("pausing")
	StatePaused    = State("paused")
	StatePauseWait = State("paused-waiting")
	StateComplete  = State("complete")
	StateAbort     = State("aborted")
)

Possible states

Variables

This section is empty.

Functions

func CalcDBOffset

func CalcDBOffset(ctx context.Context, db *sql.DB) (time.Duration, error)

Types

type App

type App interface {
	Pause(context.Context) error
	Resume()
	Status() lifecycle.Status
}

type DeadlineConfig

type DeadlineConfig struct {
	BeginAt          time.Time     // The start-time of the Switch-Over.
	ConsensusTimeout time.Duration // Amount of time to wait for consensus amongst all nodes before aborting.
	PauseDelay       time.Duration // How long to wait after starting before beginning the global pause.
	PauseTimeout     time.Duration // Timeout to achieve global pause before aborting.
	MaxPause         time.Duration // Absolute maximum amount of time for any operation to be delayed due to the Switch-Over.
	NoPauseAPI       bool          // Allow HTTP/API requests during Pause phase.
}

DeadlineConfig controls the timing of a Switch-Over operation.

func ConfigFromContext

func ConfigFromContext(ctx context.Context) DeadlineConfig

ConfigFromContext returns the DeadlineConfig associated with the current context.

func DefaultConfig

func DefaultConfig() DeadlineConfig

DefaultConfig returns the default deadline configuration.

func ParseDeadlineConfig

func ParseDeadlineConfig(s string, offset time.Duration) (*DeadlineConfig, error)

ParseDeadlineConfig will parse deadline configuration (given by Serialize) from a string. Offset should be the time difference between the local clock and the central clock (i.e. Postgres).

func (DeadlineConfig) AbsoluteDeadline

func (cfg DeadlineConfig) AbsoluteDeadline() time.Time

AbsoluteDeadline will calculate the absolute deadline of the entire switchover operation.

func (DeadlineConfig) ConsensusDeadline

func (cfg DeadlineConfig) ConsensusDeadline() time.Time

ConsensusDeadline will return the deadline for consensus amongst all nodes.

func (DeadlineConfig) PauseAt

func (cfg DeadlineConfig) PauseAt() time.Time

PauseAt will return the time global pause begins.

func (DeadlineConfig) PauseDeadline

func (cfg DeadlineConfig) PauseDeadline() time.Time

PauseDeadline will return the deadline to achieve global pause.

func (DeadlineConfig) Serialize

func (cfg DeadlineConfig) Serialize(offset time.Duration) string

Serialize returns a textual representation of DeadlineConfig that can be transmitted to other nodes. Offset should be the time difference between the local clock and the central clock (i.e. Postgres).

type Handler

type Handler struct {
	// contains filtered or unexported fields
}

func NewHandler

func NewHandler(ctx context.Context, oldC, newC driver.Connector, oldURL, newURL string) (*Handler, error)

func (*Handler) Abort

func (h *Handler) Abort()

func (*Handler) CheckDBID added in v0.23.0

func (h *Handler) CheckDBID(id string) bool

CheckDBID will return true if the ID matches the current DB ID. If there is no current ID, then it is set and returns true.

func (*Handler) CheckDBNextID added in v0.23.0

func (h *Handler) CheckDBNextID(id string) bool

CheckDBNextID will return true if the ID matches the current db-next ID. If there is no current ID, then it is set and returns true.

func (*Handler) Connect

func (h *Handler) Connect(ctx context.Context) (c driver.Conn, err error)

func (*Handler) DB

func (h *Handler) DB() *sql.DB

func (*Handler) Driver

func (h *Handler) Driver() driver.Driver

func (*Handler) SetApp

func (h *Handler) SetApp(app App)

func (*Handler) Status

func (h *Handler) Status() *Status

type State

type State string

State indicates the current state of a node.

func (State) IsActive

func (s State) IsActive() bool

IsActive will return true if the state represents an on-going change-over event.
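
As an illustration of how such a check might be used, the sketch below classifies node states and reports whether any node is mid-switchover. Which states count as active is an assumption here (everything between receiving the timetable and finishing the pause); `nodeState`, `isActive`, and `anyActive` are hypothetical names, not the package's API.

```go
package main

import "fmt"

// nodeState mirrors the package's string-based State type for illustration.
type nodeState string

// isActive reports whether a state is part of an ongoing switchover.
// ASSUMPTION: the armed and pause-related states are the active ones.
func isActive(s nodeState) bool {
	switch s {
	case "armed", "armed-waiting", "pausing", "paused", "paused-waiting":
		return true
	}
	return false
}

// anyActive reports whether at least one node is mid-switchover.
func anyActive(states []nodeState) bool {
	for _, s := range states {
		if isActive(s) {
			return true
		}
	}
	return false
}

func main() {
	idle := []nodeState{"ready", "ready", "complete"}
	busy := []nodeState{"ready", "armed-waiting", "pausing"}

	fmt.Println("idle cluster active:", anyActive(idle))
	fmt.Println("busy cluster active:", anyActive(busy))
}
```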

type Status

type Status struct {
	NodeID string
	State  State
	Offset time.Duration
	At     time.Time

	ActiveRequests int

	DBID     string
	DBNextID string
}

Status represents the status of an individual node.

func ParseStatus

func ParseStatus(str string) (*Status, error)
