carousul

command module
v0.0.0-...-d3a60df
Published: Jun 23, 2019 License: MIT Imports: 9 Imported by: 0

Carousul

Cassandra Anti-entropy Repair ou Consul

Program for performing anti-entropy repair with the nodetool repair command and Consul for distributed locking.

Flags

keyspace: Cassandra keyspace to repair. Carousul only takes responsibility for one keyspace at a time.

lockprefix: Consul KV prefix indicating where locks are to be created.

textfiledir: Prometheus node exporter textfile directory. A file in text-based exposition format will be written here.

Usage

Carousul is to be executed simultaneously on all nodes of a single datacenter. The nodes then perform the repair operation one at a time.

./carousul -keyspace=keyspace -lockprefix=prefix -textfiledir=/metrics
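The three flags above could be parsed with Go's standard flag package. The following is a minimal sketch, not Carousul's actual source: the config type and parseFlags function are hypothetical names, and treating all three flags as required is an assumption.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// config holds the three documented flags (hypothetical type).
type config struct {
	keyspace    string
	lockPrefix  string
	textfileDir string
}

// parseFlags parses args (without the program name) and rejects empty values.
// Requiring every flag is an assumption, not documented behaviour.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("carousul", flag.ContinueOnError)
	fs.StringVar(&c.keyspace, "keyspace", "", "Cassandra keyspace to repair")
	fs.StringVar(&c.lockPrefix, "lockprefix", "", "Consul KV prefix for locks")
	fs.StringVar(&c.textfileDir, "textfiledir", "", "node exporter textfile directory")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if c.keyspace == "" || c.lockPrefix == "" || c.textfileDir == "" {
		return c, errors.New("keyspace, lockprefix and textfiledir are all required")
	}
	return c, nil
}

func main() {
	c, err := parseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("keyspace=%q lockprefix=%q textfiledir=%q\n", c.keyspace, c.lockPrefix, c.textfileDir)
}
```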

Metrics

The following metrics are written for collection:

  • cassandra_repair_success: 0 or 1 indicating success or failure.
  • cassandra_repair_duration_lock_milliseconds: how long it took to obtain the lock.
  • cassandra_repair_duration_repair_milliseconds: how long the actual repair took.
  • cassandra_repair_duration_total_milliseconds: total duration of program execution.

Repair Considerations

Full vs. Incremental

Only full repairs are done.

There are conflicting recommendations on this topic. Incremental repairs are compelling because they reduce repair time significantly, and they were claimed to be "more efficient" when they became available in Cassandra 2.1. However, they do not maintain data integrity, and the repair command docs state that they are "not recommended". Therefore, only full repairs are done.

Partitioner Range

Only the primary partition ranges of the node being repaired are repaired. This prevents Cassandra from repairing the same range of data several times, since every range is the primary range of exactly one node. It is also the recommended approach for routine maintenance.

Parallel vs. DC-Parallel vs. Sequential

DC-Parallel repair is used.

Sequential repair requires that each node of a cluster run a repair command one after the other. This repair strategy entails maximum operational overhead.

Parallel repair repairs all nodes in all datacenters at the same time. This repair strategy entails maximum performance impact.

DC-Parallel combines them by running sequential repairs across datacenters in parallel. This means that a complete repair can be accomplished by running repairs on each node of just a single datacenter one at a time.

Compared to sequential repair, this is less operationally complex. It is much easier to automate dc-parallel repair in a single datacenter than sequential repair across all datacenters, especially since sequential repair allows only one node in the entire cluster to be repaired at a time.

Compared to parallel, this is less resource intensive. This is because only nodes that own replica data in common with the coordinator node's primary partition range will be doing work.

Distributed Locking

Distributed locking is implemented to ensure that this program is executed on one node at a time. The program is to be run simultaneously on the nodes of a "coordinating" datacenter. The coordinating datacenter is the only datacenter that needs to run repair because "dc-parallel" repair is implemented. The nodes compete for a lock. A node does not start its repair until it holds the lock, and each node will eventually obtain it. Once every node of the coordinating datacenter has had its turn, the targeted keyspace will have been fully repaired.

Consul is used to achieve this. Consul sessions are the basis for the approach. Sessions address the following concerns:

Lock Release on Failure

Sessions have a TTL. A lock will be released when its session expires.

Repair Exceeds TTL of Session

As long as the program is alive, its session will be automatically renewed. The renewal is handled by func (*Lock) Lock in Consul's API client.

Unexpected Session Expiration

If a session is invalidated before a repair is completed, the repair will be interrupted. While the interruption results in a failed repair, the rest of the cluster will be able to continue safely.
