memberlist

package
v0.6.0-rc1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 19, 2016 License: MPL-2.0, Apache-2.0 Imports: 21 Imported by: 0

README

memberlist

memberlist is a Go library that manages cluster membership and member failure detection using a gossip based protocol.

The use cases for such a library are far-reaching: all distributed systems require membership, and memberlist is a re-usable solution to managing cluster membership and node failure detection.

memberlist is eventually consistent but converges quickly on average. The speed at which it converges can be heavily tuned via various knobs on the protocol. Node failures are detected and network partitions are partially tolerated by attempting to communicate to potentially dead nodes through multiple routes.

Building

If you wish to build memberlist you'll need Go version 1.2+ installed.

Please check your installation with:

go version

Usage

Memberlist is surprisingly simple to use. An example is shown below:

/* Create the initial memberlist from a safe configuration.
   Please reference the godoc for other default config types.
   http://godoc.org/github.com/hashicorp/memberlist#Config
*/
list, err := memberlist.Create(memberlist.DefaultLocalConfig())
if err != nil {
	panic("Failed to create memberlist: " + err.Error())
}

// Join an existing cluster by specifying at least one known member.
n, err := list.Join([]string{"1.2.3.4"})
if err != nil {
	panic("Failed to join cluster: " + err.Error())
}

// Ask for members of the cluster
for _, member := range list.Members() {
	fmt.Printf("Member: %s %s\n", member.Name, member.Addr)
}

// Continue doing whatever you need, memberlist will maintain membership
// information in the background. Delegates can be used for receiving
// events when members join or leave.

The most difficult part of memberlist is configuring it since it has many available knobs in order to tune state propagation delay and convergence times. Memberlist provides a default configuration that offers a good starting point, but errs on the side of caution, choosing values that are optimized for higher convergence at the cost of higher bandwidth usage.

For complete documentation, see the associated Godoc.

Protocol

memberlist is based on "SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol", with a few minor adaptations, mostly to increase propogation speed and convergence rate.

A high level overview of the memberlist protocol (based on SWIM) is described below, but for details please read the full SWIM paper followed by the memberlist source. We welcome any questions related to the protocol on our issue tracker.

Protocol Description

memberlist begins by joining an existing cluster or starting a new cluster. If starting a new cluster, additional nodes are expected to join it. New nodes in an existing cluster must be given the address of at least one existing member in order to join the cluster. The new member does a full state sync with the existing member over TCP and begins gossiping its existence to the cluster.

Gossip is done over UDP to a with a configurable but fixed fanout and interval. This ensures that network usage is constant with regards to number of nodes, as opposed to exponential growth that can occur with traditional heartbeat mechanisms. Complete state exchanges with a random node are done periodically over TCP, but much less often than gossip messages. This increases the likelihood that the membership list converges properly since the full state is exchanged and merged. The interval between full state exchanges is configurable or can be disabled entirely.

Failure detection is done by periodic random probing using a configurable interval. If the node fails to ack within a reasonable time (typically some multiple of RTT), then an indirect probe is attempted. An indirect probe asks a configurable number of random nodes to probe the same node, in case there are network issues causing our own node to fail the probe. If both our probe and the indirect probes fail within a reasonable time, then the node is marked "suspicious" and this knowledge is gossiped to the cluster. A suspicious node is still considered a member of cluster. If the suspect member of the cluster does not disputes the suspicion within a configurable period of time, the node is finally considered dead, and this state is then gossiped to the cluster.

This is a brief and incomplete description of the protocol. For a better idea, please read the SWIM paper in its entirety, along with the memberlist source code.

Changes from SWIM

As mentioned earlier, the memberlist protocol is based on SWIM but includes minor changes, mostly to increase propogation speed and convergence rates.

The changes from SWIM are noted here:

  • memberlist does a full state sync over TCP periodically. SWIM only propagates changes over gossip. While both eventually reach convergence, the full state sync increases the likelihood that nodes are fully converged more quickly, at the expense of more bandwidth usage. This feature can be totally disabled if you wish.

  • memberlist has a dedicated gossip layer separate from the failure detection protocol. SWIM only piggybacks gossip messages on top of probe/ack messages. memberlist also piggybacks gossip messages on top of probe/ack messages, but also will periodically send out dedicated gossip messages on their own. This feature lets you have a higher gossip rate (for example once per 200ms) and a slower failure detection rate (such as once per second), resulting in overall faster convergence rates and data propogation speeds. This feature can be totally disabed as well, if you wish.

  • memberlist stores around the state of dead nodes for a set amount of time, so that when full syncs are requested, the requester also receives information about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node state immediately upon learning that the node is dead. This change again helps the cluster converge more quickly.

Documentation

Overview

memberlist is a library that manages cluster membership and member failure detection using a gossip based protocol.

The use cases for such a library are far-reaching: all distributed systems require membership, and memberlist is a re-usable solution to managing cluster membership and node failure detection.

memberlist is eventually consistent but converges quickly on average. The speed at which it converges can be heavily tuned via various knobs on the protocol. Node failures are detected and network partitions are partially tolerated by attempting to communicate to potentially dead nodes through multiple routes.

Index

Constants

View Source
const (
	ProtocolVersionMin uint8 = 1
	ProtocolVersionMax       = 2
)

This is the minimum and maximum protocol version that we can _understand_. We're allowed to speak at any version within this range. This range is inclusive.

View Source
const (
	MetaMaxSize = 512 // Maximum size for node meta data

)

Variables

This section is empty.

Functions

This section is empty.

Types

type Broadcast

type Broadcast interface {
	// Invalidates checks if enqueuing the current broadcast
	// invalidates a previous broadcast
	Invalidates(b Broadcast) bool

	// Returns a byte form of the message
	Message() []byte

	// Finished is invoked when the message will no longer
	// be broadcast, either due to invalidation or to the
	// transmit limit being reached
	Finished()
}

Broadcast is something that can be broadcasted via gossip to the memberlist cluster.

type ChannelEventDelegate

type ChannelEventDelegate struct {
	Ch chan<- NodeEvent
}

ChannelEventDelegate is used to enable an application to receive events about joins and leaves over a channel instead of a direct function call.

Care must be taken that events are processed in a timely manner from the channel, since this delegate will block until an event can be sent.

func (*ChannelEventDelegate) NotifyJoin

func (c *ChannelEventDelegate) NotifyJoin(n *Node)

func (*ChannelEventDelegate) NotifyLeave

func (c *ChannelEventDelegate) NotifyLeave(n *Node)

func (*ChannelEventDelegate) NotifyUpdate

func (c *ChannelEventDelegate) NotifyUpdate(n *Node)

type Config

type Config struct {
	// The name of this node. This must be unique in the cluster.
	Name string

	// Configuration related to what address to bind to and ports to
	// listen on. The port is used for both UDP and TCP gossip.
	// It is assumed other nodes are running on this port, but they
	// do not need to.
	BindAddr string
	BindPort int

	// Configuration related to what address to advertise to other
	// cluster members. Used for nat traversal.
	AdvertiseAddr string
	AdvertisePort int

	// ProtocolVersion is the configured protocol version that we
	// will _speak_. This must be between ProtocolVersionMin and
	// ProtocolVersionMax.
	ProtocolVersion uint8

	// TCPTimeout is the timeout for establishing a TCP connection with
	// a remote node for a full state sync.
	TCPTimeout time.Duration

	// IndirectChecks is the number of nodes that will be asked to perform
	// an indirect probe of a node in the case a direct probe fails. Memberlist
	// waits for an ack from any single indirect node, so increasing this
	// number will increase the likelihood that an indirect probe will succeed
	// at the expense of bandwidth.
	IndirectChecks int

	// RetransmitMult is the multiplier for the number of retransmissions
	// that are attempted for messages broadcasted over gossip. The actual
	// count of retransmissions is calculated using the formula:
	//
	//   Retransmits = RetransmitMult * log(N+1)
	//
	// This allows the retransmits to scale properly with cluster size. The
	// higher the multiplier, the more likely a failed broadcast is to converge
	// at the expense of increased bandwidth.
	RetransmitMult int

	// SuspicionMult is the multiplier for determining the time an
	// inaccessible node is considered suspect before declaring it dead.
	// The actual timeout is calculated using the formula:
	//
	//   SuspicionTimeout = SuspicionMult * log(N+1) * ProbeInterval
	//
	// This allows the timeout to scale properly with expected propagation
	// delay with a larger cluster size. The higher the multiplier, the longer
	// an inaccessible node is considered part of the cluster before declaring
	// it dead, giving that suspect node more time to refute if it is indeed
	// still alive.
	SuspicionMult int

	// PushPullInterval is the interval between complete state syncs.
	// Complete state syncs are done with a single node over TCP and are
	// quite expensive relative to standard gossiped messages. Setting this
	// to zero will disable state push/pull syncs completely.
	//
	// Setting this interval lower (more frequent) will increase convergence
	// speeds across larger clusters at the expense of increased bandwidth
	// usage.
	PushPullInterval time.Duration

	// ProbeInterval and ProbeTimeout are used to configure probing
	// behavior for memberlist.
	//
	// ProbeInterval is the interval between random node probes. Setting
	// this lower (more frequent) will cause the memberlist cluster to detect
	// failed nodes more quickly at the expense of increased bandwidth usage.
	//
	// ProbeTimeout is the timeout to wait for an ack from a probed node
	// before assuming it is unhealthy. This should be set to 99-percentile
	// of RTT (round-trip time) on your network.
	ProbeInterval time.Duration
	ProbeTimeout  time.Duration

	// GossipInterval and GossipNodes are used to configure the gossip
	// behavior of memberlist.
	//
	// GossipInterval is the interval between sending messages that need
	// to be gossiped that haven't been able to piggyback on probing messages.
	// If this is set to zero, non-piggyback gossip is disabled. By lowering
	// this value (more frequent) gossip messages are propagated across
	// the cluster more quickly at the expense of increased bandwidth.
	//
	// GossipNodes is the number of random nodes to send gossip messages to
	// per GossipInterval. Increasing this number causes the gossip messages
	// to propagate across the cluster more quickly at the expense of
	// increased bandwidth.
	GossipInterval time.Duration
	GossipNodes    int

	// EnableCompression is used to control message compression. This can
	// be used to reduce bandwidth usage at the cost of slightly more CPU
	// utilization. This is only available starting at protocol version 1.
	EnableCompression bool

	// SecretKey is used to initialize the primary encryption key in a keyring.
	// The primary encryption key is the only key used to encrypt messages and
	// the first key used while attempting to decrypt messages. Providing a
	// value for this primary key will enable message-level encryption and
	// verification, and automatically install the key onto the keyring.
	SecretKey []byte

	// The keyring holds all of the encryption keys used internally. It is
	// automatically initialized using the SecretKey and SecretKeys values.
	Keyring *Keyring

	// Delegate and Events are delegates for receiving and providing
	// data to memberlist via callback mechanisms. For Delegate, see
	// the Delegate interface. For Events, see the EventDelegate interface.
	//
	// The DelegateProtocolMin/Max are used to guarantee protocol-compatibility
	// for any custom messages that the delegate might do (broadcasts,
	// local/remote state, etc.). If you don't set these, then the protocol
	// versions will just be zero, and version compliance won't be done.
	Delegate                Delegate
	DelegateProtocolVersion uint8
	DelegateProtocolMin     uint8
	DelegateProtocolMax     uint8
	Events                  EventDelegate
	Conflict                ConflictDelegate
	Merge                   MergeDelegate

	// LogOutput is the writer where logs should be sent. If this is not
	// set, logging will go to stderr by default.
	LogOutput io.Writer
}

func DefaultLANConfig

func DefaultLANConfig() *Config

DefaultLANConfig returns a sane set of configurations for Memberlist. It uses the hostname as the node name, and otherwise sets very conservative values that are sane for most LAN environments. The default configuration errs on the side on the side of caution, choosing values that are optimized for higher convergence at the cost of higher bandwidth usage. Regardless, these values are a good starting point when getting started with memberlist.

func DefaultLocalConfig

func DefaultLocalConfig() *Config

DefaultLocalConfig works like DefaultConfig, however it returns a configuration that is optimized for a local loopback environments. The default configuration is still very conservative and errs on the side of caution.

func DefaultWANConfig

func DefaultWANConfig() *Config

DefaultWANConfig works like DefaultConfig, however it returns a configuration that is optimized for most WAN environments. The default configuration is still very conservative and errs on the side of caution.

func (*Config) EncryptionEnabled

func (c *Config) EncryptionEnabled() bool

Returns whether or not encryption is enabled

type ConflictDelegate

type ConflictDelegate interface {
	// NotifyConflict is invoked when a name conflict is detected
	NotifyConflict(existing, other *Node)
}

ConflictDelegate is a used to inform a client that a node has attempted to join which would result in a name conflict. This happens if two clients are configured with the same name but different addresses.

type Delegate

type Delegate interface {
	// NodeMeta is used to retrieve meta-data about the current node
	// when broadcasting an alive message. It's length is limited to
	// the given byte size. This metadata is available in the Node structure.
	NodeMeta(limit int) []byte

	// NotifyMsg is called when a user-data message is received.
	// Care should be taken that this method does not block, since doing
	// so would block the entire UDP packet receive loop. Additionally, the byte
	// slice may be modified after the call returns, so it should be copied if needed.
	NotifyMsg([]byte)

	// GetBroadcasts is called when user data messages can be broadcast.
	// It can return a list of buffers to send. Each buffer should assume an
	// overhead as provided with a limit on the total byte size allowed.
	// The total byte size of the resulting data to send must not exceed
	// the limit.
	GetBroadcasts(overhead, limit int) [][]byte

	// LocalState is used for a TCP Push/Pull. This is sent to
	// the remote side in addition to the membership information. Any
	// data can be sent here. See MergeRemoteState as well. The `join`
	// boolean indicates this is for a join instead of a push/pull.
	LocalState(join bool) []byte

	// MergeRemoteState is invoked after a TCP Push/Pull. This is the
	// state received from the remote side and is the result of the
	// remote side's LocalState call. The 'join'
	// boolean indicates this is for a join instead of a push/pull.
	MergeRemoteState(buf []byte, join bool)
}

Delegate is the interface that clients must implement if they want to hook into the gossip layer of Memberlist. All the methods must be thread-safe, as they can and generally will be called concurrently.

type EventDelegate

type EventDelegate interface {
	// NotifyJoin is invoked when a node is detected to have joined.
	// The Node argument must not be modified.
	NotifyJoin(*Node)

	// NotifyLeave is invoked when a node is detected to have left.
	// The Node argument must not be modified.
	NotifyLeave(*Node)

	// NotifyUpdate is invoked when a node is detected to have
	// updated, usually involving the meta data. The Node argument
	// must not be modified.
	NotifyUpdate(*Node)
}

EventDelegate is a simpler delegate that is used only to receive notifications about members joining and leaving. The methods in this delegate may be called by multiple goroutines, but never concurrently. This allows you to reason about ordering.

type Keyring

type Keyring struct {
	// contains filtered or unexported fields
}

func NewKeyring

func NewKeyring(keys [][]byte, primaryKey []byte) (*Keyring, error)

NewKeyring constructs a new container for a set of encryption keys. The keyring contains all key data used internally by memberlist.

While creating a new keyring, you must do one of:

  • Omit keys and primary key, effectively disabling encryption
  • Pass a set of keys plus the primary key
  • Pass only a primary key

If only a primary key is passed, then it will be automatically added to the keyring. If creating a keyring with multiple keys, one key must be designated primary by passing it as the primaryKey. If the primaryKey does not exist in the list of secondary keys, it will be automatically added at position 0.

func (*Keyring) AddKey

func (k *Keyring) AddKey(key []byte) error

AddKey will install a new key on the ring. Adding a key to the ring will make it available for use in decryption. If the key already exists on the ring, this function will just return noop.

func (*Keyring) GetKeys

func (k *Keyring) GetKeys() [][]byte

GetKeys returns the current set of keys on the ring.

func (*Keyring) GetPrimaryKey

func (k *Keyring) GetPrimaryKey() (key []byte)

GetPrimaryKey returns the key on the ring at position 0. This is the key used for encrypting messages, and is the first key tried for decrypting messages.

func (*Keyring) RemoveKey

func (k *Keyring) RemoveKey(key []byte) error

RemoveKey drops a key from the keyring. This will return an error if the key requested for removal is currently at position 0 (primary key).

func (*Keyring) UseKey

func (k *Keyring) UseKey(key []byte) error

UseKey changes the key used to encrypt messages. This is the only key used to encrypt messages, so peers should know this key before this method is called.

type Memberlist

type Memberlist struct {
	// contains filtered or unexported fields
}

func Create

func Create(conf *Config) (*Memberlist, error)

Create will create a new Memberlist using the given configuration. This will not connect to any other node (see Join) yet, but will start all the listeners to allow other nodes to join this memberlist. After creating a Memberlist, the configuration given should not be modified by the user anymore.

func (*Memberlist) Join

func (m *Memberlist) Join(existing []string) (int, error)

Join is used to take an existing Memberlist and attempt to join a cluster by contacting all the given hosts and performing a state sync. Initially, the Memberlist only contains our own state, so doing this will cause remote nodes to become aware of the existence of this node, effectively joining the cluster.

This returns the number of hosts successfully contacted and an error if none could be reached. If an error is returned, the node did not successfully join the cluster.

func (*Memberlist) Leave

func (m *Memberlist) Leave(timeout time.Duration) error

Leave will broadcast a leave message but will not shutdown the background listeners, meaning the node will continue participating in gossip and state updates.

This will block until the leave message is successfully broadcasted to a member of the cluster, if any exist or until a specified timeout is reached.

This method is safe to call multiple times, but must not be called after the cluster is already shut down.

func (*Memberlist) LocalNode

func (m *Memberlist) LocalNode() *Node

LocalNode is used to return the local Node

func (*Memberlist) Members

func (m *Memberlist) Members() []*Node

Members returns a list of all known live nodes. The node structures returned must not be modified. If you wish to modify a Node, make a copy first.

func (*Memberlist) NumMembers

func (m *Memberlist) NumMembers() (alive int)

NumMembers returns the number of alive nodes currently known. Between the time of calling this and calling Members, the number of alive nodes may have changed, so this shouldn't be used to determine how many members will be returned by Members.

func (*Memberlist) ProtocolVersion

func (m *Memberlist) ProtocolVersion() uint8

ProtocolVersion returns the protocol version currently in use by this memberlist.

func (*Memberlist) SendTo

func (m *Memberlist) SendTo(to net.Addr, msg []byte) error

SendTo is used to directly send a message to another node, without the use of the gossip mechanism. This will encode the message as a user-data message, which a delegate will receive through NotifyMsg The actual data is transmitted over UDP, which means this is a best-effort transmission mechanism, and the maximum size of the message is the size of a single UDP datagram, after compression

func (*Memberlist) Shutdown

func (m *Memberlist) Shutdown() error

Shutdown will stop any background maintanence of network activity for this memberlist, causing it to appear "dead". A leave message will not be broadcasted prior, so the cluster being left will have to detect this node's shutdown using probing. If you wish to more gracefully exit the cluster, call Leave prior to shutting down.

This method is safe to call multiple times.

func (*Memberlist) UpdateNode

func (m *Memberlist) UpdateNode(timeout time.Duration) error

UpdateNode is used to trigger re-advertising the local node. This is primarily used with a Delegate to support dynamic updates to the local meta data. This will block until the update message is successfully broadcasted to a member of the cluster, if any exist or until a specified timeout is reached.

type MergeDelegate

type MergeDelegate interface {
	// NotifyMerge is invoked when a merge could take place.
	// Provides a list of the nodes known by the peer.
	NotifyMerge(peers []*Node) (cancel bool)
}

MergeDelegate is used to involve a client in a potential cluster merge operation. Namely, when a node does a TCP push/pull (as part of a join), the delegate is involved and allowed to cancel the join based on custom logic. The merge delegate is NOT invoked as part of the push-pull anti-entropy.

type Node

type Node struct {
	Name string
	Addr net.IP
	Port uint16
	Meta []byte // Metadata from the delegate for this node.
	PMin uint8  // Minimum protocol version this understands
	PMax uint8  // Maximum protocol version this understands
	PCur uint8  // Current version node is speaking
	DMin uint8  // Min protocol version for the delegate to understand
	DMax uint8  // Max protocol version for the delegate to understand
	DCur uint8  // Current version delegate is speaking
}

Node represents a node in the cluster.

type NodeEvent

type NodeEvent struct {
	Event NodeEventType
	Node  *Node
}

NodeEvent is a single event related to node activity in the memberlist. The Node member of this struct must not be directly modified. It is passed as a pointer to avoid unnecessary copies. If you wish to modify the node, make a copy first.

type NodeEventType

type NodeEventType int

NodeEventType are the types of events that can be sent from the ChannelEventDelegate.

const (
	NodeJoin NodeEventType = iota
	NodeLeave
	NodeUpdate
)

type TransmitLimitedQueue

type TransmitLimitedQueue struct {
	// NumNodes returns the number of nodes in the cluster. This is
	// used to determine the retransmit count, which is calculated
	// based on the log of this.
	NumNodes func() int

	// RetransmitMult is the multiplier used to determine the maximum
	// number of retransmissions attempted.
	RetransmitMult int

	sync.Mutex
	// contains filtered or unexported fields
}

TransmitLimitedQueue is used to queue messages to broadcast to the cluster (via gossip) but limits the number of transmits per message. It also prioritizes messages with lower transmit counts (hence newer messages).

func (*TransmitLimitedQueue) GetBroadcasts

func (q *TransmitLimitedQueue) GetBroadcasts(overhead, limit int) [][]byte

GetBroadcasts is used to get a number of broadcasts, up to a byte limit and applying a per-message overhead as provided.

func (*TransmitLimitedQueue) NumQueued

func (q *TransmitLimitedQueue) NumQueued() int

NumQueued returns the number of queued messages

func (*TransmitLimitedQueue) Prune

func (q *TransmitLimitedQueue) Prune(maxRetain int)

Prune will retain the maxRetain latest messages, and the rest will be discarded. This can be used to prevent unbounded queue sizes

func (*TransmitLimitedQueue) QueueBroadcast

func (q *TransmitLimitedQueue) QueueBroadcast(b Broadcast)

QueueBroadcast is used to enqueue a broadcast

func (*TransmitLimitedQueue) Reset

func (q *TransmitLimitedQueue) Reset()

Reset clears all the queued messages

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL