splitstore

package
v1.26.3
Warning

This package is not in the latest version of its module.

Published: Apr 23, 2024 License: Apache-2.0, MIT Imports: 40 Imported by: 1

README

SplitStore: An actively scalable blockstore for the Filecoin chain

The SplitStore was first introduced in lotus v1.5.1, as an experiment in reducing the performance impact of large blockstores.

With lotus v1.11.1, we introduce the next iteration in design and implementation, which we call SplitStore v1.

The new design (see #6474) evolves the splitstore into a freestanding compacting blockstore that allows us to keep a small (60-100GB) working set in a hot blockstore and reliably archive out-of-scope objects in a coldstore. The coldstore can be either a discard store, whereby out-of-scope objects are discarded, or a regular badger blockstore (the default), which can be periodically garbage collected according to configurable user retention policies.

To enable the splitstore, edit .lotus/config.toml and add the following:

[Chainstore]
  EnableSplitstore = true

If you intend to use the discard coldstore, you also need to add the following:

  [Chainstore.Splitstore]
    ColdStoreType = "discard"

In general you should not have to use the discard store, unless you are running a network assistive node (like a bootstrapper or booster) or have very constrained hardware with not enough disk space to maintain a coldstore, even with garbage collection. It is also appropriate for small nodes that are simply watching the chain.

Warning: Using the discard store for a general purpose node is discouraged, unless you really know what you are doing. Use it at your own risk.

Configuration Options

These are options in the [Chainstore.Splitstore] section of the configuration:

  • HotStoreType -- specifies the type of hotstore to use. The only currently supported option is "badger".
  • ColdStoreType -- specifies the type of coldstore to use. The default value is "universal", which will use the initial monolith blockstore as the coldstore. The other possible value is "discard", as outlined above, which is specialized for running without a coldstore. Note that the discard store wraps the initial monolith blockstore and discards writes; this is necessary to support syncing from a snapshot.
  • MarkSetType -- specifies the type of markset to use during compaction. The markset is the data structure used by compaction/gc to track live objects. The default value is "badger", which will use a disk-backed markset using badger. If you have a lot of memory (48G or more) you can also use "map", which will use an in-memory markset, speeding up compaction at the cost of higher memory usage. Note: If you are using a VPS with a network volume, you need to provision at least 3000 IOPS with the badger markset.
  • HotStoreMessageRetention -- specifies how many finalities, beyond the 4 finalities maintained by default, to maintain messages and message receipts in the hotstore. This is useful for assistive nodes that want to support syncing for other nodes beyond 4 finalities, while running with the discard coldstore option. It is also useful for miners who accept deals and need to look back at messages beyond the 4 finalities, which would otherwise hit the coldstore.
  • HotStoreFullGCFrequency -- specifies how frequently to garbage collect the hotstore using full (moving) GC. The default value is 20, which uses full GC every 20 compactions (about once a week); set to 0 to disable full GC altogether. Rationale: badger supports online GC, and this is used by default. However, it has proven to be ineffective in practice, with the hotstore size slowly creeping up. In order to address this, we have added moving GC support in our badger wrapper, which can effectively reclaim all space. The downside is that it takes a bit longer to perform a moving GC and you also need enough space to house the new hotstore while the old one is still live.
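As an illustration of the HotStoreFullGCFrequency semantics described above, the following minimal sketch decides whether a given compaction would use full (moving) GC. The function name and the modulo rule are assumptions made for illustration; they are not taken from the splitstore source.

```go
package main

import "fmt"

// shouldRunFullGC reports whether the compaction with the given 1-based
// count would use full (moving) GC.
// Assumption: full GC runs whenever the compaction count is a multiple of
// the configured frequency; a frequency of 0 disables full GC.
func shouldRunFullGC(compactionCount, frequency uint64) bool {
	if frequency == 0 {
		return false
	}
	return compactionCount%frequency == 0
}

func main() {
	// With the default frequency of 20, only every 20th compaction is a full GC.
	for _, n := range []uint64{1, 19, 20, 40} {
		fmt.Printf("compaction %d: full GC = %v\n", n, shouldRunFullGC(n, 20))
	}
}
```

With a frequency of 1 every compaction would perform a full GC, which matches the description of the corresponding Config field.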

Operation

When the splitstore is first enabled, the existing blockstore becomes the coldstore and a fresh hotstore is initialized.

The hotstore is warmed up on first startup so as to load all chain headers and state roots in the current head. This allows us to immediately gain the performance benefits of a smaller blockstore, which can be substantial for full archival nodes.

All new writes are directed to the hotstore, while reads first hit the hotstore, with fallback to the coldstore.
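The read and write routing can be pictured with a small, self-contained sketch. It uses plain maps keyed by string instead of real blockstores and CIDs, and the tieredStore type and its methods are hypothetical names chosen for this example only.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is returned when neither store holds the requested key.
var errNotFound = errors.New("block not found")

// tieredStore is a hypothetical stand-in for the hot/cold routing;
// it is not the splitstore implementation.
type tieredStore struct {
	hot  map[string][]byte
	cold map[string][]byte
}

func newTieredStore() *tieredStore {
	return &tieredStore{hot: map[string][]byte{}, cold: map[string][]byte{}}
}

// Put directs every new write to the hotstore.
func (s *tieredStore) Put(key string, data []byte) {
	s.hot[key] = data
}

// Get reads from the hotstore first and falls back to the coldstore.
func (s *tieredStore) Get(key string) ([]byte, error) {
	if data, ok := s.hot[key]; ok {
		return data, nil
	}
	if data, ok := s.cold[key]; ok {
		return data, nil
	}
	return nil, errNotFound
}

func main() {
	s := newTieredStore()
	s.cold["old-state"] = []byte("archived") // pre-existing blockstore content
	s.Put("new-block", []byte("fresh"))

	for _, key := range []string{"new-block", "old-state", "missing"} {
		data, err := s.Get(key)
		fmt.Printf("%s: %q, err=%v\n", key, data, err)
	}
}
```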

Once 5 finalities have elapsed, and every finality henceforth, the blockstore compacts. Compaction is the process of moving all unreachable objects within the last 4 finalities from the hotstore to the coldstore. If the system is configured with a discard coldstore, these objects are discarded. Note that chain headers, all the way to genesis, are considered reachable. State roots and messages are considered reachable only within the last 4 finalities, unless there is a live reference to them.
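The timing can be worked through with a little arithmetic. The sketch below assumes a finality of 900 epochs and a strict "greater than" comparison against the threshold; both are assumptions for illustration, as are the function names.

```go
package main

import "fmt"

// finality is the assumed number of epochs in one finality.
const finality = 900

const (
	compactionThreshold = 5 * finality // epochs since the last compacted epoch
	compactionBoundary  = 4 * finality // epochs kept hot behind the current epoch
)

// shouldCompact reports whether enough epochs have elapsed since baseEpoch,
// the epoch up to which the store was last compacted.
// Assumption: the comparison is strict.
func shouldCompact(currentEpoch, baseEpoch int64) bool {
	return currentEpoch-baseEpoch > compactionThreshold
}

// boundaryEpoch returns the epoch below which state roots and messages are
// no longer considered reachable by the chain walk.
func boundaryEpoch(currentEpoch int64) int64 {
	return currentEpoch - compactionBoundary
}

func main() {
	base := int64(0)
	for _, epoch := range []int64{4500, 4501, 5401, 5402} {
		if shouldCompact(epoch, base) {
			base = boundaryEpoch(epoch) // the boundary becomes the new base
			fmt.Printf("epoch %d: compact, new base %d\n", epoch, base)
		} else {
			fmt.Printf("epoch %d: wait\n", epoch)
		}
	}
}
```

Because the boundary sits 4 finalities behind the current epoch and the threshold is 5 finalities, each compaction leaves the base one finality short of the next trigger, which is why compaction repeats roughly once per finality.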

Compaction

Compaction works transactionally with the following algorithm:

  • We prepare a transaction, whereby all objects referenced by I/O through the API are tracked.
  • We walk the chain and mark reachable objects, keeping 4 finalities of state roots and messages and all headers all the way to genesis.
  • Once the chain walk is complete, we begin full transaction protection with concurrent marking; we walk and mark all references created during the chain walk. At the same time, all I/O through the API concurrently marks objects as live references.
  • We collect cold objects by iterating through the hotstore and checking the mark set; if an object is not marked, then it is a candidate for purge.
  • When running with a coldstore, we next copy all cold objects to the coldstore.
  • At this point we are ready to begin purging:
    • We sort cold objects heaviest first, so as to never delete the constituents of a DAG before the DAG itself (which would leave dangling references)
    • We delete in small batches taking a lock; each batch is checked again for marks, from the concurrent transactional mark, so as to never delete anything live
  • We then end the transaction and compact/gc the hotstore.
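The steps above can be sketched as a toy mark-and-sweep over in-memory maps. This is a simplified illustration under stated assumptions: string keys replace CIDs, concurrent API access is simulated by a list of keys touched during compaction, and sorting, batching and locking are omitted. All names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// object is a toy block: a payload plus links to child keys.
type object struct {
	data  string
	links []string
}

// mark walks the DAG from root and records every reachable key.
func mark(store map[string]object, root string, marks map[string]bool) {
	if marks[root] {
		return
	}
	obj, ok := store[root]
	if !ok {
		return // not in this store; nothing to walk
	}
	marks[root] = true
	for _, l := range obj.links {
		mark(store, l, marks)
	}
}

// compact copies unmarked objects from hot to cold, then purges those that
// are still unmarked after the simulated concurrent marks are applied.
// It returns the purged keys.
func compact(hot, cold map[string]object, roots, touched []string) []string {
	marks := make(map[string]bool)
	for _, r := range roots {
		mark(hot, r, marks) // chain walk
	}

	var candidates []string
	for k := range hot {
		if !marks[k] {
			candidates = append(candidates, k) // cold object
		}
	}
	sort.Strings(candidates) // deterministic order for this example only

	for _, k := range candidates {
		cold[k] = hot[k] // copy before purging
	}

	for _, k := range touched {
		mark(hot, k, marks) // live references created during compaction
	}

	var purged []string
	for _, k := range candidates {
		if marks[k] {
			continue // re-check: became live, keep it hot
		}
		delete(hot, k)
		purged = append(purged, k)
	}
	return purged
}

func main() {
	hot := map[string]object{
		"head":      {data: "header", links: []string{"state"}},
		"state":     {data: "state root", links: []string{"leaf"}},
		"leaf":      {data: "actor"},
		"old-state": {data: "old state root", links: []string{"old-leaf"}},
		"old-leaf":  {data: "old actor"},
		"orphan":    {data: "unreferenced"},
	}
	cold := map[string]object{}
	purged := compact(hot, cold, []string{"head"}, []string{"old-state"})
	fmt.Println("purged:", purged, "hot:", len(hot), "cold:", len(cold))
}
```

In this sketch an object touched during compaction is still copied to the coldstore but is not purged from the hotstore, which mirrors the re-check performed on each batch before deletion.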

As of #8008 the compaction algorithm has been modified to eliminate sorting and maintain the cold object set on disk. This drastically reduces memory usage; in fact, when using badger as the markset, compaction uses very little memory, and it should now be possible to run splitstore with 32GB of RAM or less without danger of running out of memory during compaction.

Garbage Collection

TBD -- see #6577

Utilities

lotus-shed has a splitstore command which provides some utilities:

  • rollback -- rolls back a splitstore installation. This command copies the hotstore on top of the coldstore, and then deletes the splitstore directory and associated metadata keys. It can also optionally compact/gc the coldstore after the copy (with the --gc-coldstore flag) and automatically rewrite the lotus config to disable splitstore (with the --rewrite-config flag). Note: the node must be stopped before running this command.
  • clear -- clears a splitstore installation for restart from snapshot.
  • check -- asynchronously runs a basic healthcheck on the splitstore. The results are appended to <lotus-repo>/datastore/splitstore/check.txt.
  • info -- prints some basic information about the splitstore.

Documentation

Constants

const (
	// Fraction of garbage in badger vlog for online GC traversal to collect garbage
	AggressiveOnlineGCThreshold = 0.0001
)

Variables

var (
	// CompactionThreshold is the number of epochs that need to have elapsed
	// from the previously compacted epoch to trigger a new compaction.
	//
	//        |················· CompactionThreshold ··················|
	//        |                                             |
	// =======‖≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡‖------------------------»
	//        |                    |  chain -->             ↑__ current epoch
	//        | archived epochs ___↑
	//                             ↑________ CompactionBoundary
	//
	// === :: cold (already archived)
	// ≡≡≡ :: to be archived in this compaction
	// --- :: hot
	CompactionThreshold = 5 * build.Finality

	// CompactionBoundary is the number of epochs from the current epoch at which
	// we will walk the chain for live objects.
	CompactionBoundary = 4 * build.Finality

	// SyncGapTime is the time delay from a tipset's min timestamp before we decide
	// there is a sync gap
	SyncGapTime = time.Minute

	// SyncWaitTime is the time delay from a tipset's min timestamp before we decide
	// we have synced.
	SyncWaitTime = 30 * time.Second

	// This is a testing flag that should always be true when running a node. itests rely on the rough hack
	// of starting genesis so far in the past that they exercise catchup mining to mine
	// blocks quickly and so disabling syncgap checking is necessary to test compaction
	// without a deep structural improvement of itests.
	CheckSyncGap = true
)
var (
	// PruneOnlineGC is a prune option that instructs PruneChain to use online gc for reclaiming space;
	// there is no value associated with this option.
	PruneOnlineGC = "splitstore.PruneOnlineGC"

	// PruneMovingGC is a prune option that instructs PruneChain to use moving gc for reclaiming space;
	// the value associated with this option is the path of the new coldstore.
	PruneMovingGC = "splitstore.PruneMovingGC"

	// PruneRetainState is a prune option that instructs PruneChain as to how many finalities worth
	// of state to retain in the coldstore.
	// The value is an integer:
	// - if it is -1 then all state objects reachable from the chain will be retained in the coldstore.
	//   this is useful for garbage collecting side-chains and other garbage in archival nodes.
	//   This is the (safe) default.
	// - if it is 0 then no state objects that are unreachable within the compaction boundary will
	//   be retained in the coldstore.
	// - if it is a positive integer, then it's the number of finalities past the compaction boundary
	//   for which chain-reachable state objects are retained.
	PruneRetainState = "splitstore.PruneRetainState"

	// PruneThreshold is the number of epochs that need to have elapsed
	// from the previously pruned epoch to trigger a new prune
	PruneThreshold = 7 * build.Finality
)
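One way to read the PruneRetainState values is as an epoch cut-off for chain-reachable state in the coldstore. The sketch below is a hypothetical interpretation of the comment above, assuming a finality of 900 epochs and a compaction boundary of 4 finalities; rejecting values below -1 is a choice made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	finality           = 900          // assumed epochs per finality
	compactionBoundary = 4 * finality // as in CompactionBoundary
)

// retainFrom interprets a PruneRetainState-style value. It returns the oldest
// epoch whose chain-reachable state would be retained in the coldstore, or
// retainAll=true when everything reachable from the chain is kept.
func retainFrom(currentEpoch, retainState int64) (epoch int64, retainAll bool, err error) {
	switch {
	case retainState < -1:
		return 0, false, errors.New("retain state must be -1, 0, or positive")
	case retainState == -1:
		return 0, true, nil // the (safe) default: keep all chain-reachable state
	}
	// 0 keeps nothing past the compaction boundary; n > 0 keeps n more finalities.
	epoch = currentEpoch - compactionBoundary - retainState*finality
	if epoch < 0 {
		epoch = 0 // cannot reach before genesis
	}
	return epoch, false, nil
}

func main() {
	for _, rs := range []int64{-1, 0, 2} {
		epoch, all, err := retainFrom(10000, rs)
		fmt.Printf("retain=%d: epoch=%d all=%v err=%v\n", rs, epoch, all, err)
	}
}
```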
var (
	ReifyLimit = 16384
)
var (
	// WarmupBoundary is the number of epochs to load state during warmup.
	WarmupBoundary = build.Finality
)

Functions

This section is empty.

Types

type BadgerMarkSet added in v1.11.1

type BadgerMarkSet struct {
	// contains filtered or unexported fields
}

func (*BadgerMarkSet) BeginCriticalSection added in v1.14.0

func (s *BadgerMarkSet) BeginCriticalSection() error

func (*BadgerMarkSet) Close added in v1.11.1

func (s *BadgerMarkSet) Close() error

func (*BadgerMarkSet) EndCriticalSection added in v1.14.0

func (s *BadgerMarkSet) EndCriticalSection()

func (*BadgerMarkSet) Has added in v1.11.1

func (s *BadgerMarkSet) Has(c cid.Cid) (bool, error)

func (*BadgerMarkSet) Mark added in v1.11.1

func (s *BadgerMarkSet) Mark(c cid.Cid) error

func (*BadgerMarkSet) MarkMany added in v1.14.0

func (s *BadgerMarkSet) MarkMany(batch []cid.Cid) error

func (*BadgerMarkSet) Visit added in v1.11.2

func (s *BadgerMarkSet) Visit(c cid.Cid) (bool, error)

type BadgerMarkSetEnv added in v1.11.1

type BadgerMarkSetEnv struct {
	// contains filtered or unexported fields
}

func (*BadgerMarkSetEnv) Close added in v1.11.1

func (e *BadgerMarkSetEnv) Close() error

func (*BadgerMarkSetEnv) New added in v1.14.0

func (e *BadgerMarkSetEnv) New(name string, sizeHint int64) (MarkSet, error)

func (*BadgerMarkSetEnv) Recover added in v1.14.0

func (e *BadgerMarkSetEnv) Recover(name string) (MarkSet, error)

type ChainAccessor

type ChainAccessor interface {
	GetTipsetByHeight(context.Context, abi.ChainEpoch, *types.TipSet, bool) (*types.TipSet, error)
	GetHeaviestTipSet() *types.TipSet
	SubscribeHeadChanges(change func(revert []*types.TipSet, apply []*types.TipSet) error)
}

ChainAccessor allows the Splitstore to access the chain. It will most likely be a ChainStore at runtime.

type Checkpoint added in v1.14.0

type Checkpoint struct {
	// contains filtered or unexported fields
}

func NewCheckpoint added in v1.14.0

func NewCheckpoint(path string) (*Checkpoint, error)

func OpenCheckpoint added in v1.14.0

func OpenCheckpoint(path string) (*Checkpoint, cid.Cid, error)

func (*Checkpoint) Close added in v1.14.0

func (cp *Checkpoint) Close() error

func (*Checkpoint) Set added in v1.14.0

func (cp *Checkpoint) Set(c cid.Cid) error

type ColdSetReader added in v1.14.0

type ColdSetReader struct {
	// contains filtered or unexported fields
}

func NewColdSetReader added in v1.14.0

func NewColdSetReader(path string) (*ColdSetReader, error)

func (*ColdSetReader) Close added in v1.14.0

func (s *ColdSetReader) Close() error

func (*ColdSetReader) ForEach added in v1.14.0

func (s *ColdSetReader) ForEach(f func(cid.Cid) error) error

func (*ColdSetReader) Reset added in v1.14.0

func (s *ColdSetReader) Reset() error

type ColdSetWriter added in v1.14.0

type ColdSetWriter struct {
	// contains filtered or unexported fields
}

func NewColdSetWriter added in v1.14.0

func NewColdSetWriter(path string) (*ColdSetWriter, error)

func (*ColdSetWriter) Close added in v1.14.0

func (s *ColdSetWriter) Close() error

func (*ColdSetWriter) Write added in v1.14.0

func (s *ColdSetWriter) Write(c cid.Cid) error

type CompactType added in v1.17.1

type CompactType int

type Config

type Config struct {
	// MarkSetType is the type of mark set to use.
	//
	// The default value is "map", which uses an in-memory map-backed markset.
	// If you are constrained in memory (i.e. compaction runs out of memory), you
	// can use "badger", which will use a disk-backed markset using badger.
	// Note that compaction will take quite a bit longer when using the "badger" option,
	// but that shouldn't really matter (as long as it is under 7.5hrs).
	MarkSetType string

	// DiscardColdBlocks indicates whether to skip moving cold blocks to the coldstore.
	// If the splitstore is running with a noop coldstore then this option is set to true
	// which skips moving (as it is a noop, but still takes time to read all the cold objects)
	// and directly purges cold blocks.
	DiscardColdBlocks bool

	// UniversalColdBlocks indicates whether all blocks being garbage collected and purged
	// from the hotstore should be written to the cold store
	UniversalColdBlocks bool

	// HotStoreMessageRetention indicates the hotstore retention policy for messages.
	// It has the following semantics:
	// - a value of 0 will only retain messages within the compaction boundary (4 finalities)
	// - a positive integer indicates the number of finalities, outside the compaction boundary,
	//   for which messages will be retained in the hotstore.
	HotStoreMessageRetention uint64

	// HotStoreFullGCFrequency indicates how frequently (in terms of compactions) to garbage collect
	// the hotstore using full (moving) GC if supported by the hotstore.
	// A value of 0 disables full GC entirely.
	// A positive value is the number of compactions before a full GC is performed;
	// a value of 1 will perform full GC in every compaction.
	HotStoreFullGCFrequency uint64

	// HotstoreMaxSpaceTarget suggests the max allowed space the hotstore can take.
	// This is not a hard limit, it is possible for the hotstore to exceed the target
	// for example if state grows massively between compactions. The splitstore
	// will make a best effort to avoid overflowing the target and in practice should
	// never overflow.  This field is used when doing GC at the end of a compaction to
	// adaptively choose moving GC
	HotstoreMaxSpaceTarget uint64

	// Moving GC will be triggered when total moving size exceeds
	// HotstoreMaxSpaceTarget - HotstoreMaxSpaceThreshold
	HotstoreMaxSpaceThreshold uint64

	// Safety buffer to prevent moving GC from overflowing disk.
	// Moving GC will not occur when total moving size exceeds
	// HotstoreMaxSpaceTarget - HotstoreMaxSpaceSafetyBuffer
	HotstoreMaxSpaceSafetyBuffer uint64
}
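The interaction of HotstoreMaxSpaceTarget, HotstoreMaxSpaceThreshold and HotstoreMaxSpaceSafetyBuffer can be read as a window in which moving GC is chosen. The following sketch encodes that reading of the field comments; the function, the gcMode type and the sample sizes are hypothetical, and how the "total moving size" is estimated is left as an input.

```go
package main

import "fmt"

// gcMode is a hypothetical label for the GC strategy chosen after compaction.
type gcMode int

const (
	onlineGC gcMode = iota
	movingGC
)

// saturatingSub returns a-b, or 0 if b exceeds a, to avoid uint64 underflow.
func saturatingSub(a, b uint64) uint64 {
	if b > a {
		return 0
	}
	return a - b
}

// chooseGC picks moving GC when movingSize exceeds target-threshold, unless
// it also exceeds target-safetyBuffer, in which case moving GC is skipped so
// that the copy does not overflow the disk.
func chooseGC(movingSize, target, threshold, safetyBuffer uint64) gcMode {
	if movingSize > saturatingSub(target, safetyBuffer) {
		return onlineGC // too little headroom for a second copy
	}
	if movingSize > saturatingSub(target, threshold) {
		return movingGC // close to the target: reclaim space aggressively
	}
	return onlineGC
}

func main() {
	const gb = 1_000_000_000
	// Illustrative sizes only: target 650 GB, threshold 150 GB, buffer 50 GB.
	for _, size := range []uint64{400 * gb, 550 * gb, 620 * gb} {
		mode := chooseGC(size, 650*gb, 150*gb, 50*gb)
		fmt.Printf("moving size %d GB: moving GC = %v\n", size/gb, mode == movingGC)
	}
}
```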

type MapMarkSet added in v1.11.1

type MapMarkSet struct {
	// contains filtered or unexported fields
}

func (*MapMarkSet) BeginCriticalSection added in v1.14.0

func (s *MapMarkSet) BeginCriticalSection() error

func (*MapMarkSet) Close added in v1.11.1

func (s *MapMarkSet) Close() error

func (*MapMarkSet) EndCriticalSection added in v1.14.0

func (s *MapMarkSet) EndCriticalSection()

func (*MapMarkSet) Has added in v1.11.1

func (s *MapMarkSet) Has(cid cid.Cid) (bool, error)

func (*MapMarkSet) Mark added in v1.11.1

func (s *MapMarkSet) Mark(c cid.Cid) error

func (*MapMarkSet) MarkMany added in v1.14.0

func (s *MapMarkSet) MarkMany(batch []cid.Cid) error

func (*MapMarkSet) Visit added in v1.11.2

func (s *MapMarkSet) Visit(c cid.Cid) (bool, error)

type MapMarkSetEnv added in v1.11.1

type MapMarkSetEnv struct {
	// contains filtered or unexported fields
}

func NewMapMarkSetEnv added in v1.11.1

func NewMapMarkSetEnv(path string) (*MapMarkSetEnv, error)

func (*MapMarkSetEnv) Close added in v1.11.1

func (e *MapMarkSetEnv) Close() error

func (*MapMarkSetEnv) New added in v1.14.0

func (e *MapMarkSetEnv) New(name string, sizeHint int64) (MarkSet, error)

func (*MapMarkSetEnv) Recover added in v1.14.0

func (e *MapMarkSetEnv) Recover(name string) (MarkSet, error)

type MarkSet

type MarkSet interface {
	ObjectVisitor
	Mark(cid.Cid) error
	MarkMany([]cid.Cid) error
	Has(cid.Cid) (bool, error)
	Close() error

	// BeginCriticalSection ensures that the markset is persisted to disk for recovery in case
	// of abnormal termination during the critical section span.
	BeginCriticalSection() error
	// EndCriticalSection ends the critical section span.
	EndCriticalSection()
}

MarkSet is an interface for tracking CIDs during chain and object walks
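To make the marking methods concrete, here is a toy in-memory markset. It is a sketch, not the package's MapMarkSet: it uses string keys instead of cid.Cid, omits Close and the critical-section methods, and assumes that Visit returns true only the first time a key is seen.

```go
package main

import (
	"fmt"
	"sync"
)

// toyMarkSet is a hypothetical, simplified markset keyed by string.
type toyMarkSet struct {
	mu  sync.Mutex
	set map[string]struct{}
}

func newToyMarkSet() *toyMarkSet {
	return &toyMarkSet{set: make(map[string]struct{})}
}

// Mark records key as live.
func (s *toyMarkSet) Mark(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.set[key] = struct{}{}
	return nil
}

// MarkMany records every key in the batch as live.
func (s *toyMarkSet) MarkMany(batch []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, key := range batch {
		s.set[key] = struct{}{}
	}
	return nil
}

// Has reports whether key has been marked.
func (s *toyMarkSet) Has(key string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.set[key]
	return ok, nil
}

// Visit marks key and reports whether this was the first visit, which lets
// a walk skip subtrees it has already processed.
func (s *toyMarkSet) Visit(key string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.set[key]; ok {
		return false, nil
	}
	s.set[key] = struct{}{}
	return true, nil
}

func main() {
	ms := newToyMarkSet()
	walked := 0
	// The duplicate "a" is skipped because Visit reports it as already seen.
	for _, key := range []string{"a", "b", "a", "c"} {
		if first, _ := ms.Visit(key); first {
			walked++
		}
	}
	fmt.Println("objects walked:", walked)
}
```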

type MarkSetEnv

type MarkSetEnv interface {
	// New creates a new markset within the environment.
	// name is a unique name for this markset, mapped to the filesystem for on-disk persistence.
	// sizeHint is a hint about the expected size of the markset
	New(name string, sizeHint int64) (MarkSet, error)
	// Recover recovers an existing markset persisted on-disk.
	Recover(name string) (MarkSet, error)
	// Close closes the markset
	Close() error
}

func NewBadgerMarkSetEnv added in v1.11.1

func NewBadgerMarkSetEnv(path string) (MarkSetEnv, error)

func OpenMarkSetEnv

func OpenMarkSetEnv(path string, mtype string) (MarkSetEnv, error)

type ObjectVisitor added in v1.11.2

type ObjectVisitor interface {
	Visit(cid.Cid) (bool, error)
}

ObjectVisitor is an interface for deduplicating objects during walks

type SplitStore

type SplitStore struct {
	// contains filtered or unexported fields
}

func Open

func Open(path string, ds dstore.Datastore, hot, cold bstore.Blockstore, cfg *Config) (*SplitStore, error)

Open opens an existing splitstore, or creates a new splitstore. The splitstore is backed by the provided hot and cold stores. The returned SplitStore MUST be attached to the ChainStore with Start in order to trigger compaction.

func (*SplitStore) AddProtector added in v1.11.1

func (s *SplitStore) AddProtector(protector func(func(cid.Cid) error) error)

func (*SplitStore) AllKeysChan

func (s *SplitStore) AllKeysChan(ctx context.Context) (<-chan cid.Cid, error)

func (*SplitStore) Check added in v1.11.1

func (s *SplitStore) Check() error

Check performs an asynchronous health-check on the splitstore; results are appended to <splitstore-path>/check.txt

func (*SplitStore) Close

func (s *SplitStore) Close() error

func (*SplitStore) DeleteBlock

func (s *SplitStore) DeleteBlock(_ context.Context, _ cid.Cid) error

Blockstore interface

func (*SplitStore) DeleteMany

func (s *SplitStore) DeleteMany(_ context.Context, _ []cid.Cid) error

func (*SplitStore) Expose added in v1.11.1

func (s *SplitStore) Expose() bstore.Blockstore

func (*SplitStore) Flush added in v1.23.0

func (s *SplitStore) Flush(ctx context.Context) error

func (*SplitStore) GCHotStore added in v1.23.0

func (s *SplitStore) GCHotStore(opts api.HotGCOpts) error

GCHotStore runs online GC on the chain state in the hotstore according to the options specified

func (*SplitStore) Get

func (s *SplitStore) Get(ctx context.Context, cid cid.Cid) (blocks.Block, error)

func (*SplitStore) GetSize

func (s *SplitStore) GetSize(ctx context.Context, cid cid.Cid) (int, error)

func (*SplitStore) Has

func (s *SplitStore) Has(ctx context.Context, cid cid.Cid) (bool, error)

func (*SplitStore) HashOnRead

func (s *SplitStore) HashOnRead(enabled bool)

func (*SplitStore) HeadChange

func (s *SplitStore) HeadChange(_, apply []*types.TipSet) error

func (*SplitStore) Info added in v1.11.1

func (s *SplitStore) Info() map[string]interface{}

Info provides some basic information about the splitstore

func (*SplitStore) PruneChain added in v1.17.1

func (s *SplitStore) PruneChain(opts api.PruneOpts) error

PruneChain instructs the SplitStore to prune chain state in the coldstore, according to the options specified.

func (*SplitStore) Put

func (s *SplitStore) Put(ctx context.Context, blk blocks.Block) error

func (*SplitStore) PutMany

func (s *SplitStore) PutMany(ctx context.Context, blks []blocks.Block) error

func (*SplitStore) Start

func (s *SplitStore) Start(chain ChainAccessor, us stmgr.UpgradeSchedule) error

State tracking

func (*SplitStore) View

func (s *SplitStore) View(ctx context.Context, cid cid.Cid, cb func([]byte) error) error
