Developer Notes
Resilver is implemented as a multi-worker traversal over all mountpaths on a target. Each worker — called a jogger — walks filesystem trees and processes objects it encounters.
All joggers operate independently. There is no central coordinator, no shared queues, and no global ordering. Coordination emerges from deterministic algorithms and object-level locking.
This design ensures:
- full coverage (all objects are eventually visited),
- scalability (I/O parallelism),
- correctness for chunked objects,
- and robustness under preemption.
This document is structured as follows:
- Multi-Jogger Traversal
- Preemption Model
- Primary Copy
- Object Repair Flow
- Copy Cleanup and Restoration
- Locking Strategy
- Statistics and Accounting
- Testing and Validation
Multi-Jogger Traversal
Every mountpath is walked. This is essential.
Chunk manifests and chunks may reside on different mountpaths. Restricting traversal to only newly attached or detached mountpaths would miss objects whose metadata lives elsewhere. The multi-jogger design ensures that every manifest is discovered regardless of where it is stored.
Joggers may encounter the same logical object via different physical files. Correctness does not rely on which jogger sees an object first.
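The overall layout can be summarized with a minimal sketch: one independent goroutine ("jogger") per mountpath, joined at the end. The names (`resilver`, `walkMountpath`) are illustrative, not the actual AIStore API.

```go
package resilver

import "sync"

// resilver launches one independent jogger per mountpath and waits for all of
// them to finish; there is no shared queue and no central coordinator.
func resilver(mountpaths []string, walkMountpath func(root string)) {
	var wg sync.WaitGroup
	for _, mp := range mountpaths {
		wg.Add(1)
		go func(root string) {
			defer wg.Done()
			walkMountpath(root)
		}(mp)
	}
	wg.Wait()
}
```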
Preemption Model
Resilver is explicitly preemptible.
Any mountpath event aborts the current resilver. The abort is cooperative: joggers notice the abort condition and exit. A new resilver starts immediately with the updated mountpath configuration.
This prevents wasted work and avoids subtle inconsistencies that would arise from completing a run based on outdated volume topology.
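The cooperative abort can be pictured as a check between objects during the walk. The sketch below assumes context-based cancellation; `joggerWalk` and `repair` are hypothetical names, not the real implementation.

```go
package resilver

import (
	"context"
	"io/fs"
	"path/filepath"
)

// joggerWalk visits objects on a single mountpath and exits cooperatively as
// soon as the run is aborted (e.g., by a mountpath attach/detach event).
func joggerWalk(ctx context.Context, root string, repair func(fqn string) error) error {
	return filepath.WalkDir(root, func(fqn string, d fs.DirEntry, err error) error {
		if ctx.Err() != nil {
			// abort noticed: stop this walk; a new resilver restarts with the
			// updated mountpath configuration
			return filepath.SkipAll
		}
		if err != nil || d.IsDir() {
			return nil
		}
		return repair(fqn)
	})
}
```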
Primary Copy
When the main replica is missing but one or more copies exist, multiple joggers may encounter different copies.
To prevent duplicate restoration, resilver deterministically selects a single primary copy. The selection rule is simple: the lexicographically smallest FQN among all existing copies wins.
Only the jogger processing that primary copy performs restoration. All others skip their copies. This requires no locks and no coordination between joggers — the decision is purely local and deterministic.
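A sketch of the rule, under the stated assumption that each jogger can enumerate all existing copies of the object it is looking at (`isPrimary` is an illustrative name):

```go
package resilver

// isPrimary reports whether myFQN is the copy that should perform the
// restoration: the lexicographically smallest FQN among all existing copies.
// Every jogger evaluates this locally, without locks or coordination.
func isPrimary(myFQN string, copies []string) bool {
	if len(copies) == 0 {
		return false
	}
	smallest := copies[0]
	for _, fqn := range copies[1:] {
		if fqn < smallest {
			smallest = fqn
		}
	}
	return myFQN == smallest
}
```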
Object Repair Flow
For each object encountered, resilvering follows this logic:
- Determine the object’s HRW mountpath under the current configuration.
- If the object is already correct, proceed to copy maintenance if applicable.
- If the object is misplaced:
  - copy it to the HRW mountpath,
  - load the new main replica,
  - update metadata.
- For mirrored objects:
  - remove stale copy metadata,
  - create missing copies until the configured count is reached or no eligible mountpaths remain.
Main replica restoration always happens before copy restoration.
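The flow reduces to a short decision function. The sketch below is pseudocode-level: every helper (`hrwMountpath`, `isAtHRW`, `copyToMountpath`, and so on) is a placeholder, not a real API.

```go
package resilver

// repairObject sketches the per-object decision flow described above.
func repairObject(fqn string, mirrorCount int) error {
	hrw := hrwMountpath(fqn) // where the object should live under the current config
	if !isAtHRW(fqn, hrw) {  // misplaced: restore the main replica first
		if err := copyToMountpath(fqn, hrw); err != nil {
			return err
		}
		if err := loadAndUpdateMetadata(hrw); err != nil {
			return err
		}
	}
	if mirrorCount > 1 { // only then reconcile mirror copies
		removeStaleCopyMetadata(fqn)
		return restoreMissingCopies(fqn, mirrorCount)
	}
	return nil
}

// Stubs so the sketch is self-contained.
func hrwMountpath(fqn string) string                 { return "" }
func isAtHRW(fqn, hrw string) bool                   { return true }
func copyToMountpath(fqn, hrw string) error          { return nil }
func loadAndUpdateMetadata(hrw string) error         { return nil }
func removeStaleCopyMetadata(fqn string)             {}
func restoreMissingCopies(fqn string, cnt int) error { return nil }
```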
Copy Cleanup and Restoration
Resilver first removes metadata entries for copies on unavailable or disabled mountpaths. This step is necessary to avoid counting copies that no longer exist or cannot be accessed.
After cleanup, resilvering counts remaining valid copies and compares against the bucket's mirror configuration. Missing copies are created on eligible mountpaths selected using least-utilization heuristics.
Copy restoration is always performed by the same jogger that restored (or verified) the main replica, ensuring consistency and avoiding races.
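The reconciliation step amounts to: count valid copies, compute the deficit, and place missing copies on the idlest eligible mountpaths. The following is an illustrative sketch only; `mpathInfo`, `Utilization`, and `createCopy` are assumptions, not the real types.

```go
package resilver

import "sort"

type mpathInfo struct {
	Path        string
	Utilization int // e.g., percent busy; lower is a better placement target
}

// restoreCopies creates missing copies, preferring least-utilized mountpaths.
func restoreCopies(validCopies, configured int, eligible []mpathInfo,
	createCopy func(target mpathInfo) error) {

	missing := configured - validCopies
	if missing <= 0 || len(eligible) == 0 {
		return
	}
	// least-utilization heuristic: try the idlest mountpaths first
	sort.Slice(eligible, func(i, j int) bool {
		return eligible[i].Utilization < eligible[j].Utilization
	})
	for _, mp := range eligible {
		if missing == 0 {
			break
		}
		if err := createCopy(mp); err != nil {
			continue // skip this mountpath, try the next eligible one
		}
		missing--
	}
}
```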
Locking Strategy
Resilver must coexist with normal I/O.
Objects are locked during relocation to prevent concurrent modification. However, aggressively blocking on locks would stall resilver under load.
The locking strategy is intentionally tiered:
- A non-blocking try-lock is attempted first.
- Blocking retries are reserved for important objects:
  - main replicas at HRW locations,
  - mirrored objects requiring copy maintenance.
- Non-HRW copies in non-mirrored buckets are skipped if busy.
This balances progress with minimal disruption.
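A minimal sketch of the tiered policy, assuming a hypothetical per-object lock with TryLock/Lock semantics (not the actual AIStore locking API):

```go
package resilver

import "sync"

// processWithLock applies the tiered policy: try-lock first; fall back to a
// blocking lock only for "important" objects (a main replica at its HRW
// location, or an object needing mirror-copy maintenance); otherwise skip.
func processWithLock(lock *sync.Mutex, important bool, work func()) (done bool) {
	if lock.TryLock() {
		defer lock.Unlock()
		work()
		return true
	}
	if !important {
		return false // busy non-HRW copy in a non-mirrored bucket: skip it
	}
	lock.Lock() // blocking acquisition reserved for important objects
	defer lock.Unlock()
	work()
	return true
}
```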
Statistics and Accounting
Resilver statistics intentionally track repairs, not traversal.
The reported counter increments only when a main replica is restored to its HRW mountpath. Copy creation does not increment it. This keeps the reported progress meaningful even when mirroring is disabled and avoids overstating the work done.
Testing and Validation
Resilver is tested under:
- concurrent mountpath operations
- repeated preemption
- erasure-coded, mirrored and chunked buckets
- partial failures and restarts
Stress tests verify convergence rather than step-by-step behavior. The system is considered correct if, after mountpaths stabilize, all objects end up correctly placed with the configured redundancy.
See also