cid

package
v0.0.2 Latest
Published: Aug 10, 2021 License: MIT Imports: 5 Imported by: 0

README

What golang Kinds work best to implement CIDs?

There are many possible ways to implement CIDs. This package explores them.

criteria

There are a few different criteria to consider:

  • We want the best performance when operating on the type (getters, mostly);
  • We want to minimize the number of memory allocations we need;
  • We want types which can be used as map keys, because this is common.

The priority of these criteria is open to argument, but it's probably mapkeys > minalloc > anythingelse. (Mapkeys and minalloc are also quite entangled: if we don't pick a representation that can work natively as a map key, we'll end up needing a KeyRepr() method which gives us something that does work as a map key, and that will almost certainly involve a malloc itself.)

options

There are quite a few different ways to go:

  • Option A: CIDs as a struct; multihash as bytes.
  • Option B: CIDs as a string.
  • Option C: CIDs as an interface with multiple implementors.
  • Option D: CIDs as a struct; multihash also as a struct or string.
  • Option E: CIDs as a struct; content as strings plus offsets.
  • Option F: CIDs as a struct wrapping only a string.

The current approach on the master branch is Option A.

Option D is distinct from Option A because multihash as bytes transitively causes the CID struct to be non-comparable and thus not suitable for map keys as per https://golang.org/ref/spec#KeyType . (It's also a bit more work to pursue Option D because it's just a bigger splash radius of change; but it's also something we might want to do soon, because we have these same map-key-usability concerns with multihash alone.)
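The comparability difference between Option A and Option D can be sketched as follows. This is a minimal illustration, not the package's code: the type names cidA and cidD and the keyRepr method are hypothetical stand-ins for the shapes discussed above.

```go
package main

import "fmt"

// cidD is a hypothetical Option D shape: every field is comparable,
// so the struct is comparable and may be used directly as a map key.
type cidD struct {
	version   uint64
	codec     uint64
	multihash string // binary multihash held in a string
}

// cidA is a hypothetical Option A shape. The []byte field makes the
// struct non-comparable, so a map[cidA]int is rejected by the compiler.
type cidA struct {
	version   uint64
	codec     uint64
	multihash []byte
}

// keyRepr is the workaround Option A would need: build a comparable
// key on demand. The string conversion copies the multihash bytes.
func (c cidA) keyRepr() cidD {
	return cidD{c.version, c.codec, string(c.multihash)}
}

// countDistinct counts distinct CIDs, using keyRepr as the map key.
func countDistinct(cids []cidA) int {
	seen := make(map[cidD]struct{})
	for _, c := range cids {
		seen[c.keyRepr()] = struct{}{}
	}
	return len(seen)
}

func main() {
	a := cidA{1, 0x70, []byte{0x12, 0x01, 0xaa}}
	b := cidA{1, 0x70, []byte{0x12, 0x01, 0xaa}} // same content, separate slice
	c := cidA{1, 0x71, []byte{0x12, 0x01, 0xaa}} // different codec
	fmt.Println(countDistinct([]cidA{a, b, c}))  // 2
}
```

The extra copy inside keyRepr is the kind of cost the criteria section warns about when the representation cannot serve as a map key natively.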

Option E is distinct from Option D because Option E would always maintain the binary format of the cid internally, and so could yield it again without malloc, while still potentially having faster access to components than Option B since it wouldn't need to re-parse varints to access later fields.

Option F is actually a variation of Option B; it's distinct from the other struct options because it is proposing literally struct{ x string } as the type, with no additional fields for components nor offsets.

Option C is the avoid-choices choice, but note that interfaces are not free; since "minimize mallocs" is one of our major goals, we cannot use interfaces whimsically.

Note there is no proposal for migrating to type Cid []byte, because that is generally considered to be strictly inferior to type Cid string.

Discoveries

using interfaces as map keys forgoes a lot of safety checks

Using interfaces as map keys pushes a bunch of type checking to runtime. E.g., it's totally valid at compile time to push a type which is non-comparable into a map key; it will panic at runtime instead of failing at compile-time.

There's also no way to define equality checks between implementors of the interface: golang will always use its innate concept of comparison for the concrete types. This means it's effectively never safe to use two different concrete implementations of an interface in the same map; you may add elements which are semantically "equal" in your mind, and end up very confused later when both impls of the same "equal" object have been stored.
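Both hazards are easy to reproduce in a few lines. The sketch below uses a stand-in Cid interface and three hypothetical implementors (strCid, wrapCid, sliceCid); none of these names come from the package.

```go
package main

import "fmt"

// Cid is a stand-in interface for this sketch only.
type Cid interface{ Bytes() []byte }

// strCid is comparable (string kind).
type strCid string

func (c strCid) Bytes() []byte { return []byte(c) }

// wrapCid is comparable too, but is a different concrete type.
type wrapCid struct{ s string }

func (c wrapCid) Bytes() []byte { return []byte(c.s) }

// sliceCid is not comparable, yet it still satisfies the interface.
type sliceCid struct{ b []byte }

func (c sliceCid) Bytes() []byte { return c.b }

// insertPanics reports whether using k as a map key panics at runtime.
func insertPanics(k Cid) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	m := map[Cid]bool{}
	m[k] = true
	return false
}

// distinctKeys stores the same bytes under two concrete types and
// returns how many entries the map ends up holding.
func distinctKeys() int {
	m := map[Cid]bool{}
	m[strCid("abc")] = true
	m[wrapCid{"abc"}] = true
	return len(m)
}

func main() {
	fmt.Println(insertPanics(strCid("abc")))           // false
	fmt.Println(insertPanics(sliceCid{[]byte("abc")})) // true: unhashable type
	fmt.Println(distinctKeys())                        // 2
}
```

The compiler accepts all three insertions; only the runtime notices that sliceCid cannot be hashed, and nothing ever notices that the two "abc" entries were meant to be the same CID.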

sentinel values are possible in any impl, but some are clearer than others

When using *Cid, the nil value is a clear sentinel for 'invalid'; when using type Cid string, the zero value is a clear sentinel; when using type Cid struct per Option A or D... the only valid check is for a nil multihash field, since version=0 and codec=0 are both valid values. When using type Cid struct{string} per Option F, zero is a clear sentinel.
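As a rough illustration of those sentinel checks, the sketch below defines one hypothetical type per shape (cidStr, cidWrap, cidFields) with an isZero helper; the helper name is an assumption made for this example, not part of the package.

```go
package main

import "fmt"

// cidStr mirrors Option B: the zero value "" can never be a valid CID.
type cidStr string

// cidWrap mirrors Option F: the zero value wraps "".
type cidWrap struct{ s string }

// cidFields mirrors Option A/D: version 0 and codec 0 are both legal
// values, so only the multihash field can signal "unset".
type cidFields struct {
	version, codec uint64
	multihash      string
}

func (c cidStr) isZero() bool    { return c == "" }
func (c cidWrap) isZero() bool   { return c == cidWrap{} }
func (c cidFields) isZero() bool { return c.multihash == "" }

func main() {
	var s cidStr
	var w cidWrap
	var f cidFields
	fmt.Println(s.isZero(), w.isZero(), f.isZero()) // true true true

	// Version 0 and codec 0 with a multihash present is not a sentinel.
	v0 := cidFields{0, 0, "\x12\x01\xaa"}
	fmt.Println(v0.isZero()) // false
}
```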

usability as a map key is important

We already covered this in the criteria section, but for clarity:

  • Option A: ❌
  • Option B: ✔
  • Option C: ~ (caveats, and depends on concrete impl)
  • Option D: ✔
  • Option E: ✔
  • Option F: ✔

living without offsets requires parsing

Since CID (and multihash!) are defined using varints, they require parsing; we can't just jump into the string at a known offset in order to yield e.g. the multicodec number.

In order to get to the 'meat' of the CID (the multihash content), we first must parse:

  • the CID version varint;
  • the multicodec varint;
  • the multihash type enum varint;
  • and the multihash length varint.

Since there are many applications where we want to jump straight to the multihash content (for example, when doing CAS sharding -- see the disclaimer about bias in leading bytes), this overhead may be interesting.

How significant this overhead is is hard to say from microbenchmarking; it depends largely on usage patterns. If these traversals are a significant timesink, it would be an argument for Option D/E. If these traversals are not a significant timesink, we might be wiser to keep to Option B/F, because keeping a struct full of offsets will add several words of memory usage per CID, and we keep a lot of CIDs.
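To make the traversal concrete, here is a minimal sketch of walking the four leading varints with the standard library's encoding/binary.Uvarint. It assumes the CID varints are plain unsigned LEB128 as that function decodes them; the function name digestOffset and the shortened two-byte digest are inventions for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// digestOffset walks the four leading varints of a binary CIDv1
// (version, multicodec, hash type, hash length) and returns the offset
// at which the hash digest starts, plus the declared digest length.
func digestOffset(cid string) (offset int, length uint64, err error) {
	buf := []byte(cid) // note: this conversion may itself copy
	var v uint64
	for i := 0; i < 4; i++ {
		var n int
		v, n = binary.Uvarint(buf[offset:])
		if n <= 0 {
			return 0, 0, errors.New("malformed varint")
		}
		offset += n
	}
	// After the loop, v holds the fourth varint: the digest length.
	return offset, v, nil
}

func main() {
	// version=1, codec=0x70, hash type=0x12, length=2, digest=0xbeef.
	// A real sha2-256 digest is 32 bytes; 2 keeps the sketch short.
	cid := string([]byte{0x01, 0x70, 0x12, 0x02, 0xbe, 0xef})
	off, n, err := digestOffset(cid)
	fmt.Println(off, n, err) // 4 2 <nil>
}
```

Codecs at or above 0x80 take two bytes as a varint, which is why the digest offset cannot be hard-coded. Option D/E would pay this walk once at parse time; Option B/F pays it on every access.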

interfaces cause boxing which is a significant performance cost

See BenchmarkCidMap_CidStr and friends.

Long story short: using interfaces anywhere will cause the compiler to implicitly generate boxing and unboxing code (e.g. runtime.convT2E); this is both another function call, and more concerningly, results in large numbers of unbatchable memory allocations.

Numbers without context are dangerous, but if you need one: 33%. It's a big deal.

This means attempts to "use interfaces, but switch to concrete impls when performance is important" are a red herring: it doesn't work that way.

This is not a general indictment against using interfaces -- but if a situation is at the scale where it's become important to mind whether or not pointers are a performance impact, then that situation also is one where you have to think twice before using interfaces.
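The allocation half of that cost can be observed without a full benchmark. The following sketch is not the package's BenchmarkCidMap code; it uses the standard library's testing.AllocsPerRun to count allocations when a string-kind value is stored directly versus through an interface. The type and variable names are assumptions for illustration, and exact counts may vary by compiler version.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Cid is a stand-in interface for this sketch only.
type Cid interface{ Len() int }

type cidStr string

func (c cidStr) Len() int { return len(c) }

var (
	// input is built at run time so the compiler cannot treat it as a constant.
	input = cidStr(strings.Repeat("x", 34))

	// Package-level sinks force the stored values to escape.
	sinkDirect cidStr
	sinkBoxed  Cid
)

// directAllocs measures storing the concrete value.
func directAllocs() float64 {
	return testing.AllocsPerRun(1000, func() { sinkDirect = input })
}

// boxedAllocs measures storing the same value through the interface,
// which requires the string header to be boxed.
func boxedAllocs() float64 {
	return testing.AllocsPerRun(1000, func() { sinkBoxed = input })
}

func main() {
	fmt.Println("direct:", directAllocs(), "boxed:", boxedAllocs())
}
```

Each boxed store produces its own small allocation, which is the "unbatchable" pattern described above.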

struct wrappers can be used in place of typedefs with zero overhead

See TestSizeOf.

Using the unsafe.Sizeof feature to inspect what the Go runtime thinks, we can see that type Foo string and type Foo struct{x string} consume precisely the same amount of memory.

This is interesting because it means we can choose between either type definition with no significant overhead anywhere we use it: thus, we can choose freely between Option B and Option F based on which we feel is more pleasant to work with.

Option F (a struct wrapper) means we can prevent casting into our Cid type. Option B (typedef string) can be declared a const. Are there any other concerns that would separate the two choices?
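The trade-off can be seen side by side in the sketch below. The names cidStr, cidWrap and parseWrap are hypothetical, and the "validation" is a placeholder; the point is only which conversions and declarations the compiler permits.

```go
package main

import (
	"errors"
	"fmt"
)

// cidStr is the Option B shape: a defined string type.
type cidStr string

// Constants of a string-kind type are allowed.
const emptyCidStr = cidStr("")

// cidWrap is the Option F shape. With an unexported field, code in
// other packages cannot build one from arbitrary data.
type cidWrap struct{ s string }

// Structs cannot be constants, so a package-level variable stands in.
var emptyCidWrap = cidWrap{}

// parseWrap is a hypothetical validating constructor; real validation
// would parse the varints rather than only rejecting empty input.
func parseWrap(data string) (cidWrap, error) {
	if data == "" {
		return emptyCidWrap, errors.New("empty cid")
	}
	return cidWrap{data}, nil
}

func main() {
	unchecked := cidStr("not a cid") // compiles: the conversion bypasses validation
	// _ = cidWrap("not a cid")      // does not compile: cannot convert string to struct
	_, err := parseWrap("")
	fmt.Println(unchecked == emptyCidStr, err != nil) // false true
}
```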

one way or another: let's get rid of that star

We should switch completely to handling Cid and remove *Cid completely. Regardless of whether we do this by migrating to interface, or string implementations, or simply structs with no pointers... once we get there, refactoring to any of the others can become a no-op from the perspective of any downstream code that uses CIDs.

(This means all access via functions, never references to fields -- even if we were to use a struct implementation. Pretend there's an interface, in other words.)

There are probably gofix incantations which can help us with this migration.

Documentation

Index

Constants

const (
	Raw = 0x55

	DagProtobuf = 0x70
	DagCBOR     = 0x71
	P2pKey      = 0x72

	GitRaw = 0x78

	EthBlock           = 0x90
	EthBlockList       = 0x91
	EthTxTrie          = 0x92
	EthTx              = 0x93
	EthTxReceiptTrie   = 0x94
	EthTxReceipt       = 0x95
	EthStateTrie       = 0x96
	EthAccountSnapshot = 0x97
	EthStorageTrie     = 0x98
	BitcoinBlock       = 0xb0
	BitcoinTx          = 0xb1
	ZcashBlock         = 0xc0
	ZcashTx            = 0xc1
	DecredBlock        = 0xe0
	DecredTx           = 0xe1
)

These are multicodec-packed content types. They should match the codes described in the authoritative document: https://github.com/multiformats/multicodec/blob/master/table.csv

const EmptyCidStr = CidStr("")

EmptyCidStr is a constant for a zero/uninitialized/sentinel-value cid; it is declared mainly for readability in checks for sentinel values.

Variables

var (
	// ErrVarintBuffSmall means that a buffer passed to the cid parser was not
	// long enough, or did not contain a valid cid
	ErrVarintBuffSmall = errors.New("reading varint: buffer too small")

	// ErrVarintTooBig means that the varint in the given cid was above the
	// limit of 2^64
	ErrVarintTooBig = errors.New("reading varint: varint bigger than 64bits" +
		" and not supported")

	// ErrCidTooShort means that the cid passed to decode was not long
	// enough to be a valid Cid
	ErrCidTooShort = errors.New("cid too short")

	// ErrInvalidEncoding means that selected encoding is not supported
	// by this Cid version
	ErrInvalidEncoding = errors.New("invalid base encoding")
)
var CodecToStr = map[uint64]string{
	Raw:                "raw",
	DagProtobuf:        "protobuf",
	DagCBOR:            "cbor",
	P2pKey:             "p2p-key",
	GitRaw:             "git-raw",
	EthBlock:           "eth-block",
	EthBlockList:       "eth-block-list",
	EthTxTrie:          "eth-tx-trie",
	EthTx:              "eth-tx",
	EthTxReceiptTrie:   "eth-tx-receipt-trie",
	EthTxReceipt:       "eth-tx-receipt",
	EthStateTrie:       "eth-state-trie",
	EthAccountSnapshot: "eth-account-snapshot",
	EthStorageTrie:     "eth-storage-trie",
	BitcoinBlock:       "bitcoin-block",
	BitcoinTx:          "bitcoin-tx",
	ZcashBlock:         "zcash-block",
	ZcashTx:            "zcash-tx",
	DecredBlock:        "decred-block",
	DecredTx:           "decred-tx",
}

CodecToStr maps the numeric codec to its name

var Codecs = map[string]uint64{
	"v0":                   DagProtobuf,
	"raw":                  Raw,
	"protobuf":             DagProtobuf,
	"cbor":                 DagCBOR,
	"p2p-key":              P2pKey,
	"git-raw":              GitRaw,
	"eth-block":            EthBlock,
	"eth-block-list":       EthBlockList,
	"eth-tx-trie":          EthTxTrie,
	"eth-tx":               EthTx,
	"eth-tx-receipt-trie":  EthTxReceiptTrie,
	"eth-tx-receipt":       EthTxReceipt,
	"eth-state-trie":       EthStateTrie,
	"eth-account-snapshot": EthAccountSnapshot,
	"eth-storage-trie":     EthStorageTrie,
	"bitcoin-block":        BitcoinBlock,
	"bitcoin-tx":           BitcoinTx,
	"zcash-block":          ZcashBlock,
	"zcash-tx":             ZcashTx,
	"decred-block":         DecredBlock,
	"decred-tx":            DecredTx,
}

Codecs maps the name of a codec to its type
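Note that the two tables are not exact inverses: Codecs carries the alias "v0" for DagProtobuf, while CodecToStr reports only "protobuf" for that code. The sketch below uses local excerpts of both maps (copied from the listings above so that it runs standalone) and a hypothetical helper, canonicalName, to show the round trip.

```go
package main

import "fmt"

// Local excerpts of the package's tables, copied so the sketch runs
// without importing the package.
var codecs = map[string]uint64{
	"v0":       0x70,
	"raw":      0x55,
	"protobuf": 0x70,
	"cbor":     0x71,
}

var codecToStr = map[uint64]string{
	0x55: "raw",
	0x70: "protobuf",
	0x71: "cbor",
}

// canonicalName resolves a codec name (possibly an alias such as "v0")
// to the name the reverse table reports for the same numeric code.
func canonicalName(name string) (string, bool) {
	code, ok := codecs[name]
	if !ok {
		return "", false
	}
	s, ok := codecToStr[code]
	return s, ok
}

func main() {
	fmt.Println(canonicalName("v0"))      // protobuf true
	fmt.Println(canonicalName("cbor"))    // cbor true
	fmt.Println(canonicalName("unknown")) //  false
}
```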

var EmptyCidStruct = CidStruct{}

EmptyCidStruct is a constant for a zero/uninitialized/sentinel-value cid; it is declared mainly for readability in checks for sentinel values.

Note: it's not actually a const; the compiler does not allow const structs.

Functions

This section is empty.

Types

type Cid

type Cid interface {
	Version() uint64         // Yields the version prefix as a uint.
	Multicodec() uint64      // Yields the multicodec as a uint.
	Multihash() mh.Multihash // Yields the multihash segment.

	String() string // Produces the CID formatted as b58 string.
	Bytes() []byte  // Produces the CID formatted as raw binary.

	Prefix() Prefix // Produces a tuple of non-content metadata.

}

Cid represents a self-describing content-addressed identifier.

A CID is composed of:

  • a Version of the CID itself,
  • a Multicodec (indicates the encoding of the referenced content),
  • and a Multihash (which identifies the referenced content).

(Note that the Multihash further contains its own hash type and length indicators.)
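As a rough picture of how those three parts sit in the binary form, the sketch below assembles a CIDv1 by hand using only the standard library. It is not how this package constructs CIDs; encodeCidV1 is a hypothetical helper, sha2-256 (multihash code 0x12) is chosen arbitrarily, and binary.AppendUvarint requires Go 1.19 or later.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// encodeCidV1 concatenates <version><codec><multihash> for version 1.
// The multihash is itself <hash-type><digest-length><digest>.
func encodeCidV1(codec uint64, content []byte) []byte {
	digest := sha256.Sum256(content)

	// CID version.
	out := binary.AppendUvarint(nil, 1)
	// Multicodec of the referenced content.
	out = binary.AppendUvarint(out, codec)
	// Multihash type: 0x12 is sha2-256.
	out = binary.AppendUvarint(out, 0x12)
	// Multihash digest length in bytes.
	out = binary.AppendUvarint(out, uint64(len(digest)))
	// The digest itself.
	return append(out, digest[:]...)
}

func main() {
	b := encodeCidV1(0x55, []byte("hello")) // 0x55 is Raw
	fmt.Printf("%d %x\n", len(b), b[:4])    // 36 01551220
}
```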

type CidStr

type CidStr string

CidStr is a representation of a Cid as a string type containing binary.

Using golang's string type is preferable over byte slices even for binary data because golang strings are immutable, usable as map keys, trivially comparable with built-in equals operators, etc.
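A small sketch of those properties, using plain strings rather than the package's types (freeze and demo are names made up for this example):

```go
package main

import "fmt"

// freeze copies raw bytes into a string. Later writes to the source
// slice cannot alter the stored value.
func freeze(raw []byte) string { return string(raw) }

// demo stores binary data as a map key, mutates the original buffer,
// and reports whether the key is still found and still equal.
func demo() (found, equal bool) {
	raw := []byte{0x01, 0x55, 0x12}
	key := freeze(raw)
	seen := map[string]bool{key: true}

	raw[0] = 0xff // mutate the original buffer

	_, found = seen["\x01\x55\x12"]
	equal = key == "\x01\x55\x12"
	return found, equal
}

func main() {
	fmt.Println(demo()) // true true
}
```

A []byte could not be used as the map key at all, and comparing two of them needs bytes.Equal rather than ==.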

Please do not cast strings or bytes into the CidStr type directly; use a parse method which validates the data and yields a CidStr.

func CidStrParse

func CidStrParse(data []byte) (CidStr, error)

CidStrParse takes a binary byte slice, parses it, and returns either a valid CidStr, or the zero CidStr and an error.

For CidV1, the data buffer is in the form:

<version><codec-type><multihash>

CidV0 are also supported. In particular, data buffers of length 34 bytes that start with bytes [18,32...] are considered binary multihashes.
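The CidV0 rule amounts to a three-part check, sketched below. looksLikeCidV0 is a hypothetical helper written from the description above, not the parser's actual code; 18 and 32 are the sha2-256 multihash code (0x12) and its digest length (0x20).

```go
package main

import "fmt"

// looksLikeCidV0 applies the rule described above: exactly 34 bytes,
// beginning with 18 (sha2-256) and 32 (digest length).
func looksLikeCidV0(data []byte) bool {
	return len(data) == 34 && data[0] == 18 && data[1] == 32
}

func main() {
	v0 := make([]byte, 34)
	v0[0], v0[1] = 18, 32
	fmt.Println(looksLikeCidV0(v0)) // true

	v1 := append([]byte{0x01, 0x70}, v0...) // version and codec prepended
	fmt.Println(looksLikeCidV0(v1))         // false
}
```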

The multicodec bytes are not parsed to verify they're a valid varint; no further reification is performed.

Multibase encoding should already have been unwrapped before parsing; if you have a multibase-enveloped string, use CidStrDecode instead.

CidStrParse is the inverse of Cid.Bytes().

func NewCidStr

func NewCidStr(version uint64, codecType uint64, mhash mh.Multihash) CidStr

func (CidStr) Bytes

func (c CidStr) Bytes() []byte

Bytes produces a raw binary format of the CID.

(For CidStr, this method is only distinct from casting because of compatibility with v0 CIDs.)

func (CidStr) Multicodec

func (c CidStr) Multicodec() uint64

func (CidStr) Multihash

func (c CidStr) Multihash() mh.Multihash

func (CidStr) Prefix

func (c CidStr) Prefix() Prefix

Prefix builds and returns a Prefix out of a Cid.

func (CidStr) String

func (c CidStr) String() string

String returns the default string representation of a Cid. Currently, Base58 is used as the encoding for the multibase string.

func (CidStr) Version

func (c CidStr) Version() uint64

type CidStruct

type CidStruct struct {
	// contains filtered or unexported fields
}

CidStruct represents a CID in a struct format.

This format complies with the exact same Cid interface as the CidStr implementation, but completely pre-parses the Cid metadata. CidStruct is a tad quicker in case of repeatedly accessed fields, but requires more reshuffling to parse and to serialize. CidStruct is not usable as a map key, because it contains a Multihash reference, which is a slice, and thus not "comparable" as a primitive.

Beware of zero-valued CidStruct: it is difficult to distinguish an incorrectly-initialized "invalid" CidStruct from one representing a v0 cid.

func CidStructParse

func CidStructParse(data []byte) (CidStruct, error)

CidStructParse takes a binary byte slice, parses it, and returns either a valid CidStruct, or the zero CidStruct and an error.

For CidV1, the data buffer is in the form:

<version><codec-type><multihash>

CidV0 are also supported. In particular, data buffers of length 34 bytes that start with bytes [18,32...] are considered binary multihashes.

The multicodec bytes are not parsed to verify they're a valid varint; no further reification is performed.

Multibase encoding should already have been unwrapped before parsing; if you have a multibase-enveloped string, use CidStructDecode instead.

CidStructParse is the inverse of Cid.Bytes().

func (CidStruct) Bytes

func (c CidStruct) Bytes() []byte

Bytes produces a raw binary format of the CID.

func (CidStruct) Multicodec

func (c CidStruct) Multicodec() uint64

func (CidStruct) Multihash

func (c CidStruct) Multihash() mh.Multihash

func (CidStruct) Prefix

func (c CidStruct) Prefix() Prefix

Prefix builds and returns a Prefix out of a Cid.

func (CidStruct) String

func (c CidStruct) String() string

String returns the default string representation of a Cid. Currently, Base58 is used as the encoding for the multibase string.

func (CidStruct) Version

func (c CidStruct) Version() uint64

type Prefix

type Prefix struct {
	Version  uint64
	Codec    uint64
	MhType   uint64
	MhLength int
}

Prefix represents all the metadata of a Cid, that is, the Version, the Codec, the Multihash type and the Multihash length. It does not contain any actual content information. NOTE: The use of -1 in MhLength to mean default length is deprecated; use the V0Builder or V1Builder structures instead.
