Documentation ¶
Overview ¶
Package token implements deterministic (HMAC-SHA256) and probabilistic (Bloom filter) tokenization plus the comparison primitives Equal, DicePerField, Score, and Match.
Most callers want Match — it wraps DicePerField + Score and returns the thresholded decision in one call. Even simpler: package session bundles a Tokenizer with a FieldSet so you don't have to thread the schema through every call.
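The probabilistic mode encodes each field as a Bloom filter so that similar values yield overlapping bit patterns. A minimal standalone sketch of that idea — splitting a normalized value into character bigrams and hashing each into an m-bit filter — is shown below. The bigram size, filter parameters, and FNV-based double-hashing scheme are illustrative assumptions, not the package's actual ProbabilisticParams:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// bloomize sets k positions per character bigram of a normalized value
// in an m-bit Bloom filter, using double hashing to derive the k
// positions from one FNV-64a hash.
func bloomize(value string, m, k uint64) []byte {
	norm := strings.ToLower(strings.TrimSpace(value))
	filter := make([]byte, (m+7)/8)
	for i := 0; i+2 <= len(norm); i++ {
		h := fnv.New64a()
		h.Write([]byte(norm[i : i+2]))
		h1 := h.Sum64()
		h2 := h1>>32 | h1<<32 // cheap second hash for double hashing
		for j := uint64(0); j < k; j++ {
			pos := (h1 + j*h2) % m
			filter[pos/8] |= 1 << (pos % 8)
		}
	}
	return filter
}

func main() {
	a := bloomize("Smith", 256, 4)
	b := bloomize("  smith ", 256, 4) // normalization makes these identical
	fmt.Println(string(a) == string(b))
}
```

Because similar strings share most of their bigrams, their filters share most of their set bits, which is what makes the Dice comparison below meaningful.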
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DicePerField ¶
func DicePerField(a, b sriracha.ProbabilisticToken) ([]float64, error)
DicePerField returns the Sørensen–Dice coefficient between corresponding fields of a and b. The result is one score in [0, 1] per field, in FieldSet order. A field with an all-zero filter on either side scores 0.
Returns an error if FieldSetVersion, KeyID, FieldSetFingerprint (when both sides set it), ProbabilisticParams, or field count differ — scores would not be comparable.
Most callers want Match — it wraps DicePerField + Score and returns the thresholded decision.
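The per-field computation can be sketched directly over raw filters. This standalone dice assumes each field's Bloom filter is a plain byte slice; the real function operates on ProbabilisticToken fields after its header checks:

```go
package main

import (
	"fmt"
	"math/bits"
)

// dice returns the Sørensen–Dice coefficient between two equal-length
// Bloom filters: 2·|A∩B| / (|A| + |B|), counting set bits.
func dice(a, b []byte) float64 {
	var inter, setA, setB int
	for i := range a {
		inter += bits.OnesCount8(a[i] & b[i])
		setA += bits.OnesCount8(a[i])
		setB += bits.OnesCount8(b[i])
	}
	if setA+setB == 0 {
		return 0 // an all-zero filter on either side scores 0
	}
	return 2 * float64(inter) / float64(setA+setB)
}

func main() {
	a := []byte{0b10110010, 0b00001111}
	b := []byte{0b10100010, 0b00001101}
	fmt.Printf("%.3f\n", dice(a, a)) // identical filters → 1.000
	fmt.Printf("%.3f\n", dice(a, b))
}
```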
func Equal ¶
func Equal(a, b sriracha.DeterministicToken) bool
Equal reports whether a and b are bit-identical across every field. It returns false if FieldSetVersion, KeyID, FieldSetFingerprint (when both sides set it), or field count differ. A field that is nil on one side and non-nil (or differently-sized) on the other compares unequal. Per-field byte comparison is constant-time.
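The constant-time field comparison can be sketched with crypto/subtle. This standalone equalFields assumes the header checks (version, key ID, fingerprint, field count) have already passed, and shows the no-early-exit shape: every field is compared even after a mismatch is found, so timing does not reveal which field differed:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// equalFields reports whether two per-field HMAC slices are bit-identical,
// comparing each field in constant time with no early exit.
func equalFields(a, b [][]byte) bool {
	if len(a) != len(b) {
		return false
	}
	equal := 1
	for i := range a {
		// ConstantTimeCompare returns 0 when lengths differ, so a nil
		// field on one side and a 32-byte field on the other compares
		// unequal, as documented.
		equal &= subtle.ConstantTimeCompare(a[i], b[i])
	}
	return equal == 1
}

func main() {
	x := [][]byte{{1, 2}, {3, 4}}
	y := [][]byte{{1, 2}, {3, 5}}
	fmt.Println(equalFields(x, x), equalFields(x, y)) // true false
}
```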
func Score ¶
func Score(perField []float64, fs sriracha.FieldSet) (float64, error)
Score returns the weight-normalised aggregate of perField against fs.Fields[i].Weight. Fields with non-positive weight are excluded from both numerator and denominator, so callers can mask out absent fields by zeroing their weight. Returns an error if the lengths do not match or no field has positive weight.
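The aggregation rule can be sketched in isolation. This standalone score substitutes a plain weights slice for fs.Fields[i].Weight but implements the same exclusion of non-positive weights from both numerator and denominator:

```go
package main

import (
	"errors"
	"fmt"
)

// score returns the weight-normalized aggregate of perField.
func score(perField, weights []float64) (float64, error) {
	if len(perField) != len(weights) {
		return 0, errors.New("length mismatch")
	}
	var num, den float64
	for i, w := range weights {
		if w <= 0 {
			continue // excluded from both numerator and denominator
		}
		num += w * perField[i]
		den += w
	}
	if den == 0 {
		return 0, errors.New("no field has positive weight")
	}
	return num / den, nil
}

func main() {
	// The second field is masked out by a zero weight, so the aggregate
	// is the first field's score alone.
	s, _ := score([]float64{0.9, 0.1}, []float64{1, 0})
	fmt.Println(s)
}
```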
Types ¶
type Calibration ¶
type Calibration struct {
	OptimalThreshold float64    `json:"optimal_threshold"`
	F1               float64    `json:"f1"`
	Precision        float64    `json:"precision"`
	Recall           float64    `json:"recall"`
	ROC              []ROCPoint `json:"roc"`
}
Calibration is the output of Calibrate: the threshold that maximizes F1 over the labeled pairs, plus the full ROC curve at 0.01 step granularity.
func Calibrate ¶
func Calibrate(pairs []LabeledPair, fs sriracha.FieldSet) (Calibration, error)
Calibrate sweeps thresholds in 0.01 steps from 0.00 to 1.00 (101 points) and reports the threshold that maximizes F1 over pairs. Use this to pick the threshold for production Match calls instead of guessing.
Cost is O(N×fields_per_token) Dice work plus O(N×101) threshold comparisons: for N labeled pairs it computes Match exactly once per pair and reuses the resulting Score across all 101 thresholds.
Returns an error if pairs is empty, or if any pair fails the underlying Match call (mismatched FieldSetVersion, KeyID, fingerprint, params, etc.).
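The sweep itself can be sketched over precomputed aggregate scores. This standalone bestThreshold assumes the Match scores and ground-truth labels are already in hand, and keeps the threshold with the highest F1:

```go
package main

import "fmt"

// bestThreshold tries 101 thresholds from 0.00 to 1.00 in 0.01 steps
// and returns the one that maximizes F1 over the labeled scores.
func bestThreshold(scores []float64, labels []bool) (thr, bestF1 float64) {
	for t := 0; t <= 100; t++ {
		threshold := float64(t) / 100
		var tp, fp, fn float64
		for i, s := range scores {
			pred := s >= threshold
			switch {
			case pred && labels[i]:
				tp++
			case pred && !labels[i]:
				fp++
			case !pred && labels[i]:
				fn++
			}
		}
		var f1 float64
		if 2*tp+fp+fn > 0 {
			f1 = 2 * tp / (2*tp + fp + fn)
		}
		if f1 > bestF1 {
			bestF1, thr = f1, threshold
		}
	}
	return thr, bestF1
}

func main() {
	scores := []float64{0.92, 0.81, 0.34, 0.12}
	labels := []bool{true, true, false, false}
	thr, f1 := bestThreshold(scores, labels)
	fmt.Printf("threshold %.2f, F1 %.2f\n", thr, f1)
}
```

Note the Dice scores are computed once per pair; only the cheap threshold comparison is repeated 101 times.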
type LabeledPair ¶
type LabeledPair struct {
	A     sriracha.ProbabilisticToken `json:"a"`
	B     sriracha.ProbabilisticToken `json:"b"`
	Match bool                        `json:"match"`
}
LabeledPair is one row of ground-truth: two ProbabilisticTokens believed to be either the same person (Match=true) or different people (Match=false).
type MatchResult ¶
type MatchResult struct {
	Score            float64              `json:"score"`
	PerField         []float64            `json:"per_field"`
	Paths            []sriracha.FieldPath `json:"paths"`
	IsMatch          bool                 `json:"is_match"`
	ComparableFields int                  `json:"comparable_fields"`
}
MatchResult holds the output of Match: per-field Dice scores, the weighted aggregate Score in [0, 1], the threshold decision, the FieldSet paths in the same order as PerField, and a count of fields that contributed to the weighted average (excludes both-absent fields and fields with non-positive weight).
func Match ¶
func Match(a, b sriracha.ProbabilisticToken, fs sriracha.FieldSet, threshold float64) (MatchResult, error)
Match is the canonical entry point for probabilistic comparison: it wraps DicePerField + Score and returns the threshold decision in a single call.
Match compares a and b under fs and returns per-field Dice scores, the weighted aggregate, and a threshold decision. Fields with all-zero filters on both sides are treated as absent and drop from the weighted average; asymmetric absence (zero on one side, populated on the other) keeps its score of 0 and counts as a real mismatch signal.
If every field is both-absent (or zero-weighted), the returned MatchResult has Score=0, IsMatch=false, ComparableFields=0 — never an error. The error return is reserved for genuine mismatches: threshold out of range, version / key / fingerprint / params drift, or field-count disagreement between the tokens and fs.
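The absent-field rule can be sketched in isolation. This standalone aggregate takes precomputed per-field Dice scores plus two absence masks (all-zero filter on side A / side B) in place of the tokens themselves:

```go
package main

import "fmt"

// aggregate drops both-absent and zero-weighted fields from the weighted
// average, while asymmetric absence keeps its Dice score of 0 as a real
// mismatch signal.
func aggregate(perField, weights []float64, absentA, absentB []bool) (score float64, comparable int) {
	var num, den float64
	for i, w := range weights {
		if w <= 0 || (absentA[i] && absentB[i]) {
			continue // not a comparable field
		}
		num += w * perField[i] // asymmetric absence arrives here as 0
		den += w
		comparable++
	}
	if den == 0 {
		return 0, 0 // mirrors Score=0, ComparableFields=0, never an error
	}
	return num / den, comparable
}

func main() {
	perField := []float64{1.0, 0.0, 0.7}
	weights := []float64{1, 1, 1}
	absentA := []bool{false, true, false}
	absentB := []bool{false, true, false} // field 1 absent on both sides
	s, n := aggregate(perField, weights, absentA, absentB)
	fmt.Printf("score %.2f over %d comparable fields\n", s, n)
}
```

The design choice matters: dropping both-absent fields keeps sparse records comparable, while scoring asymmetric absence as 0 prevents a record with many missing fields from matching everything.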
func (MatchResult) ByPath ¶
func (r MatchResult) ByPath() map[sriracha.FieldPath]float64
ByPath returns a fresh map keyed by FieldPath with each path's Dice score. Useful for downstream code that wants to look up scores without scanning.
func (MatchResult) ScoreFor ¶
func (r MatchResult) ScoreFor(path sriracha.FieldPath) (float64, bool)
ScoreFor returns the per-field Dice score for path along with true if the path appears in the result. Paths with zero or negative weight that were dropped from the weighted average still appear here with their raw Dice score.
type ROCPoint ¶
type ROCPoint struct {
	Threshold float64 `json:"threshold"`
	Precision float64 `json:"precision"`
	Recall    float64 `json:"recall"`
	F1        float64 `json:"f1"`
}
ROCPoint is one threshold and the precision/recall/F1 it produces over the supplied labeled pairs.
type Tokenizer ¶
type Tokenizer interface {
	// TokenizeDeterministic tokenizes a RawRecord in deterministic mode (HMAC-SHA256
	// per field). The returned token's Fields slice is aligned with fs.Fields:
	// each entry is a 32-byte HMAC for a present field, or nil for an absent
	// optional field. Missing required fields return an error.
	TokenizeDeterministic(record sriracha.RawRecord, fs sriracha.FieldSet) (sriracha.DeterministicToken, error)

	// TokenizeProbabilistic tokenizes a RawRecord in probabilistic (Bloom filter)
	// mode. The returned token's Fields slice is aligned with fs.Fields:
	// present fields contain the populated filter, absent optional fields
	// contain an all-zero filter of the same length. Missing required fields
	// return an error.
	TokenizeProbabilistic(record sriracha.RawRecord, fs sriracha.FieldSet) (sriracha.ProbabilisticToken, error)

	// TokenizeField returns the deterministic 32-byte HMAC for a single
	// (value, path) pair, after running the same normalization pipeline
	// TokenizeDeterministic uses. Useful for stable indexing of one field outside
	// the FieldSet flow.
	TokenizeField(value string, path sriracha.FieldPath) ([]byte, error)

	// Destroy wipes the secret buffer that backs this Tokenizer. Pooled HMAC
	// instances created from the secret may still hold derived key material
	// (inner/outer pad) on the heap until garbage-collected. The Tokenizer
	// must not be used after this call.
	Destroy()
}
Tokenizer produces tokens from RawRecords using a shared secret. Call Destroy when finished to wipe the source secret buffer; if you forget, a runtime cleanup wipes it once the Tokenizer becomes unreachable.
Tokenizer is safe for concurrent use by multiple goroutines until Destroy is called; HMAC instances are pooled internally. Calling any tokenize method after Destroy is undefined.
Most callers want a session.Session — it bundles a Tokenizer with a FieldSet so you don't have to thread the schema through every call.
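The deterministic per-field scheme can be sketched with the standard library. This standalone tokenizeField normalizes a value, then HMACs it under the shared secret with the field path mixed in, so equal values in different fields yield different tokens. The normalization steps and the path/value separator shown here are illustrative assumptions, not the package's exact pipeline:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
	"strings"
)

// tokenizeField returns the 32-byte HMAC-SHA256 of a normalized
// (path, value) pair under the shared secret.
func tokenizeField(secret []byte, path, value string) []byte {
	norm := strings.ToLower(strings.TrimSpace(value))
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(path))
	mac.Write([]byte{0}) // separator between path and value
	mac.Write([]byte(norm))
	return mac.Sum(nil)
}

func main() {
	t1 := tokenizeField([]byte("shared-secret"), "name.first", "  Ada ")
	t2 := tokenizeField([]byte("shared-secret"), "name.first", "ada")
	fmt.Println(len(t1), hmac.Equal(t1, t2)) // 32 true
}
```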
func New ¶
func New(secret []byte) (Tokenizer, error)
New creates a Tokenizer with the given HMAC secret. The secret is copied into a locked, non-swappable memory region and the source slice is wiped. Returns an error if secret is empty.
A runtime finalizer wipes the locked buffer if the returned Tokenizer becomes unreachable without an explicit Destroy call.