Semantic Firewall
Next-Gen Code Integrity & Malware Detection for Go
Protect your codebase from hidden risks.
Semantic Firewall goes beyond traditional static analysis: it understands your code's intent and behavior, not just the text. Instantly spot risky changes, catch malware, and prove your code is safe--no matter how it's refactored or renamed.
Fewer false positives. Fewer missed backdoors.
> [!CAUTION]
> Disclaimer: This tool is provided for defensive security research and authorized testing only. The malware scanning features are designed to help security teams detect and analyze malicious code patterns. Do not use this tool to create, distribute, or deploy malware. Users are responsible for ensuring compliance with all applicable laws and organizational policies. The author assumes no liability for misuse.
What is Semantic Firewall?
Semantic Firewall is a new kind of static analysis tool for Go that:
- Fingerprints code by behavior, not by text -- so you can refactor, rename, or reformat without breaking your security gates.
- Detects malware and backdoors -- even if attackers try to hide them with obfuscation or clever tricks.
- Flags risky changes in pull requests -- so you know when something really changes, not just when someone moves code around.
Why is this different?
- Traditional tools (like `git diff`, `grep`, or hash-based scanners) are easily fooled by renaming, whitespace, or code shuffling.
- Semantic Firewall understands the structure and intent of your code. It knows when a function is logically the same--even if it looks different.
- It also knows when something dangerous is added, like a hidden network call or a suspicious loop.
Key Features:
- Behavioral fingerprinting: Prove your code hasn't changed in intent, even after big refactors.
- Malware & backdoor detection: Find threats by structure, not just by name or hash.
- Risk-aware diffs: Instantly see if a PR adds risky logic, not just lines changed.
- Obfuscation & entropy analysis: Spot packed or encrypted payloads.
- Zero-config persistent database: Fast, scalable PebbleDB backend.
- gVisor Sandboxing: Isolated execution for untrusted code analysis.
See full feature table
| Feature | Description |
|---|---|
| Loop Equivalence | Proves loops are logically identical, no matter the syntax |
| Semantic Diff | Diffs code by behavior, not by text |
| Malware Indexing | Store and hunt for known malware patterns |
| Obfuscation Detection | Flags suspiciously complex or packed code |
| Dependency Scanning | Scans your code and all its dependencies |
| gVisor Isolation | Sandboxed execution via runsc for defense-in-depth |
Getting Started
go install github.com/BlackVectorOps/semantic_firewall/v3/cmd/sfw@latest
Quick Start
# Check a file for risky changes
sfw check ./main.go
# See what *really* changed between two versions
sfw diff old_version.go new_version.go
# Index a known malware sample
sfw index malware.go --name "Beacon_v1" --severity CRITICAL
# Scan your codebase for malware (fast, O(1) matching)
sfw scan ./suspicious/
# Scan with dependency analysis
sfw scan ./cmd/myapp --deps
# See database stats
sfw stats
Workflow: Lab → Hunt
flowchart LR
subgraph Lab
M1[Known Malware] --> I[sfw index]
I --> DB[(signatures.db)]
end
subgraph Hunt
T[Target Code] --> S[sfw scan]
DB --> S
S --> A[Alerts]
end
%% Nodes: Deep radioactive green with neon border
classDef default fill:#022c22,stroke:#00ff41,stroke-width:1px,color:#ffffff
%% Subgraphs: Subdued slate blue containers
style Lab fill:#1e293b,stroke:#475569,stroke-width:2px,color:#94a3b8,stroke-dasharray:5 5
style Hunt fill:#1e293b,stroke:#475569,stroke-width:2px,color:#94a3b8
%% Database: Distinct cylinder styling
style DB fill:#064e3b,stroke:#00ff41,stroke-width:2px,color:#ffffff
linkStyle default stroke:#00ff41,stroke-width:1px
Show example outputs
Check Output:
{
"file": "./main.go",
"functions": [
{ "function": "main", "fingerprint": "005efb52a8c9d1e3...", "line": 12 }
]
}
Diff Output (Risk-Aware):
{
"summary": {
"semantic_match_pct": 92.5,
"preserved": 12,
"modified": 1,
"renamed_functions": 2,
"high_risk_changes": 1
},
"functions": [
{
"function": "HandleLogin",
"status": "modified",
"added_ops": ["Call <log.Printf>", "Call <net.Dial>"],
"risk_score": 15,
"topology_delta": "Calls+2, AddedGoroutine"
}
],
"topology_matches": [
{
"old_function": "processData",
"new_function": "handleInput",
"similarity": 0.94,
"matched_by_name": false
}
]
}
Scan Output (Malware Hunter):
{
"target": "./suspicious/",
"backend": "pebbledb",
"total_functions_scanned": 47,
"alerts": [
{
"signature_name": "Beacon_v1",
"severity": "CRITICAL",
"matched_function": "executePayload",
"confidence": 0.92,
"match_details": {
"topology_match": true,
"entropy_match": true,
"topology_similarity": 1.0,
"calls_matched": ["net.Dial", "os.Exec"]
}
}
],
"summary": { "critical": 1, "high": 0, "total_alerts": 1 }
}
Why Developers Need Semantic Firewall
- No more noisy diffs: See only what matters--real logic changes, not whitespace or renames.
- Catch what tests miss: Unit tests check correctness. Semantic Firewall checks intent and integrity.
- Malware can't hide: Renaming, obfuscation, or packing? Still detected.
- Refactor with confidence: Prove your big refactor didn't change what matters.
- CI/CD ready: Block PRs that sneak in risky logic.
| Traditional Tools | Semantic Firewall |
|---|---|
| `git diff` | `sfw diff` |
| Sees lines changed | Sees logic changed |
| Fooled by renames | Survives refactors |
| Hash-based AV | Behavioral fingerprints |
| YARA/grep | Topology & intent matching |
Use cases:
- Supply chain security: catch backdoors that pass code review
- Safe refactoring: prove your refactor is safe
- CI/CD gates: block risky PRs
- Malware hunting: scan codebases at scale
- Obfuscation detection: flag packed/encrypted code
- Dependency auditing: scan imported packages
Lie Detector Workflow
Worried about sneaky changes or hidden intent?
Supply chain attacks often hide behind boring commit messages like "fix typo" or "update formatting." `sfw audit` uses an LLM (or a deterministic simulation if you don't want to bother with an API key) to compare the developer's claim against the code's reality.
Command:
# Check if the commit message matches the code changes
sfw audit old.go new.go "minor refactor of logging" --api-key sk-...
The Verdict:
{
"inputs": {
"commit_message": "minor refactor of logging"
},
"risk_filter": {
"high_risk_detected": true,
"evidence_count": 1
},
"output": {
"verdict": "LIE",
"evidence": "Commit claims 'minor refactor' but evidence shows addition of high-risk network calls (net.Dial) and goroutines."
}
}
How it works:
- Semantic Diff: Calculates exact structural changes (ignoring whitespace).
- Risk Filter: Isolates high-risk deltas (Network, FS, Concurrency).
- Intent Verification: Asks the AI: "Does 'fix typo' explain adding a reverse shell?"
Typical CI/CD workflow:
jobs:
semantic-firewall:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run Semantic Firewall
run: |
go install github.com/BlackVectorOps/semantic_firewall/v3/cmd/sfw@latest
sfw diff old.go new.go
sfw scan . --deps
Result:
- PRs that only change formatting or names pass instantly.
- PRs that add risky logic or malware get flagged for review.
| Command | Purpose | Time Complexity | Space |
|---|---|---|---|
| `sfw check` | Generate semantic fingerprints | O(N) | O(N) |
| `sfw diff` | Semantic delta via Zipper algorithm | O(I) | O(I) |
| `sfw index` | Index malware samples into PebbleDB | O(N) | O(1) per sig |
| `sfw scan` | Hunt malware via topology matching | O(1) exact / O(M) fuzzy | O(M) |
| `sfw migrate` | Migrate JSON signatures to PebbleDB | O(S) | O(S) |
| `sfw stats` | Display database statistics | O(1) | O(1) |
| `sfw audit` | Verify commit intent (AI Lie Detector) | O(I) + API | O(I) |
Where N = source size, I = instructions, S = signatures, M = signatures in entropy range.
Command Details
sfw check [--strict] [--scan --db <path>] [--no-sandbox] <file.go|directory>
Generate semantic fingerprints. Use --strict for validation mode. Use --scan to enable unified security scanning during fingerprinting. Use --no-sandbox to disable gVisor isolation.
sfw diff [--no-sandbox] <old.go> <new.go>
Compute semantic delta using the Zipper algorithm with topology-based function matching. Outputs risk scores and structural deltas.
sfw index <file.go> --name <name> --severity <CRITICAL|HIGH|MEDIUM|LOW> [--category <cat>] [--db <path>]
Index a reference malware sample. Generates topology hash, fuzzy hash, and entropy score.
sfw scan <file.go|directory> [--db <path>] [--threshold <0.0-1.0>] [--exact] [--deps] [--deps-depth <direct|transitive>] [--no-sandbox]
Scan target code for malware signatures. Use --exact for O(1) topology-only matching. Use --deps to scan imported dependencies. Use --no-sandbox to disable gVisor isolation.
sfw migrate --from <json> --to <db>
Migrate legacy JSON database to PebbleDB format for O(1) lookups.
sfw audit <old.go> <new.go> "<commit message>" [--api-key <key>] [--model <model>]
Verify if a commit message matches the structural code changes. Uses an LLM (default: gpt-4o, supports gemini-1.5-pro) to detect deception (e.g., hiding a backdoor in a "typo fix"). API key can also be set via OPENAI_API_KEY or GEMINI_API_KEY environment variables.
gVisor Sandboxing
By default, sfw check, sfw diff, and sfw scan execute untrusted code analysis inside a gVisor sandbox (runsc) for defense-in-depth. This provides:
- Syscall filtering: Only whitelisted system calls are permitted
- Memory isolation: 512MB limit prevents resource exhaustion
- Network isolation: No outbound connections during analysis
- Filesystem isolation: Read-only access to target files only
Requirements:
- gVisor's `runsc` must be installed and available in `$PATH`
- On Linux, KVM acceleration is preferred (falls back to ptrace)
Disabling the sandbox:
# For development/debugging or environments without gVisor
sfw check ./main.go --no-sandbox
sfw scan ./suspicious/ --no-sandbox
Note: The sandbox is automatically skipped if `runsc` is not available, with a warning message.
Signature Database Configuration
The CLI automatically resolves the signature database location (signatures.db) in the following order:
1. Explicit Flag: `--db /custom/path/signatures.db`
2. Environment Variable: `SFW_DB_PATH`
3. Local Directory: `./signatures.db`
4. User Home: `~/.sfw/signatures.db`
5. System Paths: `/usr/local/share/sfw/signatures.db` or `/var/lib/sfw/signatures.db`
This allows you to manage updates independently of the binary.
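A minimal sketch of that lookup order (hypothetical helper for illustration, not the actual sfw source):

```go
import (
	"os"
	"path/filepath"
)

// resolveDBPath mirrors the resolution order described above.
func resolveDBPath(flag string) string {
	if flag != "" {
		return flag // 1. explicit --db flag wins
	}
	if p := os.Getenv("SFW_DB_PATH"); p != "" {
		return p // 2. environment variable
	}
	home, _ := os.UserHomeDir()
	for _, p := range []string{
		"./signatures.db",                            // 3. local directory
		filepath.Join(home, ".sfw", "signatures.db"), // 4. user home
		"/usr/local/share/sfw/signatures.db",         // 5. system paths
		"/var/lib/sfw/signatures.db",
	} {
		if _, err := os.Stat(p); err == nil {
			return p
		}
	}
	return "./signatures.db" // fall back to creating a local DB
}
```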
sfw stats --db <path>
Display database statistics including signature count and index sizes.
Deep Technical Dive (Math, Graphs & Algorithms)
Persistent Signature Database (PebbleDB)
The scanner uses PebbleDB, CockroachDB's embedded LSM-tree key-value store, for signature storage. This enables:
- O(1) exact topology lookups via indexed hash keys
- O(M) fuzzy matching via range scans on entropy indexes
- Atomic writes: no partial updates on crash
- Concurrent reads: safe for parallel scanning
- Zero configuration: single directory, no server required
- Gob+PackedIdx encoding: Efficient binary serialization
Database Schema
Pebble uses a flat key-space, and we simulate buckets by using prefixes:
sig:ID               -> JSON blob (full signature)
topo:TopologyHash:ID -> ID (O(1) exact match)
fuzzy:FuzzyHash:ID   -> ID (LSH bucket index)
entr:EntropyKey      -> ID (range scan index)
meta:key             -> value (version, stats, maintenance info)
Entropy Key Encoding
Entropy scores are stored as fixed-width keys for proper lexicographic ordering:
Key: "05.1234:SFW-MAL-001"
├──────┤ ├─────────┤
entropy unique ID
This enables efficient range scans: find all signatures with entropy 5.0 ± 0.5.
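A tiny illustration of why the fixed width matters: zero-padding makes lexicographic order agree with numeric order, so a plain key range scan covers the window (the exact key width follows the store's encoding):

```go
package main

import "fmt"

func main() {
	lo := fmt.Sprintf("%07.4f", 5.0-0.5) // "04.5000"
	hi := fmt.Sprintf("%07.4f", 5.0+0.5) // "05.5000"
	key := "05.1234"                     // entropy prefix of the example key above
	fmt.Println(lo <= key && key <= hi)  // true -> signature is in range
}
```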
Database Operations
# View statistics
sfw stats --db signatures.db
{
"signature_count": 142,
"topology_index_count": 142,
"entropy_index_size": 28672,
"file_size_human": "2.1 MB"
}
# Migrate from legacy JSON (one-time operation)
sfw migrate --from old_signatures.json --to signatures.db
# Export for backup/compatibility
# (Programmatic API: scanner.ExportToJSON("backup.json"))
Programmatic Database Access
// Open database with options
opts := pebbledb.PebbleScannerOptions{
MatchThreshold: 0.75, // Minimum confidence for alerts
EntropyTolerance: 0.5, // Fuzzy entropy window
ReadOnly: false, // Set true for scan-only mode
CacheSize: 8 << 20, // 8MB cache
}
scanner, err := pebbledb.NewPebbleScanner("signatures.db", opts)
if err != nil {
    log.Fatal(err)
}
defer scanner.Close()
// Add signatures (single or bulk)
scanner.AddSignature(sig)
scanner.AddSignatures(sigs) // Atomic batch insert
// Lookup operations
sig, _ := scanner.GetSignature("SFW-MAL-001")
sigByTopo, _ := scanner.GetSignatureByTopology(topoHash)
ids, _ := scanner.ListSignatureIDs()
count, _ := scanner.CountSignatures()
// Maintenance
scanner.DeleteSignature("SFW-MAL-001")
scanner.MarkFalsePositive("SFW-MAL-001", "benign library")
scanner.RebuildIndexes() // Recover from corruption
scanner.Compact("signatures-compacted.db")
Malware Scanning: Two-Phase Detection
The Semantic Firewall includes a behavioral malware scanner that matches code by its structural topology, not just strings or hashes. The scanner uses a two-phase detection algorithm:
Phase 1: O(1) Exact Topology Match
1. Extract FunctionTopology from target SSA function
2. Compute topology hash: SHA-256(blockCount || callProfile || controlFlowFlags)
3. PebbleDB lookup: topo:TopologyHash:ID → signature IDs
4. Return exact matches with 100% topology confidence
Phase 2: O(1) Fuzzy Bucket Match (LSH-lite)
1. Compute fuzzy hash: GenerateFuzzyHash(topology) → "B3L1BR2"
2. PebbleDB prefix scan: fuzzy:FuzzyHash:* → candidate IDs
3. Verify call signature overlap and entropy distance
4. Return matches above confidence threshold
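A minimal sketch of the Phase 1 hash, assuming a simple field layout (the real serialization may differ). The point is that no identifiers feed the hash, so renames don't change it:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// topologyHash hashes only structural features: block count, call
// profile, and control-flow flags. Field layout is illustrative.
func topologyHash(blockCount int, callProfile map[string]int, flags uint32) string {
	h := sha256.New()
	fmt.Fprintf(h, "blocks:%d|", blockCount)
	// Sort call names so the hash is deterministic across map iteration order.
	names := make([]string, 0, len(callProfile))
	for name := range callProfile {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Fprintf(h, "call:%s=%d|", name, callProfile[name])
	}
	fmt.Fprintf(h, "flags:%08x", flags)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	calls := map[string]int{"net.Dial": 1, "os.Exec": 1}
	// Same hash whether the function is named backdoor() or helper().
	fmt.Println(topologyHash(8, calls, 0x3))
}
```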
Why this survives evasion:
- Renaming evasion fails: `backdoor()` → `helper()` still matches (names aren't part of topology)
- Obfuscation-resistant: Variable renaming and code shuffling don't change block/call structure
- O(1) at scale: PebbleDB indexes enable instant lookups even with thousands of signatures
Fuzzy Hash Buckets (LSH-lite)
The fuzzy hash creates locality-sensitive buckets based on quantized structural metrics:
FuzzyHash = "B{log2(blocks)}L{loops}BR{log2(branches)}"
Examples:
B3L1BR2 = 8-15 blocks, 1 loop, 4-7 branches
B4L2BR3 = 16-31 blocks, 2 loops, 8-15 branches
Log2 buckets reduce sensitivity to small changes while preserving structural similarity.
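A hedged reconstruction of this bucketing (the exact quantization in sfw may differ):

```go
package main

import (
	"fmt"
	"math/bits"
)

// log2Bucket maps n to floor(log2(n)), so 8-15 -> 3, 16-31 -> 4, etc.
func log2Bucket(n int) int {
	if n <= 0 {
		return 0
	}
	return bits.Len(uint(n)) - 1
}

func fuzzyHash(blocks, loops, branches int) string {
	return fmt.Sprintf("B%dL%dBR%d", log2Bucket(blocks), loops, log2Bucket(branches))
}

func main() {
	fmt.Println(fuzzyHash(12, 1, 5)) // B3L1BR2 (8-15 blocks, 1 loop, 4-7 branches)
	fmt.Println(fuzzyHash(20, 2, 9)) // B4L2BR3 (16-31 blocks, 2 loops, 8-15 branches)
}
```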
Dependency Scanning
Scan not just your code, but all imported dependencies:
# Scan local code + direct imports
sfw scan ./cmd/myapp --deps --db signatures.db
# Deep scan: include transitive dependencies
sfw scan . --deps --deps-depth transitive --db signatures.db
# Fast exact-match mode for large codebases
sfw scan . --deps --exact --db signatures.db
Output with dependencies:
{
"target": "./cmd/myapp",
"total_functions_scanned": 1247,
"dependencies_scanned": 892,
"scanned_dependencies": [
"github.com/example/suspicious-lib",
"github.com/another/dependency"
],
"alerts": [...],
"summary": { "critical": 0, "high": 1, "total_alerts": 1 }
}
| Flag | Description |
|---|---|
| `--deps` | Enable dependency scanning |
| `--deps-depth direct` | Scan only direct imports (default) |
| `--deps-depth transitive` | Scan all transitive dependencies |
| `--exact` | O(1) exact topology match only (fastest) |
Note: Dependency scanning requires modules to be downloaded (`go mod download`). Stdlib packages are automatically excluded.
Step 1: Index Known Malware (Lab Phase)
# Index a beacon/backdoor sample
sfw index samples/dirty/dirty_beacon.go \
--name "DirtyBeacon" \
--severity CRITICAL \
--category malware \
--db signatures.db
# Output:
{
"message": "Indexed 1 functions from samples/dirty/dirty_beacon.go",
"indexed": [{
"name": "DirtyBeacon_Run",
"topology_hash": "topo:9a8b7c6d5e4f3a2b...",
"fuzzy_hash": "B3L1BR2",
"entropy_score": 5.82,
"identifying_features": {
"required_calls": ["net.Dial", "os.Exec", "time.Sleep"],
"control_flow": { "has_infinite_loop": true, "has_reconnect_logic": true }
}
}],
"backend": "pebbledb",
"total_signatures": 1
}
Step 2: Scan Suspicious Code (Hunter Phase)
# Scan an entire directory
sfw scan ./untrusted_vendor/ --db signatures.db --threshold 0.75
# Fast mode: exact topology match only (O(1) per function)
sfw scan ./large_codebase/ --db signatures.db --exact
Shannon Entropy Analysis
The scanner calculates Shannon entropy for each function to detect obfuscation and packed code:
$$H = -\sum_{i} p(x_i) \log_2 p(x_i)$$
Where $p(x_i)$ is the probability of byte value $x_i$ appearing in the function's string literals.
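A direct implementation of the formula, applied here to raw bytes (sfw applies it to a function's string literals):

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns H in bits per byte, ranging from 0.0 to 8.0.
func shannonEntropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	var h float64
	n := float64(len(data))
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy([]byte("aaaaaaaa")))     // 0.00: LOW
	fmt.Printf("%.2f\n", shannonEntropy([]byte("hello, world"))) // ~3.0: simple text
	// Base64 blobs and encrypted payloads approach 6.0-8.0 (HIGH/PACKED).
}
```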
Entropy Spectrum:
LOW NORMAL HIGH PACKED
◀─────────────────────────────────────────────────────────▶
│ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░▓▓▓▓▓▓▓▓▓▓▓████████│
0 4.0 6.5 7.5 8.0
│ │ │ │ │
│ Simple funcs │ Normal code │ Obfusc. │ Encrypted│
│ (getters/setters)│ (business │ (base64 │ (packed │
│ │ logic) │ strings)│ payloads)│
| Entropy Range | Classification | Meaning | Example |
|---|---|---|---|
| < 4.0 | LOW | Simple/sparse code | func Get() int { return x } |
| 4.0 - 6.5 | NORMAL | Typical compiled code | Business logic, handlers |
| 6.5 - 7.5 | HIGH | Potentially obfuscated | Base64 blobs, encoded strings |
| > 7.5 | PACKED | Likely packed/encrypted | Encrypted payloads, shellcode |
Functions with HIGH or PACKED entropy combined with suspicious call patterns receive elevated confidence scores.
Call Signature Resolution
The topology extractor resolves call targets to stable identifiers:
| Call Type | Resolution | Example |
|---|---|---|
| Static function | `pkg.Func` | `net.Dial` |
| Interface invoke | `invoke:Type.Method` | `invoke:io.Reader.Read` |
| Builtin | `builtin:name` | `builtin:len` |
| Closure | `closure:signature` | `closure:func(int) error` |
| Go statement | `go:target` | `go:handler.serve` |
| Defer statement | `defer:target` | `defer:conn.Close` |
| Reflection | `reflect:Call` | Dynamic dispatch |
| Dynamic | `dynamic:type` | Unknown target |
This stable resolution ensures signatures match even when:
- Code is moved between packages
- Interface implementations change
- Method receivers are renamed
Topology Matching
The diff command now uses structural topology matching to detect renamed or obfuscated functions.
How Topology Matching Works:
┌─────────────────────────────────────────────────────────────────────────────┐
│ TOPOLOGY FINGERPRINT EXTRACTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ func processData(input []byte) error { ──► TOPOLOGY VECTOR │
│ conn, _ := net.Dial("tcp", addr) │
│ for _, b := range input { ┌──────────────────┐ │
│ conn.Write([]byte{b}) │ Params: 1 │ │
│ } │ Returns: 1 │ │
│ return conn.Close() │ Blocks: 4 │ │
│ } │ Loops: 1 │ │
│ │ Calls: │ │
│ │ net.Dial: 1 │ │
│ func handleInput(data []byte) error { │ Write: 1 │ │
│ c, _ := net.Dial("tcp", server) │ Close: 1 │ │
│ for i := 0; i < len(data); i++ { │ Entropy: 5.2 │ │
│ c.Write([]byte{data[i]}) └──────────────────┘ │
│ } │ │
│ return c.Close() ▼ │
│ } ┌──────────────────────┐ │
│ │ SIMILARITY: 94% │ │
│ Different names, SAME topology ───────────────│ ✓ MATCH DETECTED │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Topology vs Name-Based Matching:
flowchart LR
subgraph old[Old Version]
O1[processData]
O2[sendPacket]
O3[initConn]
end
subgraph new[New Version]
N1[handleInput]
N2[xmit]
N3[setup]
end
O1 --> N1
O2 --> N2
O3 --> N3
linkStyle 0 stroke:#10b981,stroke-width:3px
linkStyle 1 stroke:#10b981,stroke-width:3px
linkStyle 2 stroke:#10b981,stroke-width:3px
style old fill:#1e1b4b,stroke:#8b5cf6,stroke-width:2px
style new fill:#064e3b,stroke:#10b981,stroke-width:2px
Legend: Green lines = topology match (94%+ similarity). Function names changed but structural fingerprints match.
sfw diff old_version.go refactored_version.go
{
"summary": {
"preserved": 8,
"modified": 2,
"renamed_functions": 3,
"topology_matched_pct": 85.7
},
"topology_matches": [
{
"old_function": "processData",
"new_function": "handleInput",
"similarity": 0.94,
"matched_by_name": false
}
]
}
Functions are matched by their structural fingerprint (block count, call profile, control flow features) rather than just name, enabling detection of:
- Renamed functions
- Copy-pasted code with modified names
- Obfuscated variants of known patterns
Library Usage
Fingerprinting
import semanticfw "github.com/BlackVectorOps/semantic_firewall/v3"
src := `package main
func Add(a, b int) int { return a + b }
`
results, err := semanticfw.FingerprintSource("example.go", src, semanticfw.DefaultLiteralPolicy)
if err != nil {
log.Fatal(err)
}
for _, r := range results {
fmt.Printf("%s: %s\n", r.FunctionName, r.Fingerprint)
}
Malware Scanning with PebbleDB
import semanticfw "github.com/BlackVectorOps/semantic_firewall/v3"
// Open the signature database
scanner, err := semanticfw.NewPebbleScanner("signatures.db", semanticfw.DefaultPebbleScannerOptions())
if err != nil {
log.Fatal(err)
}
defer scanner.Close()
// Extract topology from a function
topo := semanticfw.ExtractTopology(ssaFunction)
// O(1) exact topology match
if alert := scanner.ScanTopologyExact(topo, "suspiciousFunc"); alert != nil {
fmt.Printf("ALERT: %s matched %s (confidence: %.2f)\n",
alert.MatchedFunction, alert.SignatureName, alert.Confidence)
}
// Full scan: exact + fuzzy entropy matching
alerts := scanner.ScanTopology(topo, "suspiciousFunc")
for _, alert := range alerts {
fmt.Printf("[%s] %s: %s\n", alert.Severity, alert.SignatureName, alert.MatchedFunction)
}
Topology Extraction
import semanticfw "github.com/BlackVectorOps/semantic_firewall/v3"
// Extract structural features from an SSA function
topo := semanticfw.ExtractTopology(ssaFunction)
fmt.Printf("Blocks: %d, Loops: %d, Entropy: %.2f\n",
topo.BlockCount, topo.LoopCount, topo.EntropyScore)
fmt.Printf("Calls: %v\n", topo.CallSignatures)
fmt.Printf("Entropy Class: %s\n", topo.EntropyProfile.Classification)
fmt.Printf("Fuzzy Hash: %s\n", topo.FuzzyHash)
Unified Pipeline: Check + Scan
Enable security scanning during fingerprinting for a unified integrity + security workflow:
// CLI: sfw check --scan --db signatures.db ./main.go
// Programmatically:
results, _ := semanticfw.FingerprintSourceAdvanced(
path, src, semanticfw.DefaultLiteralPolicy, strictMode)
for _, r := range results {
// Integrity: Get fingerprint
fmt.Printf("Function %s: %s\n", r.FunctionName, r.Fingerprint)
// Security: Scan for malware
fn := r.GetSSAFunction()
topo := semanticfw.ExtractTopology(fn)
alerts := scanner.ScanTopology(topo, r.FunctionName)
for _, alert := range alerts {
fmt.Printf("ALERT: %s\n", alert.SignatureName)
}
}
Signature Structure
type Signature struct {
ID string // "SFW-MAL-001"
Name string // "Beacon_v1_Run"
Description string // Human-readable description
Severity string // "CRITICAL", "HIGH", "MEDIUM", "LOW"
Category string // "malware", "backdoor", "dropper"
TopologyHash string // SHA-256 of topology vector
FuzzyHash string // LSH bucket key "B3L1BR2"
EntropyScore float64 // 0.0-8.0
EntropyTolerance float64 // Fuzzy match window (default: 0.5)
NodeCount int // Basic block count
LoopDepth int // Maximum nesting depth
IdentifyingFeatures IdentifyingFeatures // Behavioral markers
Metadata SignatureMetadata // Provenance info
}
type IdentifyingFeatures struct {
RequiredCalls []string // Must be present (VETO if missing)
OptionalCalls []string // Bonus if present
StringPatterns []string // Suspicious strings
ControlFlow *ControlFlowHints // Structural patterns
}
type ControlFlowHints struct {
HasInfiniteLoop bool // Beacon/C2 indicator
HasReconnectLogic bool // Persistence indicator
}
Architecture & Algorithms
Pipeline Overview
flowchart LR
A[Source] --> B[SSA]
B --> C[Loop Analysis]
C --> D[SCEV]
D --> E[Canonicalization]
E --> F[SHA-256]
B -.-> B1[go/ssa]
C -.-> C1[Tarjans SCC]
D -.-> D1[Symbolic Evaluation]
E -.-> E1[Virtual IR Normalization]
%% Main pipeline nodes: Deep blue with cyan accents
classDef pipeline fill:#0c1929,stroke:#00d4ff,stroke-width:1px,color:#ffffff
class A,B,C,D,E,F pipeline
%% Annotation nodes: Transparent with subtle blue glow
classDef annotation fill:#1e293b,stroke:#475569,stroke-width:1px,color:#94a3b8,stroke-dasharray:3 3
class B1,C1,D1,E1 annotation
linkStyle default stroke:#00d4ff,stroke-width:1px
linkStyle 5,6,7,8 stroke:#475569,stroke-width:1px,stroke-dasharray:3 3
- SSA Construction: `golang.org/x/tools/go/ssa` converts source to Static Single Assignment form with explicit control flow graphs
- Loop Detection: Natural loop identification via backedge detection (edge B->H where H dominates B)
- SCEV Analysis: Algebraic characterization of loop variables as closed-form recurrences
- Canonicalization: Deterministic IR transformation: register renaming, branch normalization, loop virtualization
- Fingerprint: SHA-256 of canonical IR string
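Putting the pipeline together: two syntactically different but structurally identical functions should canonicalize to the same fingerprint. A sketch using the public `FingerprintSource` API shown later under Library Usage:

```go
package main

import (
	"fmt"

	semanticfw "github.com/BlackVectorOps/semantic_firewall/v3"
)

func main() {
	// Same loop structure, different identifiers and increment syntax.
	srcA := `package p
func Sum(xs []int) int {
	t := 0
	for i := 0; i < len(xs); i++ { t += xs[i] }
	return t
}`
	srcB := `package p
func Total(values []int) int {
	acc := 0
	for idx := 0; idx < len(values); idx += 1 { acc += values[idx] }
	return acc
}`
	a, _ := semanticfw.FingerprintSource("a.go", srcA, semanticfw.DefaultLiteralPolicy)
	b, _ := semanticfw.FingerprintSource("b.go", srcB, semanticfw.DefaultLiteralPolicy)
	fmt.Println(a[0].Fingerprint == b[0].Fingerprint) // expected: true
}
```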
Scalar Evolution (SCEV) Engine
The SCEV framework (scev.go, 746 LOC) solves the "loop equivalence problem" -- proving that syntactically different loops compute the same sequence of values.
Core Abstraction: Add Recurrences
An induction variable is represented as $\{Start, +, Step\}_L$, meaning at iteration $k$ the value is:
$$Val(k) = Start + (Step \times k)$$
This representation is closed under affine transformations:
| Operation | Result |
|---|---|
| $\{S, +, T\} + C$ | $\{S+C, +, T\}$ |
| $C \times \{S, +, T\}$ | $\{C \times S, +, C \times T\}$ |
| $\{S_1, +, T_1\} + \{S_2, +, T_2\}$ | $\{S_1+S_2, +, T_1+T_2\}$ |
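A toy model of this closure property (the real engine folds SCEV over SSA values, but the algebra is the same):

```go
package main

import "fmt"

// AddRec represents {Start, +, Step}: Val(k) = Start + Step*k.
type AddRec struct{ Start, Step int64 }

func (r AddRec) AddConst(c int64) AddRec { return AddRec{r.Start + c, r.Step} }     // {S,+,T} + C
func (r AddRec) MulConst(c int64) AddRec { return AddRec{c * r.Start, c * r.Step} } // C * {S,+,T}
func (a AddRec) Add(b AddRec) AddRec     { return AddRec{a.Start + b.Start, a.Step + b.Step} }

// At evaluates the recurrence at iteration k.
func (r AddRec) At(k int64) int64 { return r.Start + r.Step*k }

func main() {
	i := AddRec{0, 1}            // for i := 0; ...; i++  ->  {0,+,1}
	offset := i.MulConst(4)      // i*4                   ->  {0,+,4}
	addr := offset.AddConst(100) // 100 + i*4             ->  {100,+,4}
	fmt.Println(addr.At(3))      // 112
}
```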
IV Detection Algorithm (Tarjan's SCC)
1. Build dependency graph restricted to loop body
2. Find SCCs via Tarjan's algorithm (O(V+E))
3. For each SCC containing a header Phi:
a. Extract cycle: Phi -> BinOp -> Phi
b. Classify: Basic ({S,+,C}), Geometric ({S,*,C}), Polynomial
c. Verify step is loop invariant
4. Propagate SCEV to derived expressions via recursive folding
Trip Count Derivation
For a loop for i := Start; i < Limit; i += Step:
$$TripCount = \left\lceil \frac{Limit - Start}{Step} \right\rceil$$
Computed via ceiling division: (Limit - Start + Step - 1) / Step
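A sketch of this derivation, including the normalizations described next (edge-case handling in sfw may differ):

```go
package main

import "fmt"

func tripCount(start, limit, step int64, inclusive bool) int64 {
	if step < 0 { // downcounting: normalize to an upcounting problem
		start, limit, step = -start, -limit, -step
	}
	if inclusive {
		limit++ // i <= N is i < N+1 (the "add 1 to numerator" rule)
	}
	if limit <= start || step == 0 {
		return 0
	}
	return (limit - start + step - 1) / step // ceil((limit-start)/step)
}

func main() {
	fmt.Println(tripCount(0, 10, 1, false))  // 10: for i := 0; i < 10; i++
	fmt.Println(tripCount(0, 10, 3, false))  // 4:  i = 0,3,6,9
	fmt.Println(tripCount(0, 10, 1, true))   // 11: for i := 0; i <= 10; i++
	fmt.Println(tripCount(10, 0, -2, false)) // 5:  i = 10,8,6,4,2
}
```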
The engine handles:
- Upcounting (`i < N`) and downcounting (`i > N`) loops
- Inclusive bounds (`i <= N` -- add 1 to numerator)
- Negative steps (normalized to absolute value)
- Multi-predecessor loop headers (validates consistent start values)
Canonicalization Engine
The canonicalizer (canonicalizer.go, 1162 LOC) transforms SSA into a deterministic string representation via five phases:
Phase 1: Loop & SCEV Analysis
c.loopInfo = DetectLoops(fn)
AnalyzeSCEV(c.loopInfo)
Phase 2: Semantic Normalization
- Invariant Hoisting: Pure calls like `len(s)` are virtually moved to the preheader
- IV Virtualization: Phi nodes for IVs are replaced with SCEV notation `{0, +, 1}`
- Derived IV Propagation: Expressions like `i*4` become `{0, +, 4}` in the output
Phase 3: Register Renaming
Parameters: p0, p1, p2, ...
Free Variables: fv0, fv1, ...
Instructions: v0, v1, v2, ... (DFS order)
Phase 4: Deterministic Block Ordering
Blocks are traversed in dominance-respecting DFS order, ensuring identical output regardless of SSA construction order. Successor ordering is normalized:
- `>=` branches are rewritten to `<` with swapped successors
- `>` branches are rewritten to `<=` with swapped successors
Phase 5: Virtual Control Flow
Branch normalization is applied virtually (no SSA mutation) via lookup tables:
virtualBlocks map[*ssa.BasicBlock]*virtualBlock // swapped successors
virtualBinOps map[*ssa.BinOp]token.Token // normalized operators
The Semantic Zipper
The Zipper (zipper.go, 568 LOC) computes a semantic diff between two functions -- what actually changed in behavior, ignoring cosmetic differences.
Algorithm: Parallel Graph Traversal
PHASE 0: Semantic Analysis
- Run SCEV on both functions independently
- Build canonicalizers for operand comparison
PHASE 1: Anchor Alignment
- Map parameters positionally: oldFn.Params[i] <-> newFn.Params[i]
- Map free variables if counts match
- Seed entry block via sequential matching (critical for main())
PHASE 2: Forward Propagation (BFS on Use-Def chains)
while queue not empty:
(vOld, vNew) = dequeue()
for each user uOld of vOld:
candidates = users of vNew with matching structural fingerprint
for uNew in candidates:
if areEquivalent(uOld, uNew):
map(uOld, uNew)
enqueue((uOld, uNew))
break
PHASE 2.5: Terminator Scavenging
- Explicitly match Return/Panic instructions via operand equivalence
- Handles cases where terminators aren't reached via normal propagation
PHASE 3: Divergence Isolation
- Added = newFn instructions not in reverse map
- Removed = oldFn instructions not in forward map
Equivalence Checking
Two instructions are equivalent iff:
- Same Go type (`reflect.TypeOf`)
- Same SSA value type (`types.Identical`)
- Same operation-specific properties (`BinOp.Op`, `Field` index, `Alloc.Heap`, etc.)
- All operands equivalent (recursive, with commutativity handling for ADD/MUL/AND/OR/XOR)
Structural Fingerprinting (DoS Prevention)
To prevent $O(N \times M)$ comparisons on high-fanout values, candidate users (instructions consuming a value) are bucketed by structural fingerprint:
fp := fmt.Sprintf("%T:%s", instr, op) // e.g., "*ssa.BinOp:+"
candidates := newByOp[fp] // Only compare compatible types
Bucket size is capped at 100 to bound worst case complexity.
Security Hardening
| Threat | Mitigation |
|---|---|
| Untrusted code execution | gVisor sandbox (runsc) isolates analysis |
| Algorithmic DoS (exponential SCEV) | Memoization cache per loop: loop.SCEVCache |
| Quadratic Zipper (5000 identical ADDs) | Fingerprint bucketing + MaxCandidates=100 |
| RCE via CGO | CGO_ENABLED=0 during packages.Load |
| SSRF via module fetch | GOPROXY=off prevents network calls |
| Stack overflow (cyclic graphs) | Visited sets in all recursive traversals |
| NaN comparison instability | Branch normalization restricted to `IsInteger`/`IsString` types |
| IR injection (fake instructions in strings) | Struct tags and literals sanitized before hashing |
| TypeParam edge cases | Generic types excluded from branch swap (may hide floats) |
| DB path traversal | Sensitive system directories (/etc, /usr) blocked |
| Resource exhaustion | 512MB memory limit, 64 PIDs max in sandbox |
Complexity Analysis
| Operation | Time | Space |
|---|---|---|
| SSA Construction | $O(N)$ | $O(N)$ |
| Loop Detection | $O(V+E)$ | $O(V)$ |
| SCEV Analysis | $O(L \times I)$ amortized | $O(I)$ per loop |
| Canonicalization | $O(I \times \log B)$ | $O(I + B)$ |
| Zipper | $O(I^2)$ worst, $O(I)$ typical | $O(I)$ |
| Topology Extract | $O(I)$ | $O(C)$ |
| Scan (PebbleDB exact) | $O(1)$ | $O(1)$ |
| Scan (fuzzy entropy) | $O(M)$ | $O(M)$ |
Where $N$ = source size, $V$ = blocks, $E$ = edges, $L$ = loops, $I$ = instructions, $B$ = blocks, $C$ = unique calls, $M$ = signatures in entropy range.
Malware Scanner Architecture
The scanner (pkg/storage/pebbledb/store.go, 1454 LOC) provides two-phase detection with ACID-compliant persistence:
Phase 1: O(1) Exact Topology Match
1. Extract FunctionTopology from target SSA function
2. Compute topology hash: SHA-256(blockCount || callProfile || controlFlowFlags)
3. PebbleDB lookup: topo:TopologyHash:ID → signature ID
4. Return exact matches with 100% topology confidence
Phase 2: O(1) Fuzzy Bucket Match (LSH-lite)
1. Compute fuzzy hash: GenerateFuzzyHash(topo) → "B3L1BR2"
2. PebbleDB prefix scan: fuzzy:FuzzyHash:* → candidate IDs
3. For each candidate:
a. Load signature from sig: prefix
b. Verify call signature overlap
c. Check entropy distance within tolerance
d. Compute composite confidence score
4. Return matches above threshold
PebbleDB Storage Schema
PebbleDB uses a flat key-space with prefixes to simulate logical buckets:
sig:ID → Gob-encoded signature blob (full signature)
topo:Hash:ID → PackedIndexValue (O(1) exact match index)
fuzzy:Hash:ID → PackedIndexValue (LSH bucket index)
entr:Key → ID (range scan index, "05.1234:SFW-MAL-001")
meta:key → value (version, stats, maintenance info)
Entropy Key Encoding
Entropy is stored as a fixed-width key for proper lexicographic ordering:
key := fmt.Sprintf("%08.4f:%s", entropy, id) // "05.8200:SFW-MAL-001"
This enables efficient range scans and ensures uniqueness even for identical entropy values.
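A hedged sketch of the resulting range scan using pebble's bounded iterator (pebble v1 `NewIter` API; the helper is illustrative, not sfw's internal code):

```go
import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

// signaturesInEntropyRange collects signature IDs whose entropy keys
// fall inside [lo, hi) under the fixed-width encoding shown above.
func signaturesInEntropyRange(db *pebble.DB, lo, hi float64) ([]string, error) {
	iter, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte(fmt.Sprintf("entr:%08.4f", lo)),
		UpperBound: []byte(fmt.Sprintf("entr:%08.4f", hi)), // exclusive upper bound
	})
	if err != nil {
		return nil, err
	}
	defer iter.Close()

	var ids []string
	for iter.First(); iter.Valid(); iter.Next() {
		ids = append(ids, string(iter.Value())) // value holds the signature ID
	}
	return ids, iter.Error()
}
```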
False Positive Feedback Loop
The scanner supports learning from mistakes:
// Mark a signature as generating false positives
scanner.MarkFalsePositive("SFW-MAL-001", "benign crypto library")
// This appends a timestamped note to the signature's metadata:
// "FP:2026-01-12T15:04:05Z:benign crypto library"
Confidence Score Calculation
The final confidence score is computed as a weighted average:
confidence = avg(
topologySimilarity, // 1.0 if exact hash match
entropyScore, // 1.0 - (distance / tolerance)
callMatchScore, // len(matched) / len(required)
stringPatternScore, // bonus for matched patterns
)
// VETO: If ANY required call is missing, confidence = 0.0
Topology Matching Algorithm
The topology matcher (topology.go, 673 LOC) enables function matching independent of names:
Feature Vector
type FunctionTopology struct {
FuzzyHash string // LSH bucket key
ParamCount int // Signature: param count
ReturnCount int // Signature: return count
BlockCount int // CFG complexity
InstrCount int // Code size
LoopCount int // Iteration patterns
BranchCount int // Decision points
PhiCount int // SSA merge points
CallSignatures map[string]int // "net.Dial" → 2
BinOpCounts map[string]int // "+" → 5
HasDefer bool // Error handling
HasRecover bool // Panic recovery
HasPanic bool // Failure paths
HasGo bool // Concurrency
HasSelect bool // Channel ops
HasRange bool // Iteration style
EntropyScore float64 // Obfuscation indicator
EntropyProfile EntropyProfile // Detailed entropy analysis
}
Similarity Score
Functions are compared via weighted Jaccard similarity, returning a normalized float from 0.0 to 1.0:
$$Similarity = \frac{\sum_i w_i \cdot match_i}{\sum_i w_i}$$
Output format: CLI and JSON outputs express similarity as a decimal (e.g., `0.94` = 94% match). The default topology match threshold is `0.6` (60%).
Where weights prioritize:
- Call profile (w=3): Most discriminative feature
- Control flow (w=2): defer/recover/panic/select/go
- Metrics (w=1): Block/instruction counts within 20% tolerance
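A simplified sketch of this weighted score using the weights above (the matching predicates here are assumptions, not sfw's exact logic):

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| over call-signature keys.
func jaccard(a, b map[string]int) float64 {
	inter, union := 0, 0
	for k := range a {
		union++
		if _, ok := b[k]; ok {
			inter++
		}
	}
	for k := range b {
		if _, ok := a[k]; !ok {
			union++
		}
	}
	if union == 0 {
		return 1
	}
	return float64(inter) / float64(union)
}

// within20pct scores 1.0 if the counts differ by at most 20%.
func within20pct(a, b int) float64 {
	lo, hi := a, b
	if lo > hi {
		lo, hi = hi, lo
	}
	if hi == 0 || float64(hi-lo)/float64(hi) <= 0.2 {
		return 1
	}
	return 0
}

func main() {
	oldCalls := map[string]int{"net.Dial": 1, "Write": 1, "Close": 1}
	newCalls := map[string]int{"net.Dial": 1, "Write": 1, "Close": 1}
	sameFlags := 1.0 // both use a loop; neither uses go/defer/recover

	num := 3*jaccard(oldCalls, newCalls) + 2*sameFlags + 1*within20pct(4, 5)
	den := 3.0 + 2.0 + 1.0
	fmt.Printf("similarity: %.2f\n", num/den) // 1.00 for structurally identical functions
}
```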
Risk-Aware Diff Scoring
When comparing function versions, structural changes receive risk scores:
| Change | Risk Points |
|---|---|
| New call | +5 each |
| New loop | +10 each |
| New goroutine | +15 |
| New defer | +3 |
| New panic | +5 |
| Entropy increase >1.0 | +10 |
High cumulative risk scores flag changes that warrant extra review.
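The table transcribes directly into a scoring function. This sketch uses a hypothetical `RiskDelta` type; the real diff engine derives these counts from SSA:

```go
package main

import "fmt"

// RiskDelta holds structural additions observed between two versions.
type RiskDelta struct {
	NewCalls, NewLoops, NewGoroutines, NewDefers, NewPanics int
	EntropyIncrease                                         float64
}

// riskScore applies the point values from the table above.
func riskScore(d RiskDelta) int {
	score := 5*d.NewCalls + 10*d.NewLoops + 15*d.NewGoroutines +
		3*d.NewDefers + 5*d.NewPanics
	if d.EntropyIncrease > 1.0 {
		score += 10
	}
	return score
}

func main() {
	// One new call plus one new loop crosses typical review thresholds.
	fmt.Println(riskScore(RiskDelta{NewCalls: 1, NewLoops: 1})) // 15
}
```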