GoldenFuzz Fuzzing Framework

Updated 1 January 2026

GoldenFuzz is a hardware fuzzing framework that employs a Golden Reference Model (GRM) to provide instruction-level ground truth for RTL processor verification and bug detection.
It uses a two-stage pipeline separating rapid, low-cost policy refinement with GPT-2-generated instruction blocks from high-impact, coverage-guided DUT-level fuzzing.
The approach delivers significant speedup and improved coverage, rediscovering known vulnerabilities and identifying new high-severity bugs in modern processor designs.

GoldenFuzz is a hardware fuzzing framework that employs a Golden Reference Model (GRM) as a fast oracle for RTL processor verification and vulnerability discovery. It is designed to overcome constraints in prior processor fuzzers, such as limited semantic awareness, inefficient test-case refinement, and prohibitive computational costs due to exhaustive hardware simulation. By introducing a two-stage pipeline that decouples low-cost policy refinement from high-impact coverage-guided bug finding, GoldenFuzz reduces the cost of test refinement while improving coverage and bug discovery (Wu et al., 25 Dec 2025, Tyagi et al., 2022).

1. Golden Reference Model: Foundation and Oracle Role

GoldenFuzz centers its pipeline on a Golden Reference Model (GRM), which is an ISA-compliant emulator (e.g., Spike for RISC-V, or1ksim for OpenRISC) that acts as a “digital twin” for the Device Under Test (DUT). The GRM provides a cycle-agnostic, instruction-level ground truth, furnishing architectural state tuples $(\mathrm{PC}_{k+1}, \mathrm{ARCH\_REGS}_{k+1}, \mathrm{MEM}_{k+1})$ after each emulated instruction. No translation of the HDL into a high-level language is required; the standard community emulator suffices as the GRM oracle.

During fuzzing, every instruction sequence generated by GoldenFuzz drives both the RTL design in a simulator (e.g., ModelSim, VCS) and the GRM in lockstep. The architectural states $(\mathrm{PC}, \mathrm{ARCH\_REGS}, \mathrm{MEM})$ of the two are compared after each instruction. Any mismatch is flagged as a potential hardware bug and pinpoints the offending instruction. Because the GRM operates at functional granularity, it does not catch microarchitectural or performance bugs, but it detects functional deviations that surface in architectural state, including illegal instruction decoding, register corruption, and hidden FSM transitions (Wu et al., 25 Dec 2025, Tyagi et al., 2022).
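The lockstep comparison can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ArchState` layout, the `first_mismatch` helper, and the toy one-register "ISA" in which an instruction is just an integer added to register 0. A real setup would wrap an RTL simulator and an emulator such as Spike behind the two step functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ArchState:
    """Architectural state after one instruction (hypothetical layout)."""
    pc: int
    regs: Tuple[int, ...]
    mem: Tuple[Tuple[int, int], ...]  # sorted (address, value) pairs


# A step function maps (state, instruction) to the next architectural state.
Step = Callable[[ArchState, int], ArchState]


def first_mismatch(program: List[int], init: ArchState,
                   dut_step: Step, grm_step: Step) -> Optional[int]:
    """Run DUT and GRM in lockstep; return the index of the first
    instruction after which their states differ, or None if they agree."""
    dut, grm = init, init
    for k, insn in enumerate(program):
        dut = dut_step(dut, insn)
        grm = grm_step(grm, insn)
        if dut != grm:
            return k
    return None


def grm_step(s: ArchState, insn: int) -> ArchState:
    # Toy reference semantics: add the immediate to register 0, advance PC.
    return ArchState(s.pc + 4, (s.regs[0] + insn,) + s.regs[1:], s.mem)


def buggy_dut_step(s: ArchState, insn: int) -> ArchState:
    # Injected bug for the demo: instruction 7 drops its register write.
    if insn == 7:
        return ArchState(s.pc + 4, s.regs, s.mem)
    return grm_step(s, insn)


init = ArchState(pc=0, regs=(0, 0), mem=())
print(first_mismatch([1, 2, 7, 3], init, buggy_dut_step, grm_step))  # 2
```

Comparing after every instruction, rather than only at the end of the test, is what lets the mismatch index point directly at the instruction that triggered the divergence.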

2. Coverage Metrics and Formalization

GoldenFuzz employs coverage metrics that model the complete behavioral envelope of RTL designs, formalized to drive both broad and deep exploration:

Statement coverage: $C_\text{stmt}$ contains one point for each RTL source line.

$\mathrm{Cov}_\text{stmt}(\mathcal{S}) = \frac{|\cup_{X\in\mathcal{S}} \text{hitLines}(X)|}{|C_\text{stmt}|}$

Branch coverage: Each “if/when” construct $b$ has true and false arms, so $C_\text{branch} = B\times\{T, F\}$.

$\mathrm{Cov}_\text{branch}(\mathcal{S}) = \frac{|\{(b,v): \exists X, b\text{ evaluated to }v\text{ in }X\}|}{2|B|}$
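As a rough illustration of the two ratios above, the following sketch computes statement and branch coverage over a set of tests. The function names and the input encoding (sets of hit line numbers, sets of `(branch_id, outcome)` pairs) are assumptions for the example, not the format any particular simulator emits.

```python
from typing import Iterable, Set, Tuple


def statement_coverage(hit_lines_per_test: Iterable[Set[int]],
                       total_lines: int) -> float:
    """Fraction of RTL source lines hit by at least one test."""
    covered = set().union(*hit_lines_per_test)
    return len(covered) / total_lines


def branch_coverage(outcomes_per_test: Iterable[Set[Tuple[str, bool]]],
                    branches: Set[str]) -> float:
    """Fraction of (branch, arm) points observed; two arms per branch."""
    seen = set().union(*outcomes_per_test)
    seen = {(b, v) for (b, v) in seen if b in branches}
    return len(seen) / (2 * len(branches))


# Two tests together hit lines {1, 2, 3, 4} out of 8.
print(statement_coverage([{1, 2, 3}, {3, 4}], total_lines=8))  # 0.5

# b0 is seen on both arms, b1 only on its true arm: 3 of 4 points.
print(branch_coverage([{("b0", True)}, {("b0", False), ("b1", True)}],
                      branches={"b0", "b1"}))  # 0.75
```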

Condition coverage: For each Boolean sub-expression $e \in E$, $C_\text{cond}$ includes $e\to 0$ and $e\to 1$ if all inputs are exercised.

$\mathrm{Cov}_\text{cond}(\mathcal{S}) = \frac{\sum_{e\in E} \sum_{v\in\{0,1\}} \mathbf{1}[e\to v \text{ hit}]}{2|E|}$

Expression coverage: For arbitrary-width expressions, a point for each possible input vector/output combination.

FSM coverage: For an $n$-bit FSM $q$ with $2^n$ states:

$\mathrm{Cov}_\text{fsm}(\mathcal{S}) = \frac{|\text{distinct states}| + |\text{distinct transitions}|}{2^n + (2^n\cdot 2^n)}$
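A small worked sketch of the FSM ratio, assuming each test yields a trace of state encodings sampled once per cycle (the `fsm_coverage` name and trace format are hypothetical). Self-loops are counted as transitions here, since the denominator $2^n \cdot 2^n$ includes them.

```python
from typing import Iterable, List, Set, Tuple


def fsm_coverage(traces: Iterable[List[int]], n_bits: int) -> float:
    """(distinct states + distinct transitions) / (2^n + 2^n * 2^n)."""
    states: Set[int] = set()
    transitions: Set[Tuple[int, int]] = set()
    for trace in traces:
        states.update(trace)
        # Consecutive samples form (from_state, to_state) pairs.
        transitions.update(zip(trace, trace[1:]))
    n_states = 2 ** n_bits
    return (len(states) + len(transitions)) / (n_states + n_states * n_states)


# 2-bit FSM: states {0, 1, 2}, transitions {(0,1), (1,2), (1,1)}.
# Coverage = (3 + 3) / (4 + 16) = 0.3
print(fsm_coverage([[0, 1, 2], [0, 1, 1]], n_bits=2))
```

Because the denominator counts every encoding and every ordered pair, including states or transitions the design can never reach, 100% may not be attainable for a given FSM under this definition.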

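Toggle coverage: For each wire/flip-flop $w \in W$, all transitions between distinct values in $V = \{0, 1, Z\}$ (binary and tristate toggles), giving six points per signal:

$\mathrm{Cov}_\text{tgl}(\mathcal{S}) = \frac{\sum_{w\in W} \sum_{u, v\in V,\, u\neq v} T_w(u\to v)}{6\,|W|}$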

These metrics ensure comprehensive exploration of both logic and “analog-ish” behaviors (e.g., signal transitions and high-impedance/floating wires) (Tyagi et al., 2022).
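Toggle coverage can be sketched in the same style. The example assumes each signal takes one of three values, `"0"`, `"1"`, or `"Z"` (high impedance), so there are six ordered transitions per signal, which matches the factor of 6 in the denominator. The waveform format and the `toggle_coverage` name are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def toggle_coverage(waveforms: Dict[str, List[str]]) -> float:
    """Fraction of (signal, old_value, new_value) toggles observed,
    out of 6 possible toggles per signal over the values 0, 1, Z."""
    hit: Set[Tuple[str, str, str]] = set()
    for wire, samples in waveforms.items():
        for old, new in zip(samples, samples[1:]):
            if old != new:
                hit.add((wire, old, new))
    return len(hit) / (6 * len(waveforms))


# Signal "a" shows 0->1, 1->0, and 0->Z; signal "b" never toggles.
# Coverage = 3 / (6 * 2) = 0.25
print(toggle_coverage({"a": ["0", "1", "0", "Z"], "b": ["1", "1", "1"]}))
```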

3. Two-Stage Fuzzing Algorithm and Policy Optimization

GoldenFuzz advances prior approaches by separating test-case refinement and coverage-driven exploration:

Stage 1: GRM-Level Fuzzing (Policy Refinement)

Rapidly generates instruction blocks (IBs) using a GPT-2-based LLM (1.5B parameters, opcode/operand retokenization).
Policy is updated via Simple Preference Optimization (SimPO), a reference-free variant of Direct Preference Optimization, on preference pairs (winner/loser blocks) to maximize the validity rate (i.e., blocks that are ISA-valid and do not raise exceptions when run on the GRM).
Low computational overhead: GRM execution takes less than 0.004 s per test, allowing high-throughput policy refinement.
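For orientation, the SimPO objective on a single preference pair can be written in a few lines. This is a sketch of the published SimPO loss (length-normalized log-likelihood as the implicit reward, with a target margin), not GoldenFuzz's training code. The `beta` and `gamma` defaults are placeholder values, and the per-token log-probabilities would come from the GPT-2 policy in practice.

```python
import math
from typing import List


def simpo_loss(logp_winner: List[float], logp_loser: List[float],
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss for one (winner, loser) pair.

    Each argument holds the policy's per-token log-probabilities for one
    instruction block. beta and gamma are hypothetical hyperparameters.
    """
    # Implicit reward: beta times the average token log-probability.
    r_w = beta * sum(logp_winner) / len(logp_winner)
    r_l = beta * sum(logp_loser) / len(logp_loser)
    margin = r_w - r_l - gamma
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss is lower when the policy already prefers the winner.
good = simpo_loss([-0.1, -0.1], [-2.0, -2.0])
bad = simpo_loss([-2.0, -2.0], [-0.1, -0.1])
print(good < bad)  # True
```

In Stage 1 the winner would be a block that executes cleanly on the GRM and the loser one that is invalid or raises an exception, so minimizing this loss pushes the policy toward valid blocks without needing a separate reference model.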

Stage 2: DUT-Level Coverage-Guided Fuzzing

Concatenates five instruction blocks of six instructions each (30 instructions in total; an ablation study found the 5×6 configuration to be optimal).
Uses a feedback-driven scoring mechanism for each block transition, with intra-test and inter-test scoring:

$S(b_i \oplus b_{i+1}) = \sum_{x\in\mathcal{G}(b_i, b_{i+1})} w'(x)$

Updates policy by applying SimPO to preference pairs sorted by $S$-score, directly increasing the likelihood of generating high-coverage blocks.
Differentially compares full traces of DUT and GRM at each instruction, automatically recording mismatches.

A feedback memory $\mathcal{M}$ retains IBs and their scoring outcomes so that the policy keeps exploring and does not collapse onto a narrow set of blocks. Key subroutines (e.g., weighted seed selection via ILP (Tyagi et al., 2022)) and mutators are adopted from state-of-the-art fuzzing literature.
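The Stage 2 scoring and pair-construction steps can be sketched as follows. Several details are assumptions, since the text does not spell them out: $\mathcal{G}(b_i, b_{i+1})$ is taken to be the set of coverage points credited to a block transition, $w'(x)$ a per-point weight, and preference pairs are formed by matching the best-scoring candidates against the worst-scoring ones. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def transition_score(points: Set[str], weight: Dict[str, float]) -> float:
    """S(b_i (+) b_{i+1}): sum of weights over the credited coverage points."""
    return sum(weight.get(x, 0.0) for x in points)


def preference_pairs(scored: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Rank candidates by score and pair top half against bottom half.

    Returns (winner_id, loser_id) pairs; ties are skipped because they
    carry no preference signal.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    half = len(ranked) // 2
    winners = ranked[:half]
    losers = list(reversed(ranked[len(ranked) - half:]))
    return [(w[0], l[0]) for w, l in zip(winners, losers) if w[1] > l[1]]


weight = {"c1": 1.0, "c2": 0.5, "c3": 2.0}
candidates = {"A": {"c1", "c3"}, "B": {"c2"}, "C": set(), "D": {"c1", "c2"}}
scored = [(name, transition_score(pts, weight))
          for name, pts in candidates.items()]
print(preference_pairs(scored))  # [('A', 'C'), ('D', 'B')]
```

The resulting pairs would then feed the same SimPO update used in Stage 1, with coverage gain rather than validity deciding which block wins.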

4. Experimental Evaluation on RISC-V and OpenRISC Designs

GoldenFuzz is validated across multiple open-source cores:

RocketChip (RISC-V, in-order), BOOM (RISC-V, out-of-order), CVA6 (RISC-V, in-order), Ariane (RISC-V), mor1kx, and or1200 (OpenRISC).
Hardware setups feature high-core-count Xeon servers with model-based simulation backends (ModelSim, Synopsys VCS).

Coverage Results:

On RocketChip, GoldenFuzz surpasses the prior state of the art (Cascade, DifuzzRTL, TheHuzz, ChatFuzz) by +3.7% in condition coverage and matches it in FSM coverage. On BOOM and CVA6, the margin is +4.2% to +6.5% (Wu et al., 25 Dec 2025).
Even with short test cases (30 instructions), coverage exceeds that of 10,000-instruction test cases from competing tools.
End-to-end speedup is 5×–10×, attributed to the fast GRM-based refinement phase and parallelized DUT simulation.

Bug Discovery:

All known hardware vulnerabilities in RocketChip, BOOM, and CVA6 were rediscovered.
Five new open-source vulnerabilities found in CVA6, four of high criticality (CVSS v3 > 7), plus two in the commercial BA51-H core extension.

The bugs detected with GoldenFuzz and their severity scores are summarized below:

Processor	Bug Description	CVSS/Severity
CVA6	MBE/SBE endianness ignored	7.5
CVA6	STI masking error	7.6
CVA6	stval mis-report on HFENCE.GVMA	5.5
CVA6	CSR access control flaw	7.6
BA51-H	Two confidential bugs (CVE-2025-45883/45881)	Unreleased/High

This table reports only the details disclosed in (Wu et al., 25 Dec 2025).

5. Comparative Effectiveness Against Prior Fuzzers and Formal Verification

In direct comparison to random regression, DifuzzRTL, and TheHuzz, GoldenFuzz:

Reaches equivalent coverage 1.98× faster than random regression and 3.33× faster than DifuzzRTL on classic RTL cores (Tyagi et al., 2022).
Outperforms all baselines in both new state coverage and bug-finding ability.
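
A "speedup to equivalent coverage" figure of this kind is typically the ratio of the times two tools need to reach the same coverage target. The sketch below shows that calculation; the coverage curves are made-up illustrative numbers, not measurements from either paper, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A coverage curve is a time-ordered list of (elapsed_time, coverage) samples.
Curve = List[Tuple[float, float]]


def time_to_coverage(curve: Curve, target: float) -> Optional[float]:
    """First sampled time at which coverage reaches the target, if ever."""
    for t, cov in curve:
        if cov >= target:
            return t
    return None


def speedup(baseline: Curve, candidate: Curve, target: float) -> Optional[float]:
    """Baseline time divided by candidate time to reach the same target."""
    t_base = time_to_coverage(baseline, target)
    t_cand = time_to_coverage(candidate, target)
    if t_base is None or t_cand is None:
        return None  # one of the tools never reached the target
    return t_base / t_cand


# Illustrative data only (hours, coverage fraction).
baseline = [(1.0, 0.2), (2.0, 0.4), (4.0, 0.6), (8.0, 0.7)]
candidate = [(1.0, 0.4), (2.0, 0.6), (4.0, 0.75)]
print(speedup(baseline, candidate, target=0.6))  # 2.0
```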

Versus formal verification with Cadence JasperGold:

For numerous bug classes (especially those involving FSMs or floating-wire toggles), the formal engine saturates its state space or exceeds its capacity within 30 minutes (Tyagi et al., 2022).
Property specification requires 2–4 person-days per property and yields only cycle-level bit-vectors, not human-readable instruction triggers.
GoldenFuzz exposes bugs spanning hundreds of modules and cross-cutting state interactions that are not practically tractable with formal property checking.

6. Strengths, Limitations, and Future Directions

Strengths:

The decoupled two-stage fuzzing model yields rapid, computationally efficient policy optimization, then focuses on deep design exploration.
The SimPO-driven GPT-2 policy robustly encodes ISA-level syntax and semantic rules, reducing dead-end test generation.
Blockwise, feedback-coupled test construction and short test sequences increase both convergence and interpretability.
Fully automatic pipeline requiring no hand-written RTL assertions or design-specific rules.

Limitations:

Dependence on the existence of an ISA-compliant and functionally accurate GRM for the target architecture.
White-box coverage instrumentation is required for effective metric collection and scoring (not always available in closed designs).
Manual analysis remains necessary for the triage of DUT vs. GRM mismatches, though automation and ML-based prioritization are suggested as future work.

Future Work:

Extension to ISAs other than RISC-V/OpenRISC (e.g., ARM, x86), contingent on the availability or synthesis of accurate GRMs.
Incorporation of Retrieval-Augmented Generation and formal specification material as context for the LLM.
Development of automated mismatch triage and sample clustering to further streamline manual analysis (Wu et al., 25 Dec 2025).

7. Impact and Significance in Hardware Security

GoldenFuzz bridges coverage-guided processor fuzzing with scalable, learning-driven test program generation, achieving high code and state coverage with low overhead and uncovering new, software-exploitable vulnerabilities in maturing open-source and commercial processors. By closing the gap between random testing, constrained regression, and non-scalable formal flows, GoldenFuzz establishes a new paradigm for rapid, high-impact functional validation and hardware bug discovery in processor design (Wu et al., 25 Dec 2025, Tyagi et al., 2022).