FeatureFuzz: Automated Fuzzing Framework
- FeatureFuzz is a dual-purpose fuzzing framework that generates tunable C programs and uses LLMs to drive semantic compiler testing.
- It allows fine-grained manipulation of program features via 10 configurable parameters, enabling targeted grey-box fuzzing and performance profiling.
- The framework integrates multiple LLMs to extract and recombine natural-language invariants, resulting in enhanced code coverage and unique crash discovery.
FeatureFuzz denotes two distinct but related advancements in the field of fuzzing: (1) an automated program generator for benchmarking fuzzers by fine-grained control over program features, and (2) a compiler fuzzing framework that leverages natural-language representations of semantic invariants mined from real bug reports, composed and instantiated with LLMs, to drive semantic bug discovery. Both innovations emphasize the explicit treatment, manipulation, and recombination of features—well-defined syntactic and semantic program properties—either to calibrate evaluation environments or to directly trigger deep compiler defects.
1. FeatureFuzz as a Programmable Benchmark Generator
Traditional fuzzing benchmarks (e.g., FuzzBench, Magma, LAVA-M) primarily report aggregate coverage or bug counts, rarely elucidating which individual program characteristics contribute to observed fuzzer behavior. FeatureFuzz addresses this deficit by enabling programmatically tunable feature manipulation: it synthesizes small C programs whose internal complexity can be precisely adjusted along axes reflecting control-flow and data-flow properties (Miao, 18 Jun 2025). Each generated program exposes a focused "knob" for stress-testing specific aspects of fuzzer exploration.
The methodology begins with a survey of 25 recent grey-box fuzzing studies to identify features most strongly implicated in fuzzer performance variability. Seven orthogonal program features are operationalized and parameterized, yielding 10 configuration parameters across both control-flow and data-flow domains.
Program Features and Generator Parameters
The seven core features are:
- Control-Flow Features:
  - Width: Number of sibling branches per conditional node.
  - Depth: Maximum nesting of conditionals (control dependence depth).
  - Weight: Per-branch reachability probability.
  - Buggy-Branch Position: Sibling index of the branch containing the bug.
- Data-Flow Features:
  - Magic-Byte Sequence (Start, Length): Position and length of input bytes required for bug exposure.
  - Checksum Tests (Count): Number of independent checksums required.
  - Nested Magic/Checksum (Depth_df): Nesting depth of data-flow tests.
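To make these feature definitions concrete, the following minimal Python sketch emits a small C target with nested branching of a given width and depth and a magic-byte check guarding a planted bug. The names `emit_program` and `_branches`, the byte-per-level dispatch scheme, and the use of `abort()` as the bug are assumptions for illustration, not the generator's actual output format; branch weight and checksum tests are not modeled here.

```python
def _branches(level, depth, width, bug_pos, sel_base, leaf, pad):
    """Emit one switch per nesting level; only the bug_pos arm nests deeper."""
    if level == depth:
        return [pad + line for line in leaf]
    out = [f"{pad}switch (buf[{sel_base + level}] % {width}) {{"]
    for i in range(width):
        out.append(f"{pad}case {i}:")
        if i == bug_pos:
            out += _branches(level + 1, depth, width, bug_pos,
                             sel_base, leaf, pad + "    ")
        out.append(f"{pad}    break;")
    out.append(f"{pad}}}")
    return out


def emit_program(width, depth, bug_pos, magic_start, magic):
    """Return C source for a toy target (hypothetical layout).

    One input byte per nesting level selects among `width` sibling branches;
    the innermost level compares `magic` against the input at `magic_start`.
    """
    if width < 2 or depth < 0 or not 0 <= bug_pos < width:
        raise ValueError("invalid control-flow parameters")
    if magic_start < 0 or not magic:
        raise ValueError("invalid magic-byte parameters")
    sel_base = magic_start + len(magic)  # selector bytes follow the magic bytes
    if sel_base + depth > 256:
        raise ValueError("magic bytes and selectors must fit in the buffer")
    magic_c = ", ".join(f"0x{b:02x}" for b in magic)
    leaf = [
        f"if (memcmp(buf + {magic_start}, magic, sizeof magic) == 0)",
        "    abort(); /* planted bug */",
    ]
    lines = [
        "#include <stdio.h>",
        "#include <stdlib.h>",
        "#include <string.h>",
        "",
        "int main(void) {",
        f"    static const unsigned char magic[] = {{ {magic_c} }};",
        "    unsigned char buf[256] = {0};",
        "    (void)fread(buf, 1, sizeof buf, stdin);",
    ]
    lines += _branches(0, depth, width, bug_pos, sel_base, leaf, "    ")
    lines += ["    return 0;", "}"]
    return "\n".join(lines) + "\n"
```

Calling `emit_program(width=4, depth=3, bug_pos=1, magic_start=16, magic=b"\xde\xad")` produces three nested `switch` statements with four arms each, so raising `depth` lengthens the chain of correct choices a fuzzer must make, while raising `width` adds dead-end siblings at every level.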
The generator exposes exactly 10 parameters—a 1:1 mapping for most features—enabling comprehensive, independent, and pairwise feature sweeps. Table 1 summarizes the mapping.
| Parameter | Feature Controlled | Range (Low→High) |
|---|---|---|
| Width | Sibling branches | 2 → 4 → 8 |
| Depth | Nesting level | 1 → 3 → 5 |
| Weight | Reach probability | 2 → 5 → 10 |
| BugPos | Bug branch index | 1 → → |
| Iteration | Loop iterations to bug | 1 → 5 → 10 |
| HasDataConstr | Loop with data constraint | false → true |
| Start | Magic-byte start offset | 0 → 16 → 32 |
| Length | Magic-byte length | 1 → 4 → 8 |
| Count | Number of checksums | 1 → 3 → 5 |
| Depth_df | Nested data-flow depth | 1 → 2 → 3 |
Through a factorial and pairwise-covering strategy, 153 programs are synthesized, enabling detailed performance profiling against eleven grey-box fuzzers.
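One way to picture independent and pairwise sweeps is sketched below in Python. Each parameter takes the levels listed in the table; single-parameter sweeps move one knob away from a baseline, and pairwise sweeps move two. The `LEVELS` dictionary, the all-low baseline, and the helper names are assumptions for illustration. BugPos is omitted because its middle and high levels are not given in the table, and this enumeration is not claimed to reproduce the 153-program suite.

```python
from itertools import combinations, product

# Parameter levels copied from the table (low -> high); BugPos omitted.
LEVELS = {
    "width": [2, 4, 8],
    "depth": [1, 3, 5],
    "weight": [2, 5, 10],
    "iteration": [1, 5, 10],
    "has_data_constr": [False, True],
    "start": [0, 16, 32],
    "length": [1, 4, 8],
    "count": [1, 3, 5],
    "depth_df": [1, 2, 3],
}


def baseline():
    """All parameters at their lowest level (an assumed reference point)."""
    return {name: values[0] for name, values in LEVELS.items()}


def single_sweeps():
    """Vary exactly one parameter away from the baseline."""
    for name, values in LEVELS.items():
        for value in values[1:]:
            cfg = baseline()
            cfg[name] = value
            yield cfg


def pairwise_sweeps():
    """Vary exactly two parameters away from the baseline, all combinations."""
    for a, b in combinations(LEVELS, 2):
        for va, vb in product(LEVELS[a][1:], LEVELS[b][1:]):
            cfg = baseline()
            cfg[a], cfg[b] = va, vb
            yield cfg
```

Keeping every other parameter at the baseline is what lets a change in time-to-bug be attributed to the one or two knobs being swept.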
2. FeatureFuzz as an LLM-Driven Compiler Fuzzer
FeatureFuzz also refers to a semantic logic recomposition framework for compiler fuzzing. It departs from mutational or grammar-based fuzzers by capturing and recombining semantic invariants directly linked to real-world compiler defects (He et al., 18 Jan 2026).
Formalization of Features
A feature consists of:
- A natural-language description of a semantic invariant (e.g., "array index exceeding bounds used in a conditional").
- A minimal code snippet exemplifying that invariant.
Decoupling the description from the snippet allows for high reusability and explicit semantic targeting, since the description captures properties that stay invariant under heavy compiler optimization.
Three-Stage Workflow
- Feature Extraction: Historical bug reports, PoCs, and fix histories are mined using an LLM (ExtractLLM) to curate an initial pool of independent features. Extraction is anchored to real, observed compiler bugs, biasing the feature set toward proven defect triggers.
- Feature Group Synthesis: Given a subset of the feature pool, a fine-tuned LLM (GroupLLM) predicts glue features and forms coherent, semantically consistent groups. Coherence is implicitly scored by LLM likelihoods and may be formalized via embedding similarity.
- Program Instantiation: InstanLLM emits a complete test program satisfying all feature constraints. This encompasses AST embedding, variable/linkage synthesis, and precise ordering to maintain feature dependencies. Output programs must realize all prescribed invariants simultaneously, maximizing semantic triggers.
Coverage-guided evolutionary loops elevate promising groups for further recomposition.
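The feature representation and the embedding-similarity view of group coherence can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Feature` class, the hand-written embedding vectors, and the choice of mean pairwise cosine similarity as the score are hypothetical stand-ins, and no LLM or embedding model is called.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt


@dataclass(frozen=True)
class Feature:
    description: str  # natural-language statement of the invariant
    snippet: str      # minimal code exemplifying it
    embedding: tuple  # hypothetical vector; in practice from an embedding model


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_coherence(group):
    """Mean pairwise cosine similarity; a singleton group scores 1.0."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 1.0
    return sum(cosine(a.embedding, b.embedding) for a, b in pairs) / len(pairs)
```

A score like this could rank candidate groups before instantiation, although the source describes coherence as scored implicitly by LLM likelihoods rather than by any explicit function.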
3. LLM Integration
FeatureFuzz integrates LLMs at each workflow stage to translate between natural-language invariants and code:
- ExtractLLM (Qwen2.5-max): Extracts high-level invariants from historical bug cases.
- GroupLLM (fine-tuned Qwen3-4B): Synthesizes logically consistent feature groups, introducing only auxiliary semantics.
- InstanLLM (Qwen3-32B): Generates final code, strictly constrained to maintain all group semantics.
Natural language serves as the intermediate representation for deep, flow-sensitive invariants, facilitating their preservation where syntactic or grammar-guided approaches would fail.
4. Evaluation Methodology and Empirical Findings
Benchmarking Fuzzer Performance
Under the benchmarking approach (Miao, 18 Jun 2025), each of eleven representative fuzzers was evaluated on the 153-program FeatureFuzz suite. Two key metrics are reported:
- Completion Rate (comp): Fraction of programs for which the bug was triggered within 30 minutes.
- Spearman’s rank correlation (ρ): Between a program parameter and time-to-bug, indicating feature sensitivity.
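Both metrics are straightforward to compute from per-program results. The plain-Python sketch below computes the completion rate against a 30-minute budget and Spearman's rank correlation as the Pearson correlation of average ranks, which handles ties. The function names and the convention of representing a timeout as `None` are assumptions for illustration; how the study treats timeouts when computing correlations is not stated here.

```python
from math import sqrt


def completion_rate(times, budget=1800.0):
    """Fraction of runs that hit the bug within `budget` seconds.

    `times` holds seconds-to-bug per program, or None for a timeout.
    """
    done = sum(1 for t in times if t is not None and t <= budget)
    return done / len(times)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

A positive value for a parameter such as depth means that programs with higher settings tend to take longer to crash; a negative value means the opposite.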
Key results for control-flow features (Spearman’s ρ between each parameter and time-to-bug):
| Fuzzer | Depth | Width | Weight |
|---|---|---|---|
| AFL++ | 0.894* | 0.872* | -0.507* |
| AFLFast | 0.517* | 0.010 | -0.307* |
| FairFuzz | 0.513* | 0.141* | -0.364* |
| EcoFuzz | 0.287* | -0.024 | -0.237* |
| RedQueen | 0.878* | — | -0.452* |
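(*) marks statistically significant correlations.
- Control-flow nesting depth exerts the greatest effect: it correlates positively with time-to-bug for every fuzzer listed (Spearman’s ρ from 0.287 to 0.894).
- Sibling width yields little impact for most fuzzers (completion 100%, even at high values); AFL++ is the exception, with ρ = 0.872.
- Branch reachability weight correlates negatively with time-to-bug for all listed fuzzers (ρ from -0.237 to -0.507), so bugs behind lower-weight branches take longer to reach.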
For data-flow features, longer magic-byte sequences and deeply nested checksum tests degrade performance for mutational fuzzers but are less problematic for those incorporating SMT or taint-tracking solvers.
Compiler Fuzzing Results
Under direct compiler fuzzing (He et al., 18 Jan 2026), FeatureFuzz was evaluated on GCC and LLVM against state-of-the-art baselines. Key metrics:
- 24h on GCC: 420.2K lines covered (+24.27% over MetaMut).
- 24h on LLVM (Clang): 232.3K lines covered (+3.3% over best baseline).
- Unique crashes: 167 across GCC and LLVM combined (2.78× MetaMut's 60).
- 72h long campaign: 106 newly discovered bugs in GCC/LLVM, 76 confirmed by compiler developers.
| Fuzzer | GCC Coverage | LLVM Coverage | Unique Crashes (G+L) |
|---|---|---|---|
| MetaMut | 338.1K | 224.0K | 60 |
| Mut4All | 330.5K | 220.5K | 11 |
| Fuzz4All | 300.2K | 210.3K | 12 |
| YARPGen | 295.8K | 205.7K | 0 |
| LegoFuzz | 312.4K | 215.6K | 0 |
| FeatureFuzz | 420.2K | 232.3K | 167 |
Among the compared fuzzers, FeatureFuzz attains the highest line coverage on both compilers and by far the most unique crashes.
5. Strengths, Limitations, and Design Implications
Strengths
- Direct encoding and recombination of semantic invariants preserves critical bug triggers, which remain intact even under aggressive compiler optimizations (He et al., 18 Jan 2026).
- LLM-driven group synthesis yields high syntactic and semantic coherence, with compilation success exceeding 97%.
- Feature-recombination enables extensive exploration of both front-end and back-end compiler bugs.
Benchmarking demonstrates the capacity to isolate and interpret fuzzer weaknesses against independently tunable feature axes, illuminating actionable research directions (Miao, 18 Jun 2025).
Limitations
- Effectiveness rests on the completeness of historical bug corpora and the extraction accuracy of LLMs; rare or unrecorded invariants may elude feature encoding.
- CPU and GPU resources required for group synthesis and instantiation (GroupLLM, InstanLLM) may be substantial in large-scale or time-constrained settings.
- Not all semantic invariants map neatly into natural-language descriptions.
A plausible implication is that further LLM pretraining and expanded bug datasets will broaden the semantic reach of this methodology.
6. Broader Impact and Recommendations
FeatureFuzz introduces a paradigm shift by treating semantic logic recomposition as central in both fuzzing evaluation and generative test-case construction. As LLMs increase in capability and access to rich bug corpora improves, this approach is poised for generalization across program analysis, synthesis systems, and transformation frameworks where precise semantic triggers are essential for uncovering deep, real-world defects (He et al., 18 Jan 2026).
Recommendations emerging from empirical analysis include standardizing feature-sensitivity profiling in fuzzing research, incorporating program-feature generators such as FeatureFuzz into benchmarking suites, and converging on parameter vocabularies for reproducible, interpretable, and head-to-head fuzzer evaluation (Miao, 18 Jun 2025).