PBFuzz: Agentic Directed Fuzzing for PoV
- PBFuzz is an agentic directed fuzzing framework that synthesizes PoV inputs by integrating LLM-driven semantic reasoning and property-based testing.
- It features a four-phase workflow with PLAN, IMPLEMENT, EXECUTE, and REFLECT stages to refine vulnerability hypotheses and enforce constraints.
- Evaluations on the Magma benchmark show improved coverage and faster time-to-exposure, outperforming conventional greybox and LLM-assisted fuzzers.
PBFuzz is an agentic directed fuzzing framework for Proof-of-Vulnerability (PoV) input generation, introduced to address the dual challenge of reaching and triggering complex vulnerabilities in software targets. It systematically incorporates LLM-driven semantic code reasoning, custom program analysis, persistent memory management to avoid hypothesis drift, and high-throughput property-based testing for efficient constraint satisfaction. PBFuzz demonstrated significant improvements in vulnerability exposure efficiency and coverage on the Magma benchmark, outperforming conventional greybox, LLM-assisted, and random mutation-based approaches (Zeng et al., 4 Dec 2025).
1. Formalization of PoV Generation
In the PoV input generation problem, the objective is to synthesize an input $x$ for a program $P$ such that the execution $P(x)$ (i) reaches a designated vulnerable code location $\ell$ and (ii) causes the program state $\sigma_\ell$ at $\ell$ to violate a target safety predicate $\varphi$.
These requirements are formalized as: a PoV input $x$ must satisfy
$$\mathrm{Reach}(P, x, \ell) \;\wedge\; \neg\varphi\big(\sigma_\ell(P, x)\big).$$
The “reachability constraints” refer to the requirements on $x$ to achieve $\mathrm{Reach}(P, x, \ell)$, and the “triggering constraints” refer to the requirements for violating $\varphi$ once $\ell$ is reached. This formal separation matters because many fuzzers can solve one but not both classes of constraints, particularly in scenarios involving deep semantic dependencies or intricate input formats.
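The two conditions can be phrased as executable predicates. The sketch below is purely illustrative (none of these names belong to PBFuzz's API): a trace records which locations an execution visited and the state observed there, and an input is a PoV exactly when it both reaches the target and violates the safety predicate at it.

```python
# Illustrative sketch: the two PoV conditions as executable predicates.
# TraceResult, is_pov, and the toy predicate are invented for this example.
from dataclasses import dataclass, field

@dataclass
class TraceResult:
    visited: set = field(default_factory=set)   # code locations reached
    state: dict = field(default_factory=dict)   # observed state per location

def is_pov(trace: TraceResult, target_loc: str, safety_predicate) -> bool:
    """An input is a PoV iff execution reaches target_loc (reachability)
    AND the state there violates the safety predicate (triggering)."""
    reached = target_loc in trace.visited
    triggered = reached and not safety_predicate(trace.state.get(target_loc, {}))
    return reached and triggered

# Toy safety predicate: "index stays within the buffer".
in_bounds = lambda s: s.get("idx", 0) < s.get("buf_len", 0)
trace = TraceResult(visited={"parse", "copy"},
                    state={"copy": {"idx": 12, "buf_len": 8}})
```

Note that reaching the location without violating the predicate (or vice versa) fails the conjunction, which is exactly why the two constraint classes must be solved together.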
2. Key Challenges in Agentic PoV Generation
PBFuzz was designed to overcome four principal challenges that limit the reach and reliability of autonomous PoV generators:
- Dynamic Semantic-Level Program Reasoning: Automatically extracting context-sensitive, high-level constraints on input and program state, akin to the hypothesis-building process of skilled human analysts.
- Custom, On-Demand Program Analysis: Enabling fine-grained queries for control/data dependencies and execution tracing (e.g., “Which call sites may reach the target vulnerable location?”, “Where did execution diverge from a feasible path?”).
- Persistent, Monotonic Memory: Employing a unified, append-only knowledge store to preserve and refine evolving hypotheses, thus protecting against “hypothesis drift” common in iterative AI systems.
- Constraint Solving via Structure-Preserving Property-Based Testing (PBT): Systematic and efficient exploration of semantic parameter spaces while respecting input structure invariants.
These challenges form the design rationale behind PBFuzz’s agentic, multi-layered architecture (Zeng et al., 4 Dec 2025).
3. Architecture and Multi-Phase Workflow
PBFuzz implements a vertically organized four-layer design and executes as a state-machine across four main phases (PLAN, IMPLEMENT, EXECUTE, REFLECT):
- Brain: An LLM-based agent responsible for code reasoning, hypothesis generation, and structured memory management.
- MCP Tools: Stateless analysis servers offering call graph navigation, corpus reachability, execution trace deviation checking, generator APIs, fuzzing orchestration, and debugger integration.
- Workflow Manager: Enforces phase-specific permissions and transitions, mediating agent operations.
- Memory Layer: A monotonic, append-only Markdown file (`workflow_state.md`) structured into distinct, phase-gated JSON blocks (BugPredicates, Preconditions, TriggerPlans, FuzzPlans, etc.).
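A sketch of what the phase-gated blocks inside `workflow_state.md` might contain. The block names (BugPredicates, Preconditions, TriggerPlans) come from the paper; every field name and value below is invented for illustration:

```json
{
  "BugPredicates": [
    {
      "id": "BP-1",
      "target_site": "parser.c:decode_chunk",
      "safety_predicate": "bytes_written <= sizeof(out_buf)",
      "root_cause": "buffer_overflow"
    }
  ],
  "Preconditions": [
    {
      "id": "PC-1",
      "for": "BP-1",
      "constraint": "declared chunk length exceeds actual payload size",
      "status": "hypothesized"
    }
  ],
  "TriggerPlans": [
    {
      "id": "TP-1",
      "uses": ["PC-1"],
      "strategy": "inflate the length field while keeping the header well-formed"
    }
  ]
}
```

The `status` field illustrates the monotonic discipline: entries move from hypothesized to validated but are never deleted.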
The state-machine iterates as follows:
```
PLAN → IMPLEMENT → EXECUTE → { SUCCESS or REFLECT }
  ↑                                      │
  └────────────── REFLECT ←──────────────┘
```
In each loop:
- PLAN: The agent extracts and formalizes semantic reachability and triggering constraints.
- IMPLEMENT: Constraints are transformed into parameterized input generators and a PBT configuration.
- EXECUTE: A two-stage fuzzer assesses concrete and randomly sampled parameter sets.
- REFLECT: Execution failures are analyzed (using `detect_deviation` and `launch_gdb`), hypotheses are updated, and planning resumes.
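The loop above can be sketched as a small state machine in which a workflow manager gates every transition. This is a simplification, not PBFuzz's implementation: the phase names follow the paper, while `run_workflow` and its `execute` callback (standing in for the two-stage fuzzer) are invented here.

```python
# Minimal sketch of the PLAN → IMPLEMENT → EXECUTE → REFLECT loop.
# Phase names follow the paper; the functions are illustrative placeholders.
ALLOWED = {
    "PLAN": {"IMPLEMENT"},
    "IMPLEMENT": {"EXECUTE"},
    "EXECUTE": {"SUCCESS", "REFLECT"},
    "REFLECT": {"PLAN"},
}

def transition(current: str, nxt: str) -> str:
    """Workflow-manager check: only phase-appropriate transitions are allowed."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt

def run_workflow(execute, max_iters: int = 5) -> str:
    """Iterate the loop until EXECUTE reports success or the budget runs out.
    `execute` stands in for the two-stage fuzzer; it returns True on a PoV."""
    phase = "PLAN"
    for _ in range(max_iters):
        phase = transition(phase, "IMPLEMENT")
        phase = transition(phase, "EXECUTE")
        if execute():
            return transition(phase, "SUCCESS")
        phase = transition(phase, "REFLECT")
        phase = transition(phase, "PLAN")
    return phase
```

Encoding the legal transitions as data makes phase enforcement a single membership check, mirroring how the Workflow Manager mediates agent operations.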
4. Core Algorithms and Procedures
Semantic Constraint Extraction (PLAN)
The agent initializes a set of BugPredicates: for each candidate vulnerable site and its associated safety predicate, program slicing and call-graph traversals distill the reachability requirements. Root causes (buffer_overflow, integer_overflow, etc.) are classified and associated with preconditions, and TriggerPlans enumerate strategies that reference subsets of those preconditions and root causes.
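The reachability side of this step can be illustrated with a plain backward call-graph traversal, a deliberate simplification of the paper's slicing and traversal machinery; the toy call graph below is hypothetical.

```python
# Illustrative sketch: which functions can reach the vulnerable site via
# some call chain? A backward BFS over the call graph answers this.
from collections import deque

def callers_reaching(call_graph: dict, target: str) -> set:
    """Return all functions from which some call chain reaches `target`.
    `call_graph` maps caller -> list of callees."""
    # Invert the edges so we can walk from the target back toward entry points.
    reverse: dict = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)
    seen, queue = {target}, deque([target])
    while queue:
        fn = queue.popleft()
        for caller in reverse.get(fn, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen - {target}

# Toy graph: main -> parse -> decode_chunk; main -> render
g = {"main": ["parse", "render"], "parse": ["decode_chunk"], "render": []}
```

Here `callers_reaching(g, "decode_chunk")` yields `{"parse", "main"}`: the call chains an input must drive execution down to satisfy the reachability constraints.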
Hypothesis Refinement (PLAN ↔ REFLECT)
After EXECUTE, outcomes (“reached”, “triggered”; breakpoint state) are logged. REFLECT uses deviation analysis to identify violated preconditions; hypotheses are weakened or revised accordingly. When triggers fail, further analysis (back-slicing; debugger inspection) refines preconditions or TriggerPlans. All validated knowledge is only appended or upgraded in memory (monotonicity).
Structure-Preserving PBT (IMPLEMENT, EXECUTE)
The agent constructs a ParameterSpace for semantic parameters (e.g., buffer_size, nesting_depth), specifying type and domains. Python generators (generate(**params) → bytes) ensure test inputs maintain all structural invariants (e.g., XML or specific binary encodings). ConcreteParameters (5–10 agent-selected tuples) are executed sequentially, followed by random and boundary-aware sampling; the PBT engine emphasizes coverage of edge cases without invalidating structure.
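As a concrete illustration of a structure-preserving generator, the sketch below builds a length-prefixed binary record whose framing stays well-formed for any sampled parameters. The chunk format and parameter names are invented for this example; only the `generate(**params) → bytes` convention is from the paper.

```python
# Illustrative structure-preserving PBT generator for a toy binary format.
import random
import struct

# ParameterSpace analogue: semantic knobs with typed domains.
PARAMETER_SPACE = {
    "payload_len": range(0, 65536),    # bytes of payload actually emitted
    "declared_len": range(0, 65536),   # length written into the header field
}

def generate(payload_len: int, declared_len: int) -> bytes:
    """Toy length-prefixed chunk: 4-byte big-endian length field, 4-byte tag,
    then the payload. The framing is always well-formed; only the semantic
    relation declared_len vs payload_len varies, which is the kind of
    mismatch a triggering constraint might require."""
    header = struct.pack(">I4s", declared_len, b"DATA")
    return header + bytes(payload_len)

def sample_parameters(rng: random.Random) -> dict:
    """Random plus boundary-aware sampling over the parameter space."""
    boundaries = [0, 1, 65535]
    return {
        name: rng.choice(boundaries + [rng.choice(domain)])
        for name, domain in PARAMETER_SPACE.items()
    }
```

Because every output parses as a valid chunk, the fuzzer spends its budget on the semantic parameter space rather than on inputs a parser rejects at the first length check.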
5. Data Structures and Hypothesis-Drift Avoidance
All agent state is recorded in a single Markdown file, organized into phase-gated JSON blocks covering:
| Phase | JSON Blocks | Purpose |
|---|---|---|
| PLAN | BugPredicates, Preconditions, RootCauses, TriggerPlans | Store constraints and root strategies |
| IMPLEMENT | ParameterSpace, FuzzPlan, Breakpoints | Manage input gen/fuzzing configuration |
| EXECUTE | Metrics | Log execution and exposure statistics |
Workflow enforcement is managed by MCP calls (`write_workflow_block`, `transition_phase`), guaranteeing phase-appropriate, monotonic updates and precluding deletion of validated constraints. This memory model is essential for preventing the agent from “unlearning” validated knowledge (Zeng et al., 4 Dec 2025).
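The monotonicity guarantee can be pictured as an append-only store in which a write may add or upgrade an entry but never delete or downgrade one. This is a sketch of the idea only; the summary does not specify `write_workflow_block`'s actual behavior beyond monotonicity, and all names below are illustrative.

```python
# Illustrative append-only memory: entries can be added or upgraded
# (e.g. 'hypothesized' -> 'validated') but never removed or downgraded.
class MonotonicMemory:
    _RANK = {"hypothesized": 0, "validated": 1}

    def __init__(self):
        self.blocks: dict = {}

    def write_block(self, block: str, entry_id: str, entry: dict) -> None:
        """Add a new entry or upgrade an existing one; downgrades raise."""
        entries = self.blocks.setdefault(block, {})
        old = entries.get(entry_id)
        if old is not None:
            if self._RANK[entry["status"]] < self._RANK[old["status"]]:
                raise ValueError(f"cannot downgrade {entry_id}")
        entries[entry_id] = entry

    def delete(self, *_args):
        """Deletion is forbidden by design: validated knowledge persists."""
        raise NotImplementedError("append-only store: deletion is not allowed")
```

Refusing downgrades at the storage layer is what makes hypothesis drift structurally impossible, rather than merely discouraged by prompting.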
6. Experimental Evaluation and Comparative Performance
PBFuzz was evaluated on the Magma benchmark (129 CVEs, 361 targets, 9 open-source projects). The experimental protocol compared PBFuzz to directed greybox fuzzers (AFLGo, SelectFuzz), coverage-guided AFL++ (±CmpLog), LLM-assisted fuzzers (G2Fuzz, Llamafuzz), and others. Baselines received 24h × 10 trials per target; PBFuzz ran a single 30-minute trial.
Metrics collected included CVE coverage, Time-to-Exposure (TTE), Time-to-Reach (TTR), reproducibility (10-run consistency), and LLM API token cost.
Results:
- PBFuzz triggered 57/129 CVEs (AFL++/CmpLog: 49; AFLGo: 41; G2Fuzz: 16).
- PBFuzz uniquely triggered 17 CVEs not exposed by other systems.
- Median TTE: PBFuzz, 339 s; AFL++/CmpLog, 8,680 s (25.6× faster).
- PBFuzz’s API cost: ≈2.18 million tokens/vulnerability (≈$1.83).
- On 23 selected CVEs, PBFuzz achieved success in all 10 runs for 18 of them, exceeding baseline consistency.
Time-to-Exposure Summary:
| System | Median TTE (s) | Max TTE (s) |
|---|---|---|
| PBFuzz | 339 | 1,441 |
| AFL++ + CmpLog | 8,680 | 85,322 |
| AFLGo | 6,170 | 85,200 |
| G2Fuzz | 5,400 | 68,400 |
PBFuzz’s structure-preserving approach and hypothesis-driven workflow allowed it to achieve higher coverage in less time and with modest resource consumption compared to mutation-based and pure LLM-based approaches (Zeng et al., 4 Dec 2025).
7. Limitations and Prospects
Identified limitations include:
- Incomplete handling of long-distance, implicit dependencies (e.g., XML attribute propagation), motivating future integration of taint or pointer analyses.
- Occasional failure to enumerate rare encoding variants in certain binary formats (e.g., ASN.1 DER), suggesting a role for domain-specific input heuristics.
- Bias toward well-formed input generation can cause missed exposure of vulnerabilities reliant on format violations, where traditional mutational fuzzers excel.
- Assumption of pre-existing fuzz harnesses and explicit bug predicates; further work could address automated harness and predicate synthesis.
- Potential extension to broader security tasks beyond PoV generation, such as validation of non-exploitable vulnerabilities and integration with end-to-end pipelines (e.g., AIxCC).
PBFuzz establishes a bridge between human-like semantic reasoning and scalable automated vulnerability discovery, demonstrating that agentic, memory-driven workflows with PBT can achieve both deep semantic constraint satisfaction and practical throughput (Zeng et al., 4 Dec 2025).