PBFuzz: Agentic Directed Fuzzing for PoV

Updated 19 January 2026
  • PBFuzz is an agentic directed fuzzing framework that synthesizes PoV inputs by integrating LLM-driven semantic reasoning and property-based testing.
  • It features a four-phase workflow with PLAN, IMPLEMENT, EXECUTE, and REFLECT stages to refine vulnerability hypotheses and enforce constraints.
  • Evaluations on the Magma benchmark show higher CVE coverage and faster time-to-exposure than conventional greybox and LLM-assisted fuzzers.

PBFuzz is an agentic directed fuzzing framework for Proof-of-Vulnerability (PoV) input generation, introduced to address the dual challenge of reaching and triggering complex vulnerabilities in software targets. It systematically incorporates LLM-driven semantic code reasoning, custom program analysis, persistent memory management to avoid hypothesis drift, and high-throughput property-based testing for efficient constraint satisfaction. PBFuzz demonstrated significant improvements in vulnerability exposure efficiency and coverage on the Magma benchmark, outperforming conventional greybox, LLM-assisted, and random mutation-based approaches (Zeng et al., 4 Dec 2025).

1. Formalization of PoV Generation

In the PoV input generation problem, the objective is to synthesize an input $x$ for a program $P$ such that the execution of $P(x)$ (i) reaches designated vulnerable code locations $L$ and (ii) causes the program state at $L$ to violate a target safety predicate $\varphi$.

These requirements are formalized as:

$$\mathit{Reach}_P(x,L)\;=\;[\exists\,\pi\text{ execution trace with }L\in\pi]$$

$$\mathit{Trigger}_P(x,L,\varphi)\;=\;[\sigma_L \models \varphi]$$

A PoV input $x$ must satisfy

$$\mathit{Reach}_P(x,L) = \text{true} \quad\wedge\quad \mathit{Trigger}_P(x,L,\varphi) = \text{true}.$$

The “reachability constraints” are the requirements on $x$ to achieve $\mathit{Reach}_P$, and the “triggering constraints” are the requirements for violating $\varphi$ once $L$ is reached. This formal separation is necessary because many fuzzers can solve one class of constraints but not both, particularly in scenarios involving deep semantic dependencies or intricate input formats.
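
The two predicates can be illustrated on a toy target. The sketch below is not PBFuzz code: `toy_program`, the location label `"L_copy"`, and the 8-byte buffer are hypothetical, and the predicate is written as a bug predicate that holds when the vulnerability is triggered, matching the $\sigma_L \models \varphi$ form.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Trace = List[Tuple[str, State]]  # (location label, program state at that location)


def toy_program(x: bytes) -> Trace:
    """Hypothetical target: a magic-byte check guards a copy into an 8-byte buffer."""
    trace: Trace = [("entry", {"len": len(x)})]
    if x[:2] != b"PB":  # reachability constraint: the input must start with the magic bytes
        return trace
    declared = x[2] if len(x) > 2 else 0
    trace.append(("L_copy", {"declared": declared, "capacity": 8}))
    return trace


def reach(trace: Trace, loc: str) -> bool:
    """Reach_P(x, L): some step of the execution trace is at location L."""
    return any(label == loc for label, _ in trace)


def trigger(trace: Trace, loc: str, phi: Callable[[State], bool]) -> bool:
    """Trigger_P(x, L, phi): the state observed at L satisfies phi."""
    return any(label == loc and phi(state) for label, state in trace)


def is_pov(x: bytes, loc: str, phi: Callable[[State], bool]) -> bool:
    """A PoV input must satisfy both Reach and Trigger."""
    trace = toy_program(x)
    return reach(trace, loc) and trigger(trace, loc, phi)


def overflow(state: State) -> bool:
    """Bug predicate: the declared length exceeds the buffer capacity."""
    return state["declared"] > state["capacity"]
```

Here `b"XX\x10"` fails the reachability constraint, `b"PB\x04"` reaches the copy without triggering, and `b"PB\x10"` satisfies both, which is the case a fuzzer that solves only one class of constraints would miss.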

2. Key Challenges in Agentic PoV Generation

PBFuzz was designed to overcome four principal challenges that limit the reach and reliability of autonomous PoV generators:

  1. Dynamic Semantic-Level Program Reasoning: Automatically extracting context-sensitive, high-level constraints on input and program state, akin to the hypothesis-building process of skilled human analysts.
  2. Custom, On-Demand Program Analysis: Enabling fine-grained queries for control/data dependencies and execution tracing (e.g., “Which call sites may reach $L$?”, “Where did execution diverge from a feasible path?”).
  3. Persistent, Monotonic Memory: Employing a unified, append-only knowledge store to preserve and refine evolving hypotheses, thus protecting against “hypothesis drift” common in iterative AI systems.
  4. Constraint Solving via Structure-Preserving Property-Based Testing (PBT): Systematic and efficient exploration of semantic parameter spaces while respecting input structure invariants.

These challenges form the design rationale behind PBFuzz’s agentic, multi-layered architecture (Zeng et al., 4 Dec 2025).
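
The on-demand query quoted in the second challenge, asking which call sites may reach a location, amounts to backward reachability over a call graph. A minimal sketch, assuming a hypothetical call graph given as a caller-to-callees dictionary; the function names are invented, and the real interface of PBFuzz's analysis tools is not described in this summary.

```python
from collections import deque
from typing import Dict, List, Set


def callers_reaching(call_graph: Dict[str, List[str]], target: str) -> Set[str]:
    """Return every function from which `target` is reachable through calls."""
    # Invert the caller -> callees map into callee -> callers.
    reverse: Dict[str, Set[str]] = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse.setdefault(callee, set()).add(caller)

    # Breadth-first search backwards from the target.
    seen: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in reverse.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


# Hypothetical call graph of a small parser.
graph = {
    "main": ["parse_file", "print_usage"],
    "parse_file": ["parse_header", "parse_chunk"],
    "parse_chunk": ["copy_payload"],
    "print_usage": [],
}
```

Calling `callers_reaching(graph, "copy_payload")` returns the three functions on the path from `main`, while `print_usage` is excluded because it never leads to the target.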

3. Architecture and Multi-Phase Workflow

PBFuzz implements a vertically organized four-layer design and executes as a state-machine across four main phases (PLAN, IMPLEMENT, EXECUTE, REFLECT):

  • Brain: An LLM-based agent responsible for code reasoning, hypothesis generation, and structured memory management.
  • MCP Tools: Stateless analysis servers offering call graph navigation, corpus reachability, execution trace deviation checking, generator APIs, fuzzing orchestration, and debugger integration.
  • Workflow Manager: Enforces phase-specific permissions and transitions, mediating agent operations.
  • Memory Layer: A monotonic Markdown file (workflow_state.md) structured in distinct, phase-gated JSON blocks (BugPredicates, Preconditions, TriggerPlans, FuzzPlans, etc.).

The state-machine iterates as follows:

PLAN → IMPLEMENT → EXECUTE → { SUCCESS or REFLECT }
               ↑                  ↓
             REFLECT ←------------

In each loop:

  • PLAN: The agent extracts and formalizes semantic reachability and triggering constraints.
  • IMPLEMENT: Constraints are transformed into parameterized input generators and a PBT configuration.
  • EXECUTE: A two-stage fuzzer assesses concrete and randomly sampled parameter sets.
  • REFLECT: Execution failures are analyzed (using detect_deviation, launch_gdb), hypotheses are updated, and planning resumes.
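
The loop can be sketched as a small state machine with an explicit transition table. This is an illustration only: the `Phase` enum, the `ALLOWED` table, and the `run` driver are hypothetical names, and the REFLECT-to-PLAN edge follows the description that planning resumes after reflection.

```python
from enum import Enum
from typing import Callable, Dict, List, Set


class Phase(Enum):
    PLAN = "PLAN"
    IMPLEMENT = "IMPLEMENT"
    EXECUTE = "EXECUTE"
    REFLECT = "REFLECT"
    SUCCESS = "SUCCESS"


# Assumed transition table: each phase maps to the phases it may move to.
ALLOWED: Dict[Phase, Set[Phase]] = {
    Phase.PLAN: {Phase.IMPLEMENT},
    Phase.IMPLEMENT: {Phase.EXECUTE},
    Phase.EXECUTE: {Phase.SUCCESS, Phase.REFLECT},
    Phase.REFLECT: {Phase.PLAN},
    Phase.SUCCESS: set(),
}


def transition(current: Phase, nxt: Phase) -> Phase:
    """Move to `nxt`, rejecting any edge missing from the table."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt


def run(execute: Callable[[int], bool], max_rounds: int = 5) -> List[Phase]:
    """Drive the loop; `execute(round_no)` returns True when a PoV is found."""
    phase = Phase.PLAN
    history = [phase]

    def step(nxt: Phase) -> None:
        nonlocal phase
        phase = transition(phase, nxt)
        history.append(phase)

    for round_no in range(max_rounds):
        step(Phase.IMPLEMENT)
        step(Phase.EXECUTE)
        step(Phase.SUCCESS if execute(round_no) else Phase.REFLECT)
        if phase is Phase.SUCCESS:
            break
        step(Phase.PLAN)
    return history
```

Keeping the legal edges in one table is what lets a workflow manager refuse, for example, a jump from PLAN straight to EXECUTE.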

4. Core Algorithms and Procedures

Semantic Constraint Extraction (PLAN)

The agent initializes a set of BugPredicates:

$$\texttt{BugPredicates} \ni \{\text{id: }BP_i,\; \varphi_i\}$$

For each site $L_i$ and predicate $\varphi_i$, program slicing and call-graph traversals distill reachability requirements:

$$\texttt{Preconditions} = \{R_j : \text{“must satisfy condition }C_j\text{ on input”}\}$$

Root causes (buffer_overflow, integer_overflow, etc.) are classified and associated with preconditions; TriggerPlans enumerate strategies referencing subsets of preconditions and root causes.
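
One way to picture the PLAN-phase records is as plain data classes. The field names and example values below are assumptions made for illustration; the source names the blocks (BugPredicates, Preconditions, TriggerPlans) but does not give their schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BugPredicate:
    id: str         # e.g. "BP1"
    location: str   # target site
    predicate: str  # condition over program state at that site


@dataclass(frozen=True)
class Precondition:
    id: str         # e.g. "R1"
    condition: str  # requirement on the input


@dataclass(frozen=True)
class TriggerPlan:
    id: str
    bug_predicate: str              # id of the BugPredicate this plan targets
    root_cause: str                 # e.g. "buffer_overflow"
    preconditions: Tuple[str, ...]  # ids of the preconditions the plan relies on


def unresolved_refs(plan: TriggerPlan, known: List[Precondition]) -> List[str]:
    """List precondition ids the plan references that have not been recorded."""
    known_ids = {p.id for p in known}
    return [ref for ref in plan.preconditions if ref not in known_ids]


# Hypothetical records for a length-field overflow.
bug = BugPredicate("BP1", "copy_payload", "declared_len > capacity")
preconditions = [
    Precondition("R1", "input starts with the magic bytes"),
    Precondition("R2", "declared length field exceeds 8"),
]
plan = TriggerPlan("TP1", bug.id, "buffer_overflow", ("R1", "R2", "R3"))
```

A consistency check such as `unresolved_refs` shows why plans reference preconditions by id: a plan that names a precondition nobody recorded (here `R3`) can be caught before any input is generated.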

Hypothesis Refinement (PLAN ↔ REFLECT)

After EXECUTE, outcomes (“reached”, “triggered”; breakpoint state) are logged. REFLECT uses deviation analysis to identify violated preconditions; hypotheses are weakened or revised accordingly. When triggers fail, further analysis (back-slicing; debugger inspection) refines preconditions or TriggerPlans. All validated knowledge is only appended or upgraded in memory (monotonicity).
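
Deviation analysis can be pictured as finding the first point where an observed trace leaves the path a hypothesis expected. The sketch below is a stand-in under stated assumptions, not the behavior of the detect_deviation tool: traces are modeled as lists of hypothetical basic-block labels.

```python
from typing import List, Optional, Tuple

Deviation = Tuple[int, str, Optional[str]]  # (index, expected label, observed label)


def first_deviation(expected: List[str], observed: List[str]) -> Optional[Deviation]:
    """Return the first mismatch between the expected path and the observed trace.

    Returns None when the observed trace follows the whole expected path.
    An observed label of None means execution stopped before that step.
    """
    for i, want in enumerate(expected):
        got = observed[i] if i < len(observed) else None
        if got != want:
            return i, want, got
    return None


# Hypothetical expected path to the vulnerable site and a failing run.
expected_path = ["entry", "check_magic", "parse_chunk", "copy_payload"]
failing_run = ["entry", "check_magic", "reject"]
```

For the failing run above, the result points at index 2, where `parse_chunk` was expected but `reject` was observed, which suggests revisiting whichever precondition governs the branch taken after `check_magic`.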

Structure-Preserving PBT (IMPLEMENT, EXECUTE)

The agent constructs a ParameterSpace for semantic parameters (e.g., buffer_size, nesting_depth), specifying type and domains. Python generators (generate(**params) → bytes) ensure test inputs maintain all structural invariants (e.g., XML or specific binary encodings). ConcreteParameters (5–10 agent-selected tuples) are executed sequentially, followed by random and boundary-aware sampling; the PBT engine emphasizes coverage of edge cases without invalidating structure.
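
A minimal sketch of the two-stage, structure-preserving sampling using only the standard library. The record format (magic bytes, one length byte, payload), the parameter names, and the `target` stub are hypothetical; a real setup would execute the instrumented target rather than a Python function.

```python
import random
import struct
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

ParameterSpace = Dict[str, Tuple[int, int]]  # name -> inclusive (low, high)
Params = Dict[str, int]

SPACE: ParameterSpace = {"declared_len": (0, 255), "payload_len": (0, 64)}


def generate(declared_len: int, payload_len: int) -> bytes:
    """Build a well-formed record: magic, 1-byte declared length, payload."""
    return b"PB" + struct.pack("B", declared_len) + b"A" * payload_len


def boundary_values(low: int, high: int) -> List[int]:
    """Edge cases of an integer domain, clamped to stay inside it."""
    return sorted({low, min(low + 1, high), max(high - 1, low), high})


def sample(space: ParameterSpace, rng: random.Random, n: int) -> Iterator[Params]:
    """Mix boundary values and uniform draws for each parameter."""
    for _ in range(n):
        params: Params = {}
        for name, (low, high) in space.items():
            if rng.random() < 0.5:
                params[name] = rng.choice(boundary_values(low, high))
            else:
                params[name] = rng.randint(low, high)
        yield params


def target(data: bytes) -> bool:
    """Stub oracle: reports a crash when the declared length exceeds 8."""
    return data[:2] == b"PB" and data[2] > 8


def fuzz(concrete: Iterable[Params], space: ParameterSpace,
         seed: int = 0, budget: int = 200) -> Optional[Params]:
    """Stage 1 runs the hand-picked tuples; stage 2 samples the space."""
    rng = random.Random(seed)
    for params in chain(concrete, sample(space, rng, budget)):
        if target(generate(**params)):
            return params
    return None
```

Because sampling happens over semantic parameters rather than raw bytes, every candidate still starts with the magic bytes and carries a valid length field, so no test is wasted on inputs the parser would reject outright.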

5. Data Structures and Hypothesis-Drift Avoidance

All agent state is recorded in a single Markdown file, organized into phase-gated JSON blocks covering:

| Phase | JSON Blocks | Purpose |
| --- | --- | --- |
| PLAN | BugPredicates, Preconditions, RootCauses, TriggerPlans | Store constraints and root strategies |
| IMPLEMENT | ParameterSpace, FuzzPlan, Breakpoints | Manage input gen/fuzzing configuration |
| EXECUTE | Metrics | Log execution and exposure statistics |

Workflow enforcement is managed by the MCP calls (write_workflow_block, transition_phase), guaranteeing phase-appropriate, monotonic updates and precluding deletion of validated constraints. This memory model is essential for preventing the agent from “unlearning” validated knowledge (Zeng et al., 4 Dec 2025).
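
The monotonic, phase-gated memory can be approximated in a few lines. In this sketch the names `write_workflow_block` and `transition_phase` are borrowed from the text, but their signatures, the in-memory representation, and the permission table are assumptions; the actual tools operate on workflow_state.md.

```python
from typing import Dict, List, Set

# Assumed permission table, derived from the phase/block listing in this section.
PHASE_BLOCKS: Dict[str, Set[str]] = {
    "PLAN": {"BugPredicates", "Preconditions", "RootCauses", "TriggerPlans"},
    "IMPLEMENT": {"ParameterSpace", "FuzzPlan", "Breakpoints"},
    "EXECUTE": {"Metrics"},
}


class WorkflowState:
    """Append-only store: entries can be added but never edited or deleted."""

    def __init__(self) -> None:
        self.phase = "PLAN"
        self.blocks: Dict[str, List[dict]] = {}

    def write_workflow_block(self, block: str, entry: dict) -> None:
        """Append `entry` to `block` if the current phase may write it."""
        if block not in PHASE_BLOCKS.get(self.phase, set()):
            raise PermissionError(f"{block} is not writable in phase {self.phase}")
        # Store a copy so later changes to the caller's dict cannot alter history.
        self.blocks.setdefault(block, []).append(dict(entry))

    def transition_phase(self, new_phase: str) -> None:
        """Switch phase; a fuller version would also validate the transition."""
        self.phase = new_phase
```

Exposing only an append operation, with no update or delete, is the simplest way to guarantee that a validated constraint recorded in one iteration is still present in the next.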

6. Experimental Evaluation and Comparative Performance

PBFuzz was evaluated on the Magma benchmark (129 CVEs, 361 targets, 9 open-source projects). The experimental protocol compared PBFuzz to directed greybox fuzzers (AFLGo, SelectFuzz), coverage-guided AFL++ (±CmpLog), LLM-assisted fuzzers (G2Fuzz, Llamafuzz), and others. Baselines received 24h × 10 trials per target; PBFuzz ran a single 30-minute trial.

Metrics collected included CVE coverage, Time-to-Exposure (TTE), Time-to-Reach (TTR), reproducibility (10-run consistency), and LLM API token cost.

Results:

  • PBFuzz triggered 57/129 CVEs (AFL++/CmpLog: 49; AFLGo: 41; G2Fuzz: 16).
  • PBFuzz uniquely triggered 17 CVEs not exposed by other systems.
  • Median TTE: PBFuzz, 339 s; AFL++/CmpLog, 8,680 s (PBFuzz is 25.6× faster).
  • PBFuzz’s API cost: ≈2.18 million tokens/vulnerability (≈$1.83).
  • On 23 selected CVEs, PBFuzz achieved success in all 10 runs for 18 of them, exceeding baseline consistency.

Time-to-Exposure Summary:

| System | Median TTE (s) | Max TTE (s) |
| --- | --- | --- |
| PBFuzz | 339 | 1,441 |
| AFL++ + CmpLog | 8,680 | 85,322 |
| AFLGo | 6,170 | 85,200 |
| G2Fuzz | 5,400 | 68,400 |
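
The 25.6× figure follows directly from the median values in the summary. A short check of the arithmetic, using only the numbers reported above; the ratios for AFLGo and G2Fuzz are computed the same way and are not stated in the source.

```python
# Median time-to-exposure in seconds, copied from the summary.
median_tte = {
    "PBFuzz": 339,
    "AFL++ + CmpLog": 8680,
    "AFLGo": 6170,
    "G2Fuzz": 5400,
}

# Speedup of PBFuzz over each baseline: baseline median / PBFuzz median.
speedup = {
    name: round(seconds / median_tte["PBFuzz"], 1)
    for name, seconds in median_tte.items()
    if name != "PBFuzz"
}

for name, factor in speedup.items():
    print(f"{name}: {factor}x")
```

Note that these ratios compare medians only, and the baselines and PBFuzz ran under different time budgets (24 hours versus 30 minutes), so they describe time-to-exposure rather than overall cost.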

PBFuzz’s structure-preserving approach and hypothesis-driven workflow allowed it to trigger more CVEs in less time, and with modest resource consumption, compared to mutation-based and pure LLM-based approaches (Zeng et al., 4 Dec 2025).

7. Limitations and Prospects

Identified limitations and directions for future work include:

  • Incomplete handling of long-distance, implicit dependencies (e.g., XML attribute propagation), motivating future integration of taint or pointer analyses.
  • Occasional failure to enumerate rare encoding variants in certain binary formats (e.g., ASN.1 DER), suggesting a role for domain-specific input heuristics.
  • Bias toward well-formed input generation can cause missed exposure of vulnerabilities reliant on format violations, where traditional mutational fuzzers excel.
  • Assumption of pre-existing fuzz harnesses and explicit bug predicates; further work could address automated harness and predicate synthesis.
  • Potential extension to broader security tasks beyond PoV generation, such as validation of non-exploitable vulnerabilities and integration with end-to-end pipelines (e.g., AIxCC).

PBFuzz establishes a bridge between human-like semantic reasoning and scalable automated vulnerability discovery, demonstrating that agentic, memory-driven workflows with PBT can achieve both deep semantic constraint satisfaction and practical throughput (Zeng et al., 4 Dec 2025).
