PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy

Published 8 Apr 2026 in cs.CR | (2604.06618v2)

Abstract: While recent approaches leverage LLMs and multi-agent pipelines to automatically generate proof-of-concept (PoC) exploits from vulnerability reports, existing systems often suffer from two fundamental limitations: unreliable validation based on surface-level execution signals and high operational cost caused by extensive trial-and-error during exploit generation. In this paper, we present PoC-Adapt, an end-to-end framework for automated PoC generation and verification, architected upon a foundation semantic runtime validation and adaptive policy learning. At the core of PoC-Adapt is a Semantic Oracle that validates exploits by comparing structured pre- and post-execution system states, enabling reliable distinction between true vulnerability exploitation and incidental behavioral changes. To reduce exploration cost, we further introduce an Adaptive Policy Learning mechanism that learns an exploitation policy over semantic states and actions, guiding the exploit agent toward effective strategies with fewer failed attempts. PoC-Adapt is implemented as a multi-agent system comprising specialized agents for root cause analysis, environment building, exploit generation, and semantic validation, coordinated through structured feedback loops. Experimenting on the CWE-Bench-Java and PrimeVul benchmarks shows that PoC-Adapt significantly improves verification reliability by 25% and reduces exploit generation cost compared to prior LLM-based systems, highlighting the importance of semantic validation and learned action policies in automated vulnerability reproduction. Applied to the latest CVE corpus, PoC-Adapt confirmed 12 verified PoC out of 80 reproduce attempts at a cost of $0.42 per generated exploit

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper presents PoC-Adapt, a novel multi-agent system that uses a semantic oracle and DDQN-based adaptive policy to achieve a 25% relative improvement in reproduction success.
The paper employs a five-stage pipeline integrating context retrieval, exploit generation, and semantic verification to mitigate LLM hallucinations and streamline exploit generation.
The paper demonstrates significant efficiency gains by halving time-to-exploit and doubling exploit efficiency, validating its practicality on both controlled and real-world benchmarks.

Semantic-Aware Automated Vulnerability Reproduction via PoC-Adapt

Introduction and Motivation

The exponential increase in disclosed software vulnerabilities, coupled with evolving attack surfaces, has outpaced the capacity of manual vulnerability assessment and reproduction. Modern patch management, risk prioritization, and effective remediation fundamentally rely on verifiable Proof-of-Concept (PoC) exploits—yet the majority of vulnerability reports lack reproducible PoCs, impeding automated analysis and rapid defense. Recent LLM-based approaches to automated exploit generation have demonstrated potential, but remain hampered by unreliable validation (often dependent on superficial signals) and excessive resource consumption stemming from unguided trial-and-error.

PoC-Adapt addresses these deficiencies by architecting an end-to-end multi-agent system that coordinates LLM-driven agents via a semantic state-differencing oracle and a reinforcement learning-based adaptive exploitation policy. The framework targets robust, low-cost, and reliable PoC generation across diverse vulnerability types, focusing on reproducibility, practical deployment, and resilience to LLM hallucinations.

Figure 1: The overall PoC-Adapt architecture, illustrating agentized pipeline stages and data flow.

Framework Architecture

PoC-Adapt structures the PoC synthesis and validation pipeline into five sequential and modular stages: context retrieval, root cause analysis, environment setup, exploit generation, and semantic verification. Role-specialized LLM agents process each stage, passing structured context and feedback, and operate within tightly controlled tool access boundaries according to the principle of least privilege. The staged design supports traceable reasoning, modularity for debugging, and minimizes error propagation across pipeline boundaries.

The core architectural innovation is the Semantic Oracle, enabling the system to move beyond flag or crash-driven validation by capturing and analyzing structured pre- and post-execution system states, ensuring that exploit validation aligns with real-world impact rather than incidental or ambiguous behavioral changes. Feedback from semantic verification propagates upstream through the agent loop, allowing for focused refinement and reduced failure propagation.

Figure 2: The self-verification mechanism enables iterative intra- and inter-agent refinement via semantic feedback.

Figure 3: The adaptive policy learning mechanism integrates DDQN-guided decision making at the exploit generation stage.

The exploit generation process is further optimized by modeling agent decision-making as a Markov Decision Process (MDP), with actions and system context discretized and encoded into a low-dimensional state space. A Double Deep Q-Network (DDQN) manages exploitation policy, trained on replayed execution logs to maximize reproduction success while curbing fruitless token/compute consumption.

Key features of PoC-Adapt’s system-level design include:

Semantic Oracle Verification: Rigorous, agent-mediated pre- and post-state analysis aligned to an explicit impact hypothesis, minimizing false positives and neutralizing LLM hallucinations.
Adaptive Policy-Oriented Agent Coordination: DDQN learns successful action policies from historical trajectories, constraining LLM exploration within promising tool invocation sequences.
Agent Specialization and Closed Feedback Loops: RCA, Planner, Exploiter, and Validator agents interact in well-defined roles with structured outputs, self-verification, and multi-stage feedback propagation.
Figure 4: Policy model training pipeline for offline RL-driven exploitation policy learning.

Experimental Validation

Evaluation Settings and Metrics

Experiments evaluate PoC-Adapt on two benchmarks: FL-Bench-100 (100 curated vulnerabilities for controlled benchmarking) and GHSA-Real80 (80 real-world vulnerabilities from GitHub). Comparative baselines include FaultLine and a random policy agent. Performance metrics encompass reproduction success rate (SR), time-to-exploit (TTE), exploit efficiency (EE), token/monetary cost, and adaptive policy performance.

Numerical Results and Claims

SR on FL-Bench-100: 15% with PoC-Adapt vs. 12% with FaultLine—25% relative improvement.
TTE: Halved (16.33 vs. 35.92 steps, PoC-Adapt vs. FaultLine).
Exploit Efficiency: Doubled for PoC-Adapt (0.025 vs. 0.011).
GHSA-Real80 (real-world cases): 12/80 vulnerabilities reproduced (15% SR) at \$0.42 average cost per exploit; failures concentrated in environment reproduction.
Adaptive Policy: DDQN policy increases SR by 16.7% relative to non-adaptive pipeline, reduces TTE by half, and increases action efficiency by 5x relative to random baselines.
LLM Backend Robustness: Gemini-2.5-Pro yields best SR; DeepSeek-V3 cuts cost/latency but reduces SR; Qwen-3 provides a trade-off, confirming system flexibility.
Figure 5: Stage-wise analysis highlighting failure rates and points of fragility, especially within the environment reproduction phase.

Figure 6: Success rate breakdown by CWE on GHSA-Real80, illustrating highest SR for direct-impact vulnerability classes.

Analysis and Discussion

Strong numerical gains in SR and EE validate the effect of semantic state-differencing and RL-driven exploration. The adaptive exploitation policy particularly excels at minimizing action redundancy and sharply constraining the search space. Critically, the semantic oracle demonstrably diverts the system from typical LLM hallucination failure modes by enforcing observable, meaningful state transitions. Failure analysis emphasizes continuing challenges in automated environment setup, especially under heterogeneity of real-world OSS and infrastructural complexity.

The PoC-Adapt framework exposes several unresolved challenges:

Environment Setup Bottleneck: Operational failures trace primarily to the Planner stage due to configuration-induced complexity.
Oracle Sensitivity Limitations: Subtle vulnerabilities (e.g., deserialization) or medium-severity issues are less reliably captured due to limited semantic impact or ambiguous state manifestation.
Dependence on LLM Capability: Model substitution reveals meaningful performance degradation in lower-tier LLMs.
Replay Buffer Limitations: RL policy efficacy correlates with exploit trajectory diversity; coverage remains bounded by trajectory dataset scale and class distribution.

Practical implications include reduced operational costs, higher verification fidelity, and modularity for integration into CI/CD workflows. Theoretical implications highlight reinforcement learning’s role in taming LLM actuation entropy via policy learning, suggesting broad applicability for similar agentic, tool-using systems.

Future Directions

Future avenues include automating more robust environment setup via subtask decomposition and guided dependency resolution, extending semantic verification with RAG and multimodal agents for richer state observation, integration with UI-driven tools to cover broader exploit modalities, and fine-tuning LLMs on security-specific corpora to mitigate cost and privacy constraints.

Conclusion

PoC-Adapt advances the paradigm of automated PoC exploit generation by fusing multi-agent LLM pipelines with semantic runtime oracles and adaptive policy learning. Empirical results demonstrate improved reliability (25% relative SR gain over SOTA), resource efficiency, and robustness to LLM backend changes. The RL-driven adaptive policy systematizes exploit navigation, reducing wasteful exploration and establishing a foundation for scalable, economically viable vulnerability reproduction. Remaining bottlenecks in environment reproduction and subtle impact detection motivate continued research. Collectively, PoC-Adapt offers a substantial step towards reliable, automated, and semantically verifiable exploit generation that can underpin future vulnerability management at scale.

Markdown Report Issue