CVE-GENIE: Automated CVE Exploit Reproduction
- CVE-GENIE is a multi-agent framework powered by large language models that automates the reproduction and validation of real-world software vulnerabilities.
- It orchestrates environment reconstruction, exploit generation, and verification through modular agents and retrieval-augmented context enrichment.
- The system achieves reproducible CVE attack artifacts with success rates between 51% and 63%, providing robust benchmarks for empirical security research.
CVE-GENIE is an LLM-based, end-to-end multi-agent framework designed for the automated reproduction and validation of real-world software vulnerabilities from Common Vulnerabilities and Exposures (CVE) entries. By orchestrating environment setup, exploit generation, and verification in a fully automated pipeline, CVE-GENIE produces high-fidelity, reproducible CVE attack artifacts useful for benchmarking and empirical security research. The system operates across diverse CWE types, programming languages, and project environments, integrating retrieval-augmented context enrichment to address gaps and inconsistencies in public vulnerability disclosures. CVE-GENIE advances the state of the art in automated exploit reproduction, benchmark curation, and AI-assisted security tooling for vulnerability management (Ullah et al., 1 Sep 2025, Lotfi et al., 28 Sep 2025).
1. System Architecture and EAGER Design Principles
CVE-GENIE employs a modular, multi-agent architecture following the "EAGER" principle: Exploit Generation, Assessment, Generalization, End-to-end Automation, and Rebuilding vulnerable environments. The system is decomposed into four main modules, each built from cooperating agents with explicit responsibilities and tool access:
- Processor: Aggregates required CVE resources via a Data Processor and organizes extracted materials in a structured knowledge base (KB) using the Knowledge Builder.
- Builder: Reconstructs the vulnerable environment with the Pre-Requisite Developer Agent (scans repositories, identifies dependencies) and Setup Developer Agent (installs vulnerable versions, configures services), aided by a Setup Critic Agent for verification and error signaling.
- Exploiter: Generates or adapts proof-of-concept (PoC) exploits. The Exploit Developer Agent writes new exploits, or adapts existing ones, from advisories or code, with the Exploit Critic Agent validating exploit realism and relevance to the CVE.
- CTF Verifier: Produces a verification harness, emulating Capture-The-Flag (CTF) logic, with a Verifier Developer Agent constructing a script that executes the exploit and produces a "flag" if the vulnerability is successfully triggered; validation is performed by the Flag Checker and Verifier Critic Agent.
Each agent operates in a ReAct-style loop of artifact generation, tool invocation (e.g., shell commands, file I/O), and self-critique, with robust checkpointing to enable cost- and time-bounded automation (Ullah et al., 1 Sep 2025).
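The ReAct-style loop described above can be sketched as follows. This is a minimal illustration, not CVE-GENIE's actual implementation: the `react_loop`, `fake` tool-table, and checkpoint format are assumptions introduced here for clarity.

```python
# Minimal sketch of a ReAct-style agent loop with checkpointing.
# Names (react_loop, tools table, checkpoint format) are illustrative,
# not CVE-GENIE's API.
import json

MAX_STEPS = 20  # hard cap on agent iterations, bounding cost and time


def react_loop(task, llm, tools, checkpoint_path="checkpoint.json"):
    history = []
    for _step in range(MAX_STEPS):
        # 1. Ask the model for the next action given the task and history.
        action = llm(task, history)  # e.g. {"tool": "shell", "args": "pip list"}
        if action["tool"] == "finish":
            return action["args"]  # final artifact (setup script, exploit, ...)
        # 2. Invoke the named tool (shell command, file I/O, ...).
        observation = tools[action["tool"]](action["args"])
        # 3. Self-critique: the observation feeds into the next prompt turn.
        history.append({"action": action, "observation": observation})
        # 4. Checkpoint so an interrupted or over-budget run can resume.
        with open(checkpoint_path, "w") as f:
            json.dump(history, f)
    raise RuntimeError("step budget exhausted")
```

In this pattern, the critic agents correspond to extra LLM calls that inspect the accumulated `history` and either approve the artifact or signal an error for the next loop iteration.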
2. Automated Environment Reconstruction and Exploit Reproduction
The pipeline begins upon receipt of a CVE entry. The Data Processor locates the project's source code and supporting artifacts (e.g., security advisories, commits that fix the vulnerability), extracting affected versions, CWE types, and other metadata. The Pre-Requisite Developer Agent examines the repository to establish the set of files, services, and versions needed for reproduction. The Setup Developer Agent then orchestrates environment bootstrapping, performing installations (e.g., pip install sqlparse==0.4.4) and system configuration, while the Setup Critic Agent provides log-based automated troubleshooting (e.g., validating that the correct version was selected).
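The developer/critic hand-off for version pinning might look like the following sketch. The function names are hypothetical, not CVE-GENIE's API; only the pinned `sqlparse==0.4.4` example comes from the text above.

```python
# Hypothetical sketch of the Setup Developer/Critic hand-off: the developer
# installs a pinned vulnerable release; the critic parses `pip show` output
# from the logs to confirm the resolved version. Names are illustrative.
import subprocess
import sys


def install_pinned(package, version):
    # Setup Developer Agent: install the exact vulnerable release.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", f"{package}=={version}"],
        check=True, capture_output=True, text=True,
    )


def parse_installed_version(pip_show_output):
    # Setup Critic Agent: extract the resolved version from `pip show` logs.
    for line in pip_show_output.splitlines():
        if line.startswith("Version:"):
            return line.split(":", 1)[1].strip()
    return None


def critic_check(package, version):
    out = subprocess.run(
        [sys.executable, "-m", "pip", "show", package],
        check=True, capture_output=True, text=True,
    ).stdout
    installed = parse_installed_version(out)
    if installed != version:
        raise RuntimeError(f"expected {package}=={version}, got {installed}")

# Usage (inside the reconstructed environment):
#   install_pinned("sqlparse", "0.4.4"); critic_check("sqlparse", "0.4.4")
```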
The Exploit Developer Agent synthesizes or adapts a PoC to trigger the vulnerability. If a public PoC is available, it is reused (with adaptation as needed); otherwise, the agent analyzes the patch and codebase to generate a new exploit. The Exploit Critic Agent inspects execution logs to confirm exploit effectiveness and correct triggering (e.g., confirming that a module-specific exception is raised), rejecting artifacts that do not genuinely exercise the vulnerability.
In the CTF Verifier stage, an automated Python script is created to programmatically execute the exploit and assert success by outputting a deterministically specified flag (e.g., verifying a unique error or code path), with thorough traceback analysis ensuring the exploit's origin corresponds to the vulnerable module. Iterative loops provide one-turn feedback for minor errors (e.g., a misplaced assertion), with strict caps on agent step counts and compute costs per CVE (Ullah et al., 1 Sep 2025).
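A minimal verifier in this CTF style could look like the sketch below, assuming the exploit is a subprocess whose failure must be traced back to the vulnerable module. The flag string and module name are placeholders, not artifacts from the papers.

```python
# Illustrative CTF-style verifier: run the exploit, inspect the traceback,
# and emit a flag only when the failure originates in the vulnerable module.
# FLAG and VULN_MODULE are hypothetical placeholders.
import re
import subprocess

FLAG = "FLAG{cve_reproduced}"
VULN_MODULE = "sqlparse"  # assumed vulnerable component


def verify(exploit_cmd):
    proc = subprocess.run(exploit_cmd, capture_output=True, text=True)
    trace = proc.stderr
    # Flag Checker logic: success iff the crash/exception comes from the
    # vulnerable module rather than a broken harness or unrelated code path.
    if proc.returncode != 0 and re.search(rf"\b{VULN_MODULE}\b", trace):
        print(FLAG)
        return True
    return False
```

Deterministic flag output makes the reproduction machine-checkable: downstream tooling only needs to grep for the flag rather than re-interpret logs.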
3. Evaluation Results and Benchmarking
CVE-GENIE was evaluated on both small-scale and large-scale benchmarks. In a pilot consisting of 60 post-knowledge-cutoff CVEs, the framework achieved a reproduction success rate of 63.3% (38/60), spanning multiple languages, vulnerability types, and open-source projects. In large-scale automated reproduction across 841 CVEs (2024–2025), CVE-GENIE recreated 428 CVEs (51% success), producing a diverse, verifiable exploit dataset across 267 distinct projects, 141 CWE types, and 22 programming languages.
Resource metrics are tightly tracked: median elapsed time for successful reproductions was 18 minutes per CVE, with a mean cost of US$2.77. A hard budget of US$5 and 45 minutes per CVE is enforced. Failures are most commonly attributable to complex or broken build systems; an ablation study demonstrated that the multi-agent design and system modularity were necessary, as a single-LLM monolithic agent failed to reproduce any CVE successfully (Ullah et al., 1 Sep 2025).
| Evaluation Batch | # CVEs Processed | # CVEs Reproduced | Success Rate | Projects Covered | Mean Cost ($) |
|---|---|---|---|---|---|
| Small-scale (pilot) | 60 | 38 | 63.3% | - | - |
| Large-scale | 841 | 428 | 51% | 267 | 2.77 |
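The per-CVE budget guard described above (US$5 and 45 minutes) can be sketched as a simple accounting wrapper; the class and field names here are illustrative, not drawn from the papers.

```python
# Sketch of the per-CVE budget guard implied by the evaluation setup:
# abort once accumulated cost exceeds $5 or wall time exceeds 45 minutes.
# Class and method names are illustrative.
import time

COST_CAP_USD = 5.00
TIME_CAP_SEC = 45 * 60


class BudgetExceeded(Exception):
    pass


class Budget:
    def __init__(self):
        self.start = time.monotonic()
        self.cost = 0.0

    def charge(self, usd):
        # Called after each LLM/tool invocation with its dollar cost.
        self.cost += usd
        if self.cost > COST_CAP_USD:
            raise BudgetExceeded(f"cost ${self.cost:.2f} over ${COST_CAP_USD} cap")
        if time.monotonic() - self.start > TIME_CAP_SEC:
            raise BudgetExceeded("45-minute wall-clock cap exceeded")
```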
4. Technical Approaches: LLM Integration and RAG
CVE-GENIE leverages LLM-based reasoning and Retrieval-Augmented Generation (RAG) to address incomplete, noisy, or inconsistent CVE disclosures. RAG is used to supplement official CVE JSON data with public advisories, Open Source Intelligence (OSINT), CWE taxonomy references, and code snippets, significantly enriching context for environment and exploit generation. The pipeline employs iteratively structured prompts with repeated context to ensure that the LLM maintains detailed problem specifications throughout multi-step processes (Lotfi et al., 28 Sep 2025).
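The enrichment step can be pictured as merging the official CVE record with retrieved snippets before prompting, with the problem specification repeated around the retrieved context. This sketch assumes a generic `retriever` callable and a simple prompt layout; neither is the papers' exact design.

```python
# Hedged sketch of retrieval-augmented enrichment: official CVE JSON is
# merged with retrieved advisory/OSINT/CWE snippets before prompting the
# LLM. The retriever interface and prompt format are assumptions.


def enrich_cve_context(cve_json, retriever, top_k=5):
    query = f"{cve_json['id']} {cve_json.get('description', '')}"
    snippets = retriever(query)[:top_k]  # advisories, CWE refs, code snippets
    context = "\n\n".join(snippets)
    # Repeat the core problem spec around the retrieved context so the model
    # keeps the specification in view across multi-step prompting.
    return (
        f"CVE: {cve_json['id']}\n"
        f"Description: {cve_json.get('description', '')}\n"
        f"---- retrieved context ----\n{context}\n"
        f"---- task ----\nReproduce {cve_json['id']} end to end."
    )
```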
Containerization (primarily via Docker) is used to synthesize test environments, supporting single and multi-container topologies as required (e.g., server-client exploits, dependency-specific attack chains). The LLM is guided through environment creation, dependency isolation, and the generation/refinement of exploit code. Automated artifact validation involves static and dynamic analysis tools (e.g., AddressSanitizer, gdb), and behavior-oriented test cases are programmatically generated and executed to confirm exploit success. The model-agnostic pipeline allows deployment with any state-of-the-art LLM (e.g., GPT-4, Qwen3, Mistral AI, Llama3, Gemini), increasing versatility and reproducibility (Lotfi et al., 28 Sep 2025).
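A multi-container topology of the kind mentioned above (e.g., a server-client exploit) might be expressed as a Docker Compose fragment like the following. The service names, image tags, and command are hypothetical, introduced only to illustrate the shape of such an environment.

```yaml
# Hypothetical two-container topology for a server-client CVE reproduction;
# image tags, service names, and paths are illustrative.
services:
  vulnerable-server:
    image: vuln/app:affected-version   # pinned vulnerable release
    ports:
      - "8080:8080"
  exploit-client:
    build: ./exploiter                 # container running the generated PoC
    depends_on:
      - vulnerable-server
    command: ["python", "exploit.py", "--target", "http://vulnerable-server:8080"]
```

Single-container cases (e.g., a library-level vulnerability) collapse to just the first service plus an exec'd exploit script.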
5. Impact and Applications in Software Security
CVE-GENIE creates high-quality, reproducible vulnerability datasets with associated exploit and verifier artifacts, enabling:
- Fuzzer Evaluation: Ground-truth benchmarks of real vulnerabilities for quantifiable comparison of dynamic analysis and fuzz testing techniques.
- Vulnerability Patching and Detection: Reliable datasets for automated patch synthesis, patch verification, and machine learning–based vulnerability discovery.
- AI Security Assessment: Assessment of LLM capabilities for code understanding, exploit synthesis, and complex tool-based reasoning under tight resource constraints.
- Penetration Testing: Automated construction of realistic exploitation chains, aiding red team assessment with verifiable success criteria.
- Academic and Tool Development: Open-source release of full pipeline and artifact datasets supports reproducible research, peer review, and methodologically sound security evaluation.
The system also exposes, and empirically validates, inconsistencies in a portion of public CVE disclosures, underscoring the need for systematic, automated verification in contemporary vulnerability management (Ullah et al., 1 Sep 2025, Lotfi et al., 28 Sep 2025).
6. Limitations and Future Directions
CVE-GENIE currently prioritizes command-line interface (CLI) reproduction; vulnerabilities that require web UI or graphical interaction are beyond the current system's supported feature set. The authors articulate the need for future integration of multimodal LLMs and supplementary tooling to process image/video-based PoCs, extending coverage to additional CVE classes.
Additionally, critic agent feedback sometimes errs on the side of avoiding false positives, which can cause subtle but successful exploitations to be rejected; a better balance between strictness and recall is identified as a future research direction. The cost per CVE is modest but may be further reduced via more efficient agent orchestration and model fine-tuning.
Automated detection and benchmarking of real-world vulnerabilities can be further advanced by enhancing RAG with better-contextualized external sources, supporting patch recommendation and multi-stage exploitation scenarios, and expanding the coverage of post-exploit validation logic for complex distributed systems.
7. Open-Source and Community Implications
All generated vulnerability environments, exploit/verifier artifacts, and system infrastructure are open-sourced by the authors, driving reproducibility and adoption in the broader cybersecurity community (Lotfi et al., 28 Sep 2025). This enables practitioners and researchers to benchmark tools, validate findings, and accelerate collective progress in automated vulnerability management. The release of such benchmarks is crucial to substantiating claims of security tool effectiveness and evolving best practices in software security evaluation.
CVE-GENIE represents a methodological advance in the automated reproduction, validation, and benchmarking of real-world software vulnerabilities, synthesizing LLM-driven multi-agent orchestration, RAG-based context enrichment, and reproducibility logic to drive progress in vulnerability management, security research, and empirical tool evaluation.