SusVibes: Security Benchmark for LLM Code
- SusVibes is a large-scale benchmark that evaluates LLM-generated code on both functional correctness and security in realistic scenarios.
- It compiles 200 feature-request tasks from real-world Python repositories, covering 77 distinct CWE vulnerabilities.
- Findings reveal that over 80% of functionally correct patches remain vulnerable, underscoring the need for security-focused training and analysis.
SusVibes is a large-scale benchmark and evaluation framework for assessing the security and correctness of code written by LLM coding agents in realistic feature-request scenarios. Designed to address the acute problem of vulnerabilities in agent-generated software, SusVibes systematically measures both functional correctness and security guarantees across a curated suite of tasks derived from real-world open-source repositories. It was introduced in "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (Zhao et al., 2 Dec 2025).
1. Benchmark Construction and Dataset Properties
SusVibes comprises 200 feature-request tasks mined from real-world Python repositories. Each task specifically reconstructs scenarios in which an original human-authored implementation was later repaired due to a documented vulnerability. The construction pipeline proceeds via four phases:
- Vulnerability Commit Selection: Security-fixing commits (~20,000, filtered to Python ≥ 3.7 and to commits that modify or add test cases) were collected from the ReposVul and MoreFixes datasets. After enforcing the presence of test-modifying diffs, ~3,000 candidates remained; a diverse, domain-representative subset of 200 tasks was selected, spanning 108 projects across 10 application domains.
- Security-Test Harvesting: The security-fix commit was examined for newly added tests, which directly became the task’s security assertion suite.
- Functional-Test and Mask Generation: The commit immediately preceding the fix, still containing the vulnerability, provided the functional test suite. The region of code altered in the fix was masked via automated deletion patches, yielding a “blank” repository.
- Task Description Synthesis and Verification: A separate LLM agent constructed user-facing, GitHub-style feature requests devoid of hints from the ground-truth patch. Completeness of the mask and task description was validated by ensuring that:
  - the masked repository fails both the security and functional suites;
  - the vulnerable commit passes only the functional suite;
  - the security-fix commit passes both.
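The three verification conditions amount to a pass/fail matrix over the three repository states. The sketch below uses hypothetical names (`validate_task` and the state keys are illustrative, not from the paper):

```python
# Sketch of SusVibes' task-validation check (names are illustrative).
# Each repository state is summarized by a pair
# (passes_functional_suite, passes_security_suite).

def validate_task(results):
    """results maps a repo state to its (functional, security) pass pattern."""
    return (
        results["masked"] == (False, False)         # blank repo fails both suites
        and results["vulnerable"] == (True, False)  # pre-fix passes only functional
        and results["fixed"] == (True, True)        # security fix passes both
    )

well_formed = {
    "masked": (False, False),
    "vulnerable": (True, False),
    "fixed": (True, True),
}
print(validate_task(well_formed))  # True
```

Any other pattern, such as a vulnerable commit that already passes the security suite, marks the task as mis-masked and rejects it.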
The final suite covers 77 unique MITRE CWEs. The breakdown of leading vulnerability categories is as follows:
| CWE Class | Count | Percentage |
|---|---|---|
| Improper Input Validation (CWE-20) | 22 | 11 % |
| Injection Flaws (CWE-79/89/78) | 18 | 9 % |
| Auth Bypass (CWE-285/287) | 27 | 13.5 % |
| Open Redirects (CWE-601) | 10 | 5 % |
| Crypto Issues (CWE-327/etc.) | 16 | 8 % |
| Side Channel (CWE-208) | 14 | 7 % |
| Path Traversal (CWE-22/200) | 20 | 10 % |
| Other (62 CWEs) | 44 | ~22 % |
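To make these categories concrete, the snippet below illustrates a CWE-22 path traversal flaw and its typical fix. It is a generic example, not code from the benchmark; `BASE_DIR` and the function names are invented:

```python
import os

BASE_DIR = "/srv/app/uploads"  # hypothetical upload root

def resolve_vulnerable(name):
    # CWE-22: a user-supplied name like "../../etc/passwd" escapes BASE_DIR.
    return os.path.join(BASE_DIR, name)

def resolve_fixed(name):
    # Typical fix pattern: canonicalize the joined path, then verify containment.
    path = os.path.realpath(os.path.join(BASE_DIR, name))
    if not path.startswith(os.path.realpath(BASE_DIR) + os.sep):
        raise ValueError("path escapes the upload directory")
    return path
```

The vulnerable variant happily joins `"../../etc/passwd"` into the upload root; the fixed variant canonicalizes the path and rejects anything that resolves outside it.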
2. Evaluation Methodology and Metrics
Evaluations in SusVibes are strictly execution-based. Each task requires the LLM code agent to generate a patch implementing the desired feature on a masked repository. Two key metrics are defined:
- Functional Correctness Rate: CR = N_func / N, where N_func is the number of patches that pass the task’s functional test suite and N = 200 is the total number of tasks.
- Security Success Rate: SR = N_sec / N, where N_sec is the number of patches that pass both the functional and the security test suites.
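A minimal way to compute the two rates from per-task outcomes (data and function names are illustrative):

```python
# Sketch of SusVibes' two metrics over per-task outcomes (illustrative data).
# Each record is (passes_functional_suite, passes_security_suite).

def correctness_rate(outcomes):
    return sum(f for f, _ in outcomes) / len(outcomes)

def security_success_rate(outcomes):
    # A patch counts as secure only if it also passes the functional suite.
    return sum(f and s for f, s in outcomes) / len(outcomes)

outcomes = [(True, True), (True, False), (False, False), (True, False)]
print(correctness_rate(outcomes))       # 0.75
print(security_success_rate(outcomes))  # 0.25
```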
Tasks are provided as Dockerized projects for reproducibility; all agents interact with the codebase via automated workflows that permit reading files, editing code, and running validations. No static analyzers are employed; only dynamic test suites are considered for pass/fail status.
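The execution-based protocol can be sketched as follows; the image name, suite paths, and helper names are assumptions, not the benchmark's actual layout:

```python
import subprocess

def run_suite(image, suite_path, runner=subprocess.run):
    """Run one pytest suite inside the task's Docker image; only pass/fail counts."""
    proc = runner(
        ["docker", "run", "--rm", image, "pytest", "-q", suite_path],
        capture_output=True, text=True,
    )
    return proc.returncode == 0  # the dynamic result is the sole verdict

def evaluate_patch(image, runner=subprocess.run):
    # Suite paths are illustrative; SusVibes ships per-task functional and
    # security suites harvested from the real repository history.
    return {
        "functional": run_suite(image, "tests/functional", runner),
        "security": run_suite(image, "tests/security", runner),
    }
```

Injecting the `runner` keeps the verdict logic testable without Docker; in the real harness only the container's exit status matters, since no static analyzers participate in scoring.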
3. Agent Frameworks and Experimental Protocol
Three agentic frameworks and three LLM models were benchmarked:
- Agent Frameworks:
- SWE-Agent (Python-oriented, multi-turn)
- OpenHands (agentic LLM wrapper)
- Claude Code (Anthropic’s standalone coding agent)
- LLMs:
- Claude 4 Sonnet
- Kimi K2
- Gemini 2.5 Pro
Each agent receives the feature request, optionally prefixed with a generic security reminder. Up to 200 agent steps (plan, edit, test) are permitted per task.
Functionality and security are assessed solely by test-suite results. Each agent-task run proceeds independently within the 200-step budget, which the agent may spend on iterative refinement. CWE annotations or hints are intentionally withheld except under the explicit augmentation strategies described below.
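The protocol reduces to a bounded interaction loop; `agent_step` and `run_tests` below are hypothetical stand-ins for the real frameworks (SWE-Agent, OpenHands, Claude Code):

```python
# Illustrative sketch of the experimental protocol: up to 200 plan/edit/test
# steps per task, with only dynamic test results determining the outcome.

MAX_STEPS = 200

def solve_task(agent_step, run_tests):
    for step in range(MAX_STEPS):
        action = agent_step(step)  # plan, read/edit files, or run validations
        if action == "submit":
            break
    return run_tests()  # final verdict comes solely from the test suites

# Toy usage: an "agent" that submits on its first step against stub suites.
result = solve_task(lambda step: "submit",
                    lambda: {"functional": True, "security": False})
print(result)  # {'functional': True, 'security': False}
```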
4. Results: Functional and Security Performance
Quantitative findings reveal a persistent, high-severity gap between functionality and security in LLM-agent–generated patches. Nine agent × LLM combinations (three frameworks × three models) were evaluated on the 200 tasks; headline results include (Zhao et al., 2 Dec 2025):

| LLM | SWE-Agent CR (%) | SWE-Agent SR (%) | OpenHands CR (%) |
|---|---|---|---|
| Claude 4 Sonnet | 61.0 | 10.5 | 49.5 |
| Kimi K2 | 22.5 | 6.0 | 37.0 |
| Gemini 2.5 Pro | 19.5 | 7.0 | 21.5 |
Key observations:
- The top-performing configuration (SWE-Agent + Claude 4 Sonnet) reaches CR = 61.0% but only SR = 10.5%.
- Over 80% of functionally correct implementations remain insecure.
- No LLM–agent pairings achieve secure-patch rates above 12.5%, demonstrating the pervasiveness of vulnerability risks.
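The “over 80%” claim follows directly from the headline numbers: among functionally correct patches, only the fraction SR/CR also passes the security suite.

```python
# Fraction of functionally correct patches that remain vulnerable, using the
# best configuration's headline numbers (SWE-Agent + Claude 4 Sonnet).
cr, sr = 0.610, 0.105
insecure_given_correct = 1 - sr / cr
print(f"{insecure_given_correct:.1%}")  # 82.8%
```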
5. Security Augmentation Strategies and Outcomes
SusVibes experimentally evaluates two naive security-augmentation strategies:
- Self-selection CWE: Agents select relevant CWEs for the task and are instructed to implement appropriate mitigations.
- Oracle CWE: Agents are directly told with oracle-level precision which CWE is present and are tasked to avoid introducing it.
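The two strategies amount to different prompt prefixes on top of the feature request; the wording below is an illustrative reconstruction, not the paper's exact prompts:

```python
# Hypothetical sketch of the augmentation prompts (wording is invented).

def build_prompt(feature_request, strategy, oracle_cwe=None):
    if strategy == "self-selection":
        extra = ("First identify the CWE weakness classes most relevant to this "
                 "change, then implement mitigations for them.")
    elif strategy == "oracle":
        extra = f"The reference fix addresses {oracle_cwe}; do not introduce this weakness."
    else:  # generic security reminder used in the baseline runs
        extra = "Please write secure code."
    return extra + "\n\n" + feature_request

print(build_prompt("Add an upload endpoint.", "oracle", "CWE-22"))
```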
| Strategy | CR (%) | SR (%) |
|---|---|---|
| Generic | 61.0 | 10.5 |
| Self-selection | 52.5 | 9.5 |
| Oracle CWE | 56.0 | 10.5 |
Both augmentation approaches either degrade security success (self-selection drops SR by 1.0 pp) or fail to improve it (oracle CWE stays at 10.5 %), while functional correctness falls by 8.5 and 5.0 pp respectively. In several cases, agents focus on the labeled CWE at the expense of general functional correctness, without compensating reductions in vulnerability rates. Analysis indicates a high degree of “correct → incorrect” flips counterbalancing any secure recoveries.
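The flip analysis can be reproduced schematically by comparing per-task functional outcomes under the two prompts (the data below is invented for illustration):

```python
# Sketch of the "flip" bookkeeping: per-task functional pass/fail under the
# generic prompt vs. an augmented prompt (illustrative data, not the paper's).

def count_flips(base, augmented):
    return {
        "correct_to_incorrect": sum(b and not a for b, a in zip(base, augmented)),
        "incorrect_to_correct": sum(a and not b for b, a in zip(base, augmented)),
    }

base      = [True, True, False, True, False]   # functional pass, generic prompt
augmented = [False, True, False, True, True]   # functional pass, oracle-CWE prompt
print(count_flips(base, augmented))  # {'correct_to_incorrect': 1, 'incorrect_to_correct': 1}
```

When the two flip counts roughly cancel, the aggregate CR barely moves even though the augmentation perturbs many individual tasks.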
6. Implications, Limitations, and Recommendations
SusVibes demonstrates that LLM coding agents, even with state-of-the-art models and multi-turn feedback, generate patches that are predominantly insecure under realistic development constraints. This security deficit holds across agent architectures and persists despite explicit mitigation hints.
Major recommendations include:
- LLM-generated code should not be deployed in security-sensitive contexts without additional review or automated analysis.
- Security must be promoted to a first-class objective, necessitating integration of dynamic fuzzers, static taint analyzers, and other property-based adversarial techniques into the agent feedback loop.
- Prompting and basic documentation are insufficient; fine-tuning or reinforcement learning approaches rewarding security test successes are needed.
- Benchmark expansion to additional languages and richer vulnerability models is encouraged to drive progress.
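One way to realize the feedback-loop recommendation is to gate patch acceptance on a conjunction of checks rather than the functional suite alone; the checker composition below is a hypothetical sketch, not part of SusVibes:

```python
# Hypothetical security gate for an agent loop: a patch is accepted only if it
# passes the functional tests AND every registered security check (dynamic
# security tests, a taint analyzer, a fuzz pass, ...). Checkers are stubs here.

def accept_patch(passes_functional, security_checks):
    return passes_functional and all(check() for check in security_checks)

checks = [
    lambda: True,  # stand-in: harvested security test suite passed
    lambda: True,  # stand-in: static taint analyzer found no new flows
]
print(accept_patch(True, checks))                    # True
print(accept_patch(True, checks + [lambda: False]))  # False
```

Feeding each checker's failure back to the agent, rather than just a pass/fail bit, is the kind of richer signal the recommendations above call for.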
A plausible implication is that agentic development workflows (“vibe coding”) will remain unsuitable for direct production deployment until security-oriented agent training and real-time vulnerability detection methodologies are significantly advanced.
7. Contextual Significance and Future Directions
SusVibes establishes a rigorously validated, reproducible repository-level evaluation protocol for LLM coding agents and highlights the acute limitations of present state-of-the-art solutions. It provides a reference dataset covering a wide spectrum of software vulnerability categories—77 distinct CWEs—and robustly demonstrates deficiencies in current approaches (Zhao et al., 2 Dec 2025).
Future work is likely to involve:
- Extension to additional programming languages and frameworks.
- Development and systematic incorporation of security-focused training objectives for generative agents.
- Designing adversarial test generation strategies to dynamically probe security-relevant properties.
- Deeper investigation of domain-specific and cross-category vulnerability mitigation in language agents.
By systematically exposing the security/functionality dichotomy in agent-generated code, SusVibes foregrounds the need for the software engineering community to approach adoption of LLM coding assistants with substantial caution and to accelerate research at the intersection of AI-assisted programming and automated security analysis.