SusVibes: Security Benchmark for LLM Code

Updated 25 February 2026
  • SusVibes is a large-scale benchmark that evaluates LLM-generated code on both functional correctness and security in realistic scenarios.
  • It compiles 200 feature-request tasks from real-world Python repositories, covering 77 distinct CWE categories.
  • Findings reveal that over 80% of functionally correct patches remain vulnerable, motivating security-focused training and analysis.

SusVibes is a large-scale benchmark and evaluation framework for assessing the security and correctness of code written by LLM coding agents in realistic feature-request scenarios. Designed to address the acute problem of vulnerabilities in agent-generated software, SusVibes systematically measures both functional correctness and security guarantees across a curated suite of tasks derived from real-world open-source repositories. It was introduced in "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (Zhao et al., 2 Dec 2025).

1. Benchmark Construction and Dataset Properties

SusVibes comprises 200 feature-request tasks mined from real-world Python repositories. Each task reconstructs a scenario in which an original human-authored implementation was later repaired to fix a documented vulnerability. The construction pipeline proceeds in four phases:

  1. Vulnerability Commit Selection: Security-fixing commits (~20,000, filtered to Python ≥ 3.7 and to commits modifying or adding test cases) were collected from vulnerability datasets such as ReposVul and MoreFixes. After enforcing the presence of test-modifying diffs, ~3,000 candidates remained; a diverse, domain-representative subset of 200 tasks was selected, spanning 108 projects across 10 application domains.
  2. Security-Test Harvesting: The security-fix commit ($\mathcal{C}_0$) was examined for newly added tests, which directly became the task’s security assertion suite $\mathcal{T}_{secure}$.
  3. Functional-Test and Mask Generation: The prior commit ($\mathcal{C}_{-1}$), still containing the vulnerability, provided the functional test suite $\mathcal{T}_{func}$. The region of code altered in the fix was masked via automated deletion patches, yielding a “blank” repository ($\mathcal{C}_{-1}^{\mathcal{M}}$).
  4. Task Description Synthesis and Verification:

A separate LLM agent constructed user-facing, GitHub-style feature requests devoid of hints from the ground-truth patch. Completeness of the mask and task description was validated by ensuring that:

  • The masked repository fails both the security and functional suites.
  • The vulnerable commit passes only the functional suite.
  • The security fix passes both.
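These oracle conditions translate directly into executable checks. The following is a minimal sketch, assuming a hypothetical run_suite helper and conventional suite paths (the benchmark's actual Dockerized harness and repository layout may differ):

```python
import subprocess

def run_suite(repo_dir: str, suite: str) -> bool:
    """Hypothetical helper: run one pytest suite in a checked-out
    revision and report pass/fail via the exit code."""
    result = subprocess.run(
        ["python", "-m", "pytest", suite, "-q"],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0

def validate_task(masked: str, vulnerable: str, fixed: str,
                  t_func: str = "tests/functional",
                  t_secure: str = "tests/security") -> bool:
    """Accept a candidate task only if all three conditions hold."""
    return (
        # 1. The masked repository fails both suites.
        not run_suite(masked, t_func) and not run_suite(masked, t_secure)
        # 2. The vulnerable commit passes only the functional suite.
        and run_suite(vulnerable, t_func) and not run_suite(vulnerable, t_secure)
        # 3. The security fix passes both.
        and run_suite(fixed, t_func) and run_suite(fixed, t_secure)
    )
```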

The final suite covers 77 unique MITRE CWEs. The breakdown of leading vulnerability categories is as follows:

CWE Class                            Count   Percentage
Improper Input Validation (CWE-20)     22      11 %
Injection Flaws (CWE-79/89/78)         18       9 %
Auth Bypass (CWE-285/287)              27      13.5 %
Open Redirects (CWE-601)               10       5 %
Crypto Issues (CWE-327/etc.)           16       8 %
Side Channel (CWE-208)                 14       7 %
Path Traversal (CWE-22/200)            20      10 %
Other (62 CWEs)                        44     ~22 %

2. Evaluation Methodology and Metrics

Evaluations in SusVibes are strictly execution-based. Each task requires the LLM code agent to generate a patch implementing the desired feature on a masked repository. Two key metrics are defined:

  • Functional Correctness Rate ($CR$):

$CR = \frac{N_{correct}}{N_{total}}$

where $N_{correct}$ counts patches that pass the functional test suite.

  • Security Success Rate ($SR$):

$SR = \frac{N_{secure}}{N_{total}}$

where $N_{secure}$ counts patches that pass both the functional and security test suites.
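Both rates reduce to counting per-task verdicts; note that $SR \leq CR$ by construction, since a secure patch must also be functionally correct. A minimal sketch (the TaskResult record is illustrative, not the benchmark's actual API):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    functional_pass: bool  # patch passed the functional suite
    security_pass: bool    # patch passed the security suite

def correctness_rate(results: list[TaskResult]) -> float:
    """CR = N_correct / N_total."""
    return sum(r.functional_pass for r in results) / len(results)

def security_success_rate(results: list[TaskResult]) -> float:
    """SR = N_secure / N_total; a patch counts as secure only if it
    passes BOTH suites, so SR can never exceed CR."""
    return sum(r.functional_pass and r.security_pass
               for r in results) / len(results)
```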

Tasks are provided as Dockerized projects for reproducibility; all agents interact with the codebase via automated workflows that permit reading files, editing code, and running validations. No static analyzers are employed; only dynamic test suites are considered for pass/fail status.
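In that setup, a per-task verdict can be obtained purely by executing the suites inside the task's container. A sketch under assumed conventions (the image name, suite paths, and timeout are illustrative, not the benchmark's actual layout):

```python
import subprocess

def evaluate_patch(image: str, timeout: int = 600) -> tuple[bool, bool]:
    """Return (functional_pass, security_pass) for a patched task image.
    Verdicts come solely from dynamic test execution; no static analysis."""
    def passes(suite: str) -> bool:
        try:
            proc = subprocess.run(
                ["docker", "run", "--rm", image,
                 "python", "-m", "pytest", suite, "-q"],
                capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # hung or non-terminating tests count as failures
        return proc.returncode == 0

    return passes("tests/functional"), passes("tests/security")
```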

3. Agent Frameworks and Experimental Protocol

Three agentic frameworks (SWE-Agent, OpenHands, and Claude Code) and three LLMs (Claude 4 Sonnet, Kimi K2, and Gemini 2.5 Pro) were benchmarked, yielding nine agent × LLM combinations.

Each agent receives the feature request, optionally prefixed with a generic security reminder. Up to 200 agent steps (plan, edit, test) are permitted per task.

Functionality and security are assessed solely by test-suite results. Each agent-task combination is run independently within the 200-step budget for iterative refinement. CWE annotations and hints are intentionally withheld except under the explicit augmentation strategies described below.
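The interaction pattern can be summarized as a step-budgeted loop. A schematic sketch, where the Agent and Workspace interfaces are hypothetical stand-ins for the three frameworks' internal APIs:

```python
from typing import Protocol

class Agent(Protocol):
    def reset(self, prompt: str) -> None: ...
    def next_action(self) -> str: ...
    def observe(self, result: str) -> None: ...
    def finished(self) -> bool: ...

class Workspace(Protocol):
    def execute(self, action: str) -> str: ...
    def diff(self) -> str: ...

MAX_STEPS = 200  # per-task budget of (plan, read, edit, test) actions

def solve_task(agent: Agent, workspace: Workspace,
               feature_request: str, security_reminder: str = "") -> str:
    """Schematic per-task loop; Agent/Workspace are hypothetical
    stand-ins, not the benchmark's or the frameworks' real APIs."""
    agent.reset((security_reminder + "\n\n" + feature_request).strip())
    for _ in range(MAX_STEPS):
        action = agent.next_action()
        # The workspace exposes source files and the functional tests;
        # the held-out security suite is never visible to the agent.
        agent.observe(workspace.execute(action))
        if agent.finished():
            break
    return workspace.diff()  # final patch, graded by the hidden suites
```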

4. Results: Functional and Security Performance

Quantitative findings reveal a persistent, high-severity gap between functionality and security in LLM-agent–generated patches. The following table summarizes the main results for nine agent × LLM combinations on 200 tasks (Zhao et al., 2 Dec 2025):

LLM               SWE-Agent CR (%)   SWE-Agent SR (%)   OpenHands CR (%)
Claude 4 Sonnet         61.0               10.5               49.5
Kimi K2                 22.5                6.0               37.0
Gemini 2.5 Pro          19.5                7.0               21.5

Key observations:

  • The top-performing configuration (SWE-Agent + Claude 4 Sonnet) reaches $CR = 61\%$ but only $SR = 10.5\%$.
  • Over 80% of functionally correct implementations remain insecure: for the best pairing, only $10.5/61.0 \approx 17\%$ of correct patches also pass the security suite.
  • No LLM–agent pairings achieve secure-patch rates above 12.5%, demonstrating the pervasiveness of vulnerability risks.

5. Security Augmentation Strategies and Outcomes

SusVibes experimentally evaluates two naive security-augmentation strategies, sketched in code after the analysis below:

  • Self-selection CWE: Agents select relevant CWEs for the task and are instructed to implement appropriate mitigations.
  • Oracle CWE: Agents are directly told with oracle-level precision which CWE is present and are tasked to avoid introducing it.

Strategy         CR (%)   SR (%)
Generic           61.0     10.5
Self-selection    52.5      9.5
Oracle CWE        56.0     10.5

Both augmentation approaches either degrade functional correctness (by as much as 8.5 percentage points) or fail to improve security success. In several cases, agents focus on the labeled CWE at the expense of general functional correctness, without compensating reductions in vulnerability rates. Analysis indicates that “correct → incorrect” flips largely counterbalance any newly secured patches.
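At the prompt level, the three conditions differ only in the hint prepended to the feature request. An illustrative composition (the wording is hypothetical, not the paper's exact templates):

```python
def build_prompt(feature_request: str, strategy: str,
                 oracle_cwe: str | None = None) -> str:
    """Illustrative prompt composition for the three strategies studied;
    hint texts are assumptions, not the benchmark's actual templates."""
    if strategy == "generic":
        hint = "Keep security in mind while implementing this feature."
    elif strategy == "self_selection":
        hint = ("First list the CWE classes most relevant to this task, "
                "then implement the feature with mitigations for them.")
    elif strategy == "oracle":
        hint = (f"This task is prone to {oracle_cwe}. "
                "Implement the feature without introducing it.")
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return f"{hint}\n\n{feature_request}"
```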

6. Implications, Limitations, and Recommendations

SusVibes demonstrates that LLM coding agents, even with state-of-the-art models and multi-turn feedback, generate patches that are predominantly insecure under realistic development constraints. This security deficit holds across agent architectures and persists despite explicit mitigation hints.

Major recommendations include:

  • LLM-generated code should not be deployed in security-sensitive contexts without additional review or automated analysis.
  • Security must be promoted to a first-class objective, necessitating integration of dynamic fuzzers, static taint analyzers, and other property-based adversarial techniques into the agent feedback loop.
  • Prompting and basic documentation are insufficient; fine-tuning or reinforcement learning approaches rewarding security test successes are needed.
  • Benchmark expansion to additional languages and richer vulnerability models is encouraged to drive progress.

A plausible implication is that agentic development workflows (“vibe coding”) will remain unsuitable for direct production deployment until security-oriented agent training and real-time vulnerability detection methodologies are significantly advanced.

7. Contextual Significance and Future Directions

SusVibes establishes a rigorously validated, reproducible repository-level evaluation protocol for LLM coding agents and highlights the acute limitations of present state-of-the-art solutions. It provides a reference dataset covering a wide spectrum of software vulnerability categories—77 distinct CWEs—and robustly demonstrates deficiencies in current approaches (Zhao et al., 2 Dec 2025).

Future work is likely to involve:

  • Extension to additional programming languages and frameworks.
  • Development and systematic incorporation of security-focused training objectives for generative agents.
  • Designing adversarial test generation strategies to dynamically probe security-relevant properties.
  • Deeper investigation of domain-specific and cross-category vulnerability mitigation in language agents.

By systematically exposing the security/functionality gap in agent-generated code, SusVibes foregrounds the need for the software engineering community to approach LLM coding-assistant adoption with substantial caution and to accelerate research at the intersection of AI-assisted programming and automated security analysis.

References

  1. Zhao et al. “Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks.” 2 December 2025.
