SusVibes: Security Benchmark for LLM Code
- SusVibes is a large-scale benchmark that evaluates LLM-generated code on both functional correctness and security in realistic scenarios.
- It compiles 200 feature-request tasks from real-world Python repositories, covering 77 distinct CWE vulnerabilities.
- Findings reveal that over 80% of functionally correct patches remain vulnerable, underscoring the need for security-focused training and analysis.
SusVibes is a large-scale benchmark and evaluation framework for assessing the security and correctness of code written by LLM coding agents in realistic feature-request scenarios. Designed to address the acute problem of vulnerabilities in agent-generated software, SusVibes systematically measures both functional correctness and security guarantees across a curated suite of tasks derived from real-world open-source repositories. It was introduced in "Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks" (Zhao et al., 2 Dec 2025).
1. Benchmark Construction and Dataset Properties
SusVibes comprises 200 feature-request tasks mined from real-world Python repositories. Each task specifically reconstructs scenarios in which an original human-authored implementation was later repaired due to a documented vulnerability. The construction pipeline proceeds via four phases:
- Vulnerability Commit Selection: Security-fixing commits (~20,000, filtered to Python ≥ 3.7 and to commits that modify or add test cases) were collected from the ReposVul and MoreFixes datasets. After enforcing the presence of test-modifying diffs, ~3,000 candidates remained; a diverse, domain-representative subset of 200 tasks was selected, spanning 108 projects across 10 application domains.
- Security-Test Harvesting: The security-fix commit was examined for newly added tests, which directly became the task’s security assertion suite.
- Functional-Test and Mask Generation: The commit immediately preceding the fix, still containing the vulnerability, provided the functional test suite. The region of code altered in the fix was masked via automated deletion patches, yielding a “blank” repository.
- Task Description Synthesis and Verification: A separate LLM agent constructed user-facing, GitHub-style feature requests devoid of hints from the ground-truth patch. Completeness of the mask and task description was validated by ensuring that:
  - the masked repository fails both the security and functional suites;
  - the vulnerable commit passes only the functional suite;
  - the security-fix commit passes both.
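The three verification conditions amount to a pass/fail matrix over the three repository states. The sketch below uses hypothetical names (`validate_task` and the state keys are illustrative, not from the paper):

```python
# Sketch of SusVibes' task-validation check (names are illustrative).
# Each repository state is summarized by a pair
# (passes_functional_suite, passes_security_suite).

def validate_task(results):
    """results maps a repo state to its (functional, security) pass pattern."""
    return (
        results["masked"] == (False, False)         # blank repo fails both suites
        and results["vulnerable"] == (True, False)  # pre-fix passes only functional
        and results["fixed"] == (True, True)        # security fix passes both
    )

well_formed = {
    "masked": (False, False),
    "vulnerable": (True, False),
    "fixed": (True, True),
}
print(validate_task(well_formed))  # True
```

Any other pattern, such as a vulnerable commit that already passes the security suite, marks the task as mis-masked and rejects it.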
The final suite covers 77 unique MITRE CWEs. The breakdown of leading vulnerability categories is as follows:
| CWE Class | Count | Percentage |
|---|---|---|
| Improper Input Validation (CWE-20) | 22 | 11 % |
| Injection Flaws (CWE-79/89/78) | 18 | 9 % |
| Auth Bypass (CWE-285/287) | 27 | 13.5 % |
| Open Redirects (CWE-601) | 10 | 5 % |
| Crypto Issues (CWE-327/etc.) | 16 | 8 % |
| Side Channel (CWE-208) | 14 | 7 % |
| Path Traversal (CWE-22/200) | 20 | 10 % |
| Other (62 CWEs) | 44 | ~22 % |
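To make these categories concrete, the snippet below illustrates a CWE-22 path traversal flaw and its typical fix. It is a generic example, not code from the benchmark; `BASE_DIR` and the function names are invented:

```python
import os

BASE_DIR = "/srv/app/uploads"  # hypothetical upload root

def resolve_vulnerable(name):
    # CWE-22: a user-supplied name like "../../etc/passwd" escapes BASE_DIR.
    return os.path.join(BASE_DIR, name)

def resolve_fixed(name):
    # Typical fix pattern: canonicalize the joined path, then verify containment.
    path = os.path.realpath(os.path.join(BASE_DIR, name))
    if not path.startswith(os.path.realpath(BASE_DIR) + os.sep):
        raise ValueError("path escapes the upload directory")
    return path
```

The vulnerable variant happily joins `"../../etc/passwd"` into the upload root; the fixed variant canonicalizes the path and rejects anything that resolves outside it.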
2. Evaluation Methodology and Metrics
Evaluations in SusVibes are strictly execution-based. Each task requires the LLM code agent to generate a patch implementing the desired feature on a masked repository. Two key metrics are defined:
- Functional Correctness Rate: CR = N_func / N, where N_func is the number of patches that pass the task’s functional test suite and N = 200 is the total number of tasks.
- Security Success Rate: SR = N_sec / N, where N_sec is the number of patches that pass both the functional and the security test suites.
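A minimal way to compute the two rates from per-task outcomes (data and function names are illustrative):

```python
# Sketch of SusVibes' two metrics over per-task outcomes (illustrative data).
# Each record is (passes_functional_suite, passes_security_suite).

def correctness_rate(outcomes):
    return sum(f for f, _ in outcomes) / len(outcomes)

def security_success_rate(outcomes):
    # A patch counts as secure only if it also passes the functional suite.
    return sum(f and s for f, s in outcomes) / len(outcomes)

outcomes = [(True, True), (True, False), (False, False), (True, False)]
print(correctness_rate(outcomes))       # 0.75
print(security_success_rate(outcomes))  # 0.25
```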
Tasks are provided as Dockerized projects for reproducibility; all agents interact with the codebase via automated workflows that permit reading files, editing code, and running validations. No static analyzers are employed; only dynamic test suites are considered for pass/fail status.
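The execution-based protocol can be sketched as follows; the image name, suite paths, and helper names are assumptions, not the benchmark's actual layout:

```python
import subprocess

def run_suite(image, suite_path, runner=subprocess.run):
    """Run one pytest suite inside the task's Docker image; only pass/fail counts."""
    proc = runner(
        ["docker", "run", "--rm", image, "pytest", "-q", suite_path],
        capture_output=True, text=True,
    )
    return proc.returncode == 0  # the dynamic result is the sole verdict

def evaluate_patch(image, runner=subprocess.run):
    # Suite paths are illustrative; SusVibes ships per-task functional and
    # security suites harvested from the real repository history.
    return {
        "functional": run_suite(image, "tests/functional", runner),
        "security": run_suite(image, "tests/security", runner),
    }
```

Injecting the `runner` keeps the verdict logic testable without Docker; in the real harness only the container's exit status matters, since no static analyzers participate in scoring.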
3. Agent Frameworks and Experimental Protocol
Three agentic frameworks and three LLM models were benchmarked:
- Agent Frameworks:
- SWE-Agent (Python-oriented, multi-turn)
- OpenHands (agentic LLM wrapper)
- Claude Code (Anthropic’s standalone coding agent)
- LLMs:
- Claude 4 Sonnet
- Kimi K2
- Gemini 2.5 Pro
Each agent receives the feature request, optionally prefixed with a generic security reminder. Up to 200 agent steps (plan, edit, test) are permitted per task.
Functionality and security are assessed solely by test-suite results. Each agent-task run proceeds independently within the 200-step budget, which the agent may spend on iterative refinement. CWE annotations or hints are intentionally withheld except under the explicit augmentation strategies described below.
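The protocol reduces to a bounded interaction loop; `agent_step` and `run_tests` below are hypothetical stand-ins for the real frameworks (SWE-Agent, OpenHands, Claude Code):

```python
# Illustrative sketch of the experimental protocol: up to 200 plan/edit/test
# steps per task, with only dynamic test results determining the outcome.

MAX_STEPS = 200

def solve_task(agent_step, run_tests):
    for step in range(MAX_STEPS):
        action = agent_step(step)  # plan, read/edit files, or run validations
        if action == "submit":
            break
    return run_tests()  # final verdict comes solely from the test suites

# Toy usage: an "agent" that submits on its first step against stub suites.
result = solve_task(lambda step: "submit",
                    lambda: {"functional": True, "security": False})
print(result)  # {'functional': True, 'security': False}
```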
4. Results: Functional and Security Performance
Quantitative findings reveal a persistent, high-severity gap between functionality and security in LLM-agent–generated patches. Nine agent × LLM combinations (three frameworks × three models) were evaluated on the 200 tasks; headline results include (Zhao et al., 2 Dec 2025):

| LLM | SWE-Agent CR (%) | SWE-Agent SR (%) | OpenHands CR (%) |
|---|---|---|---|
| Claude 4 Sonnet | 61.0 | 10.5 | 49.5 |
| Kimi K2 | 22.5 | 6.0 | 37.0 |
| Gemini 2.5 Pro | 19.5 | 7.0 | 21.5 |
Key observations:
- The top-performing configuration (SWE-Agent + Claude 4 Sonnet) reaches CR = 61.0% but only SR = 10.5%.
- Over 80% of functionally correct implementations remain insecure.
- No LLM–agent pairings achieve secure-patch rates above 12.5%, demonstrating the pervasiveness of vulnerability risks.
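The “over 80%” claim follows directly from the headline numbers: among functionally correct patches, only the fraction SR/CR also passes the security suite.

```python
# Fraction of functionally correct patches that remain vulnerable, using the
# best configuration's headline numbers (SWE-Agent + Claude 4 Sonnet).
cr, sr = 0.610, 0.105
insecure_given_correct = 1 - sr / cr
print(f"{insecure_given_correct:.1%}")  # 82.8%
```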
5. Security Augmentation Strategies and Outcomes
SusVibes experimentally evaluates two naive security-augmentation strategies:
- Self-selection CWE: Agents select relevant CWEs for the task and are instructed to implement appropriate mitigations.
- Oracle CWE: Agents are directly told with oracle-level precision which CWE is present and are tasked to avoid introducing it.
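The two strategies amount to different prompt prefixes on top of the feature request; the wording below is an illustrative reconstruction, not the paper's exact prompts:

```python
# Hypothetical sketch of the augmentation prompts (wording is invented).

def build_prompt(feature_request, strategy, oracle_cwe=None):
    if strategy == "self-selection":
        extra = ("First identify the CWE weakness classes most relevant to this "
                 "change, then implement mitigations for them.")
    elif strategy == "oracle":
        extra = f"The reference fix addresses {oracle_cwe}; do not introduce this weakness."
    else:  # generic security reminder used in the baseline runs
        extra = "Please write secure code."
    return extra + "\n\n" + feature_request

print(build_prompt("Add an upload endpoint.", "oracle", "CWE-22"))
```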
| Strategy | CR (%) | SR (%) |
|---|---|---|
| Generic | 61.0 | 10.5 |
| Self-selection | 52.5 | 9.5 |
| Oracle CWE | 56.0 | 10.5 |
Both augmentation approaches either degrade security success (self-selection drops SR by 1.0 pp) or fail to improve it (oracle CWE stays at 10.5 %), while functional correctness falls by 8.5 and 5.0 pp respectively. In several cases, agents focus on the labeled CWE at the expense of general functional correctness, without compensating reductions in vulnerability rates. Analysis indicates a high degree of “correct → incorrect” flips counterbalancing any secure recoveries.
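The flip analysis can be reproduced schematically by comparing per-task functional outcomes under the two prompts (the data below is invented for illustration):

```python
# Sketch of the "flip" bookkeeping: per-task functional pass/fail under the
# generic prompt vs. an augmented prompt (illustrative data, not the paper's).

def count_flips(base, augmented):
    return {
        "correct_to_incorrect": sum(b and not a for b, a in zip(base, augmented)),
        "incorrect_to_correct": sum(a and not b for b, a in zip(base, augmented)),
    }

base      = [True, True, False, True, False]   # functional pass, generic prompt
augmented = [False, True, False, True, True]   # functional pass, oracle-CWE prompt
print(count_flips(base, augmented))  # {'correct_to_incorrect': 1, 'incorrect_to_correct': 1}
```

When the two flip counts roughly cancel, the aggregate CR barely moves even though the augmentation perturbs many individual tasks.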
6. Implications, Limitations, and Recommendations
SusVibes demonstrates that LLM coding agents, even with state-of-the-art models and multi-turn feedback, generate patches that are predominantly insecure under realistic development constraints. This security deficit holds across agent architectures and persists despite explicit mitigation hints.
Major recommendations include:
- LLM-generated code should not be deployed in security-sensitive contexts without additional review or automated analysis.
- Security must be promoted to a first-class objective, necessitating integration of dynamic fuzzers, static taint analyzers, and other property-based adversarial techniques into the agent feedback loop.
- Prompting and basic documentation are insufficient; fine-tuning or reinforcement learning approaches rewarding security test successes are needed.
- Benchmark expansion to additional languages and richer vulnerability models is encouraged to drive progress.
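One way to realize the feedback-loop recommendation is to gate patch acceptance on a conjunction of checks rather than the functional suite alone; the checker composition below is a hypothetical sketch, not part of SusVibes:

```python
# Hypothetical security gate for an agent loop: a patch is accepted only if it
# passes the functional tests AND every registered security check (dynamic
# security tests, a taint analyzer, a fuzz pass, ...). Checkers are stubs here.

def accept_patch(passes_functional, security_checks):
    return passes_functional and all(check() for check in security_checks)

checks = [
    lambda: True,  # stand-in: harvested security test suite passed
    lambda: True,  # stand-in: static taint analyzer found no new flows
]
print(accept_patch(True, checks))                    # True
print(accept_patch(True, checks + [lambda: False]))  # False
```

Feeding each checker's failure back to the agent, rather than just a pass/fail bit, is the kind of richer signal the recommendations above call for.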
A plausible implication is that agentic development workflows (“vibe coding”) will remain unsuitable for direct production deployment until security-oriented agent training and real-time vulnerability detection methodologies are significantly advanced.
7. Contextual Significance and Future Directions
SusVibes establishes a rigorously validated, reproducible repository-level evaluation protocol for LLM coding agents and highlights the acute limitations of present state-of-the-art solutions. It provides a reference dataset covering a wide spectrum of software vulnerability categories—77 distinct CWEs—and robustly demonstrates deficiencies in current approaches (Zhao et al., 2 Dec 2025).
Future work is likely to involve:
- Extension to additional programming languages and frameworks.
- Development and systematic incorporation of security-focused training objectives for generative agents.
- Designing adversarial test generation strategies to dynamically probe security-relevant properties.
- Deeper investigation of domain-specific and cross-category vulnerability mitigation in language agents.
By systematically exposing the security/functionality dichotomy in agent-generated code, SusVibes foregrounds the need for the software engineering community to approach adoption of LLM coding assistants with substantial caution and to accelerate research at the intersection of AI-assisted programming and automated security analysis.