RealSec-bench: Secure Code Generation Benchmark
- RealSec-bench is a benchmark that measures the secure code generation capabilities of LLMs using vulnerability tasks drawn from real-world Java repositories.
- It employs a multi-stage pipeline combining static analysis, LLM triage, and human expert validation to curate 105 vulnerability-centric tasks across 19 CWE types.
- Its SecurePass@K metric jointly quantifies functional correctness and vulnerability mitigation, revealing a significant performance gap in current models.
RealSec-bench is a benchmark designed to rigorously evaluate the secure code generation capabilities of LLMs in the context of real-world software repositories. Centered on high-risk, open-source Java projects, it addresses the deficiencies of prior benchmarks that rely on synthetic vulnerabilities and ignore real data-flow complexity. RealSec-bench combines systematic static analysis, LLM-based triage, and human expert validation to provide 105 vulnerability-centric tasks, each grounded in actual repository context and encompassing 19 Common Weakness Enumeration (CWE) types. The benchmark introduces the SecurePass@K metric, which jointly quantifies functional correctness and vulnerability mitigation, exposing a persistent gap between functionally correct and security-compliant code generated by modern LLMs (Wang et al., 30 Jan 2026).
1. Construction Pipeline and Data Sources
RealSec-bench employs a multi-stage, two-phase pipeline to curate tasks from GitHub’s most prominent Java repositories. The process begins with GitHub API queries filtering for the top 4,000 starred projects, further narrowed to Maven-based builds and stratified by topic to ensure diversity. CodeQL is used for static application security testing (SAST) in high-recall mode, ranking repositories by frequency of vulnerability alerts. Manual curation then ensures broad CWE-type coverage, yielding 532 candidate repositories and over 20,000 raw findings.
The vulnerability instance creation phase refines these findings:
- Attribute filtering: Only functions with at least one existing unit test are retained, enabling a “fail-to-pass” evaluation design.
- False positive elimination: GPT-4.1 labels potential vulnerabilities as true or false positive, assigns CWE types, and outputs are independently reviewed by two security engineers; disagreements are resolved collaboratively.
- Standardization: GPT-4.1 rewrites all task docstrings into security-neutral Javadoc descriptions, subsequently validated by professional programmers and security experts to eliminate leakage of vulnerability specifics.
- Final human review: Security experts ensure docstring fidelity to intended functionality, absent any hints of underlying flaws.
This pipeline results in a high-fidelity set of 105 executable tasks, each situated within realistic repository context and distributed across 30 source repositories (Wang et al., 30 Jan 2026).
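The attribute-filtering step of the pipeline can be sketched as follows. This is a minimal illustration; `retain_testable_functions` and the `test_index` mapping are hypothetical names standing in for the real Maven-derived test coverage data used by the benchmark.

```python
def retain_testable_functions(findings, test_index):
    """Keep only SAST findings whose flagged function is covered by
    at least one existing unit test, enabling the 'fail-to-pass'
    evaluation design described above.

    findings:   list of dicts, each with a "function" key
                (a hypothetical finding schema).
    test_index: dict mapping function name -> list of covering
                unit tests (hypothetical structure).
    """
    return [f for f in findings if test_index.get(f["function"], [])]
```

Findings whose function has an empty (or absent) test list are dropped, since a fix for them could never be validated by re-running existing tests.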
2. Dataset Composition and Vulnerability Spectrum
The benchmark is composed of 105 tasks, each exemplifying vulnerabilities from 19 CWEs. The dominant category is log injection (CWE-117), present in 56.2% of instances; broken or risky cryptography (CWE-327) and CSRF (CWE-352) collectively comprise ~13%. Other complex categories include XXE (CWE-611), deserialization (CWE-502), path traversal (CWE-22), and untrusted search path (CWE-426).
Data-flow complexity is a central feature: 79% of tasks have inter-procedural taint propagation depths of 0–3 hops (including 35.2% with direct, zero-hop flows), while 21% require propagation tracking across up to 34 method calls. For example, a deserialization vulnerability (CWE-502) may require a model to trace a tainted object through several helper methods before detection is possible. This exposure of high-hop, inter-procedural dependencies surpasses the simplistic scenarios predominant in prior benchmarks (Wang et al., 30 Jan 2026).
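The hop-depth notion can be illustrated with a breadth-first search over a toy call graph. This is a sketch only: the method names are invented, and the benchmark itself derives these depths from CodeQL data-flow paths rather than plain call-graph reachability.

```python
from collections import deque

def taint_depth(call_graph, source, sink):
    """Number of inter-procedural hops from `source` (where tainted
    data enters) to `sink` (where it is consumed) in a call graph
    given as {caller: [callees, ...]}. Returns None if unreachable."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == sink:
            return hops  # 0 means a direct, zero-hop flow
        for callee in call_graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, hops + 1))
    return None
```

For a CWE-117-style flow such as `handleRequest -> parseInput -> logMessage`, the tainted request reaches the logging sink only after two hops, so any model fixing the sink must reason across all intermediate methods.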
3. SecurePass@K Metric and Evaluation Methodology
RealSec-bench introduces SecurePass@K, a composite metric designed to jointly assess both functionality and security. For $n$ generated samples per task, SecurePass@K is defined as:

$$\mathrm{SecurePass@}K = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{K}}{\binom{n}{K}}\right], \qquad c = \sum_{i=1}^{n} \mathbb{1}\left[\mathrm{func}(s_i) \wedge \mathrm{sec}(s_i)\right],$$

where $\mathrm{func}(s_i)$ is true if sample $s_i$ passes all unit tests, and $\mathrm{sec}(s_i)$ if it survives a two-stage security adjudication:
- CodeQL scan: Flags candidate vulnerabilities. If no alerts are raised, the sample is deemed secure.
- LLM-based panel adjudication: If CodeQL raises an alert, a “voter” panel of three LLMs analyzes the code, and a “final judge” LLM synthesizes the majority vote into a definitive verdict.
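The two-stage adjudication can be sketched as below. This is a simplified stand-in: the real “final judge” is itself an LLM that synthesizes the voters' analyses, approximated here by a plain majority tally.

```python
from collections import Counter

def adjudicate(codeql_alerts, voter_verdicts):
    """Two-stage security verdict for one generated sample.

    codeql_alerts:  list of alert identifiers from a CodeQL scan.
    voter_verdicts: verdicts ("secure"/"vulnerable") from the
                    three-LLM voter panel.
    """
    # Stage 1: no CodeQL alerts -> sample is deemed secure outright.
    if not codeql_alerts:
        return "secure"
    # Stage 2: majority vote over the panel (a tally standing in
    # for the final-judge LLM's synthesis).
    return Counter(voter_verdicts).most_common(1)[0][0]
```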
Thus, SecurePass@K captures the empirical joint probability that an LLM produces a functionally correct, vulnerability-free solution within K attempts (Wang et al., 30 Jan 2026).
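Per task, this probability can be computed with the standard unbiased Pass@K-style estimator, assuming n generated samples of which c are both functionally correct and adjudicated secure (a minimal sketch of the combinatorial form):

```python
from math import comb

def secure_pass_at_k(n, c, k):
    """Unbiased per-task estimator: probability that at least one of
    k draws (without replacement) from n samples is both functionally
    correct and secure, given c such samples exist."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a good sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is both correct and secure, a single draw succeeds half the time.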
4. Experimental Protocol and Results
RealSec-bench evaluates five prominent LLMs (gpt-4.1, gpt-4.1-mini, Claude-3.7-Sonnet, Deepseek-V3, and Qwen3-235B), each run at temperature 0.7, top-p 1.0, and a context window of 4096 tokens. Three principal prompting strategies are tested:
- Baseline (Origin): One-shot prompt, devoid of security hints.
- Retrieval-Augmented Generation (RAG): Contextual augmentation via sparse retrieval (BM25), dense retrieval (RLCoder), and CodeQL-derived data-flow paths.
- Security-guideline prompting: Embeds five OWASP-inspired directives (input validation, least-privilege, vetted cryptography, safe logging, secure configs).
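The security-guideline strategy can be sketched as a prompt-construction step. The five directive texts below are paraphrases: the paper names the OWASP-inspired categories but not their exact wording, and `build_secure_prompt` is a hypothetical helper name.

```python
# Paraphrased directive texts (assumed wording, not the paper's).
GUIDELINES = [
    "Validate and sanitize all external input.",
    "Apply the principle of least privilege.",
    "Use vetted cryptographic APIs; never roll your own.",
    "Neutralize user-controlled data before logging it.",
    "Prefer secure defaults in all configuration.",
]

def build_secure_prompt(task_docstring):
    """Append the five security directives to a task's
    security-neutral Javadoc-derived description."""
    bullets = "\n".join(f"- {g}" for g in GUIDELINES)
    return (f"{task_docstring}\n\n"
            f"Follow these security guidelines:\n{bullets}")
```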
Quantitative results reveal fundamental difficulties. Across models, mean Pass@1 is approximately 12.95%, Secure@1 is 6.28%, and SecurePass@1 is only 5.14%. Even with K = 5, SecurePass@5 remains under 7%. The best single-model SecurePass@1 observed was 4.76% (Claude-3.7-Sonnet baseline), rising to 7.62% under dense retrieval. Notably, cryptographic and concurrency-related tasks yielded SecurePass@1 ≈ 0% for all models (Wang et al., 30 Jan 2026).
| Prompting strategy | Pass@1 (%) | Secure@1 (%) | SecurePass@1 (%) |
|---|---|---|---|
| Baseline (avg. across models) | 12.95 | 6.28 | 5.14 |
| RAG (BM25/dense avg.) | ~19 | negligible gain | negligible gain |
| Security-guideline | Variable | Variable | Generally reduced |
The data demonstrate that while retrieval-based augmentation (e.g., BM25, dense retrievers) increases Pass@1 rates (functional correctness), it fails to lead to tangible security improvements as measured by SecurePass@K. Security-specific prompting slightly improves SecurePass@1 on some models but can sharply reduce overall Pass@1 due to compilation failures and over-constrained code generation (Wang et al., 30 Jan 2026).
5. Key Findings: Security-Functionality Gap and Model Shortcomings
A prominent feature of RealSec-bench is the pronounced gap between functional and secure code generation. RAG reliably improves functional correctness (Pass@1), pushing average rates to ~19%, but confers negligible or zero benefit to SecurePass@1. Even exposure to precise data-flow paths (“oracle” retrievers) does not materially improve vulnerability mitigation rates.
Embedding generalized security best practices as prompt instructions often causes models to generate uncompilable or overly restrictive code, lowering Pass@1 while having no consistent effect on security outcomes. The result is a persistent, quantitative bifurcation: no model exceeds 8% SecurePass@1 in any configuration. On tasks demanding deep domain expertise, such as those requiring correct cryptographic usage or multi-hop taint tracking, all models perform at floor levels (Wang et al., 30 Jan 2026).
6. Implications and Research Directions
RealSec-bench exposes the dual challenge for LLMs—producing code that is both functionally correct and robustly free of subtle, real-world vulnerabilities. The complexity of real repository contexts, particularly multi-hop, inter-procedural dependency chains, causes most models to fail in ways not observable in prior, synthetic benchmarks. This suggests that straightforward scaling of current LLM methodologies or prompt engineering will not solve the joint requirement for security and correctness in realistic settings.
A plausible implication is that new architectures or training protocols emphasizing security-correctness co-optimization and inter-procedural reasoning are required. The benchmark and its associated methodology are poised to catalyze longitudinal research tracking improvements in secure code generation, the development of advanced retrieval or analysis pipelines, and creation of more robust evaluation criteria tailored to high-assurance software domains (Wang et al., 30 Jan 2026).