
Code Security Benchmarking

Updated 26 December 2025
  • Code security benchmarking is the systematic evaluation of code generation systems, measuring their propensity to produce vulnerable code by combining static analysis, dynamic testing, and semantic evaluation.
  • Benchmarks leverage curated datasets mapped to CWE and OWASP taxonomies for robust coverage across languages and realistic attack scenarios, and evaluate generated code with tools such as Semgrep and CodeQL.
  • Empirical findings show high failure rates in secure code generation and underline the need for continuous integration, automated benchmark generation, and multi-dimensional evaluation metrics.

Code security benchmarking is the systematic evaluation of code generation systems—particularly LLMs and coding agents—specifically with respect to their ability to produce, repair, discriminate, and reason about software vulnerabilities. Unlike traditional code evaluation benchmarks that primarily target functional correctness, code security benchmarking surfaces the security defects latent in generated code, quantifies susceptibility to known vulnerability categories, and provides comparable metrics across languages, threat surfaces, and model classes. Modern frameworks combine curated datasets, dynamic and static test oracles, multi-dimensional evaluation methods, and rigorous reporting standards to reveal persistent weaknesses and guide both model alignment and secure software development workflows.

1. Vulnerability Taxonomies and Benchmark Dataset Design

Recent benchmarks adopt vulnerability taxonomies centered on the Common Weakness Enumeration (CWE) and OWASP Top-10, structuring tasks around representative classes such as injection flaws (CWE-89, CWE-78, CWE-79), memory safety violations (CWE-119, CWE-416, CWE-787), misconfigurations (CWE-494, CWE-1104), and credential management errors (CWE-259, CWE-798). For example, SafeGenBench targets 44 CWEs spanning eight categories, each mapped to realistic single-function tasks derived from production development scenarios and reviewed by domain experts (Li et al., 6 Jun 2025). Similarly, SecCodePLT uses a smaller but dynamically extensible set—27 CWEs mapped to security-critical Python tasks paired with functionality and vulnerability test suites generated through a combined expert+mutation pipeline (Yang et al., 14 Oct 2024). Repository-level benchmarks such as SecRepoBench (Dilgren et al., 29 Apr 2025) and A.S.E (Lian et al., 25 Aug 2025) source tasks from real-world CVEs and vulnerability-inducing commits, preserving full repository, build, and dependency context to accurately model genuine attack surfaces.
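Although the datasets above differ in provenance and scale, most reduce to a per-task record that ties a concrete coding task to its ground-truth weakness classes and its evaluation oracles. The Python sketch below illustrates that shape; the class and field names (SecurityBenchmarkTask, cwe_ids, security_tests, and so on) are illustrative assumptions, not the schema of any cited benchmark.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskForm(str, Enum):
    COMPLETION = "completion"   # fill in a function or file
    REPAIR = "repair"           # fix a known-vulnerable snippet
    PATCH = "patch"             # produce a repository-level patch


@dataclass
class SecurityBenchmarkTask:
    """One benchmark item tying a coding task to its ground-truth weakness."""
    task_id: str
    cwe_ids: list[str]                  # e.g. ["CWE-89"] for SQL injection
    owasp_category: str | None          # e.g. "A03:2021-Injection", if applicable
    language: str                       # "python", "c", "java", ...
    form: TaskForm
    prompt: str                         # natural-language instruction plus code context
    repo_context: str | None = None     # repo snapshot / commit for repository-level tasks
    functional_tests: list[str] = field(default_factory=list)  # unit-test paths
    security_tests: list[str] = field(default_factory=list)    # exploit / PoC harnesses


# A hypothetical injection task expressed in this schema:
example = SecurityBenchmarkTask(
    task_id="demo-0001",
    cwe_ids=["CWE-89"],
    owasp_category="A03:2021-Injection",
    language="python",
    form=TaskForm.COMPLETION,
    prompt="Implement lookup_user(conn, username) that queries the users table.",
    functional_tests=["tests/test_lookup_user.py"],
    security_tests=["tests/test_sqli_payloads.py"],
)
```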

Modern dataset construction methods reflect several principles:

  • Breadth and depth: Large-scale coverage over many CWE classes, languages (Python, C, C++, Java, Go, PHP, etc.), and task forms (code completion, repair, patching), with each case designed for high fidelity to its ground-truth vulnerability.
  • Adversarial and unbiased selection: Inclusion of real OSS-Fuzz cases, CVEs, and mutated tasks to ensure models are stress-tested beyond synthetic prompts.
  • Multi-turn and repository context: MT-Sec and SecureAgentBench demonstrate that multi-turn (dialogue, patch-refinement) and repo-wide (multi-file, build-system) scenarios substantially increase task realism and difficulty (Rawal et al., 13 Oct 2025, Chen et al., 26 Sep 2025).

2. Evaluation Frameworks: From Static Analysis to Dynamic Oracles

Contemporary code security benchmarks employ hybrid evaluation pipelines leveraging both static and dynamic methods:

  • Static Analysis: Tools such as Semgrep, CodeQL, Weggli, SonarQube, Snyk Code, and Joern are widely integrated. For instance, SafeGenBench uses Semgrep for language-wide CWE matching, mapping severity levels to binary security scores (Li et al., 6 Jun 2025). CASTLE employs ground-truth line and CWE annotation for micro-benchmark evaluation of LLMs and traditional analyzers (Dubniczky et al., 12 Mar 2025).
  • Dynamic Testing: Execution-based oracles, as used in SecCodePLT, CWEval, SEC-bench, and DUALGAUGE, evaluate generated code on functional and adversarial security test suites; code is executed in sandboxes with unit tests, exploit harnesses, AddressSanitizer, and PoC payloads (Yang et al., 14 Oct 2024, Peng et al., 14 Jan 2025, Lee et al., 13 Jun 2025, Pathak et al., 24 Nov 2025).
  • LLM-based Judging: LLMs serve as semantic evaluators for ambiguous cases, catching subtle architectural or data-flow vulnerabilities beyond the reach of static patterns (e.g., SafeGenBench’s DeepSeek-R1 judge) (Li et al., 6 Jun 2025).

Evaluation mechanics typically enforce a “fail-closed” posture—a code candidate is classified as secure only if all judges, static and semantic, pass it. Repository-level benchmarks containerize their evaluation environments (with pinned toolchain and analyzer versions) to guarantee reproducibility and stability across repeated runs (Lian et al., 25 Aug 2025, Dilgren et al., 29 Apr 2025).
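A minimal sketch of this fail-closed composition is shown below, assuming the Semgrep CLI and pytest are available; run_semgrep wraps Semgrep's JSON output, run_security_tests stands in for a sandboxed dynamic oracle, and the optional llm_judge callable represents a semantic judge. The function names and the bare subprocess calls are illustrative simplifications, not the pipeline of any specific benchmark.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Optional


def run_semgrep(target: Path, rules: str = "auto") -> list[dict]:
    """Static judge: collect Semgrep findings for the candidate code."""
    proc = subprocess.run(
        ["semgrep", "scan", "--config", rules, "--json", str(target)],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(proc.stdout or "{}")
    return report.get("results", [])


def run_security_tests(target: Path, test_dir: str = "security_tests") -> bool:
    """Dynamic judge: execute exploit/PoC test suites against the candidate.

    A real benchmark would run these in a sandbox (containers, timeouts,
    sanitizers); shelling out to pytest here is only an illustration.
    """
    proc = subprocess.run(
        ["pytest", "-q", test_dir],
        cwd=target, capture_output=True, text=True, check=False,
    )
    return proc.returncode == 0


def is_secure(
    target: Path,
    llm_judge: Optional[Callable[[Path], bool]] = None,
) -> bool:
    """Fail-closed verdict: the candidate is secure only if every judge passes it."""
    if run_semgrep(target):               # any static finding -> insecure
        return False
    if not run_security_tests(target):    # any failed exploit harness -> insecure
        return False
    if llm_judge is not None and not llm_judge(target):
        return False
    return True
```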

3. Core Metrics and Analytical Formulations

Security benchmarking has converged on a set of quantitative and density-style metrics, often expressed in formal notation:

  • Classical Metrics: Precision, recall, and F1-score computed over true/false positives/negatives, benchmarked against static/dynamic oracles:

\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

  • Joint Functionality/Security Metrics: pass@k and its security-aware counterpart func-sec@k, the latter capturing the probability that at least one of k generations passes both the functional and the security test suites (Peng et al., 14 Jan 2025, Pathak et al., 24 Nov 2025).
  • Vulnerability Density: Number of vulnerabilities per thousand lines of generated code (KLOC):

D_{\mathrm{vuln}} = \frac{\text{Number of vulnerable samples}}{\text{Total lines of code (thousands)}}

  • Success Rates in Multi-dimensional Tasks:

    • Secure-and-correct rate (R_{\mathrm{cs}}), functional correctness rate (R_{\mathrm{func}}), and pure security:

    R_{\mathrm{cs}} = \frac{1}{N} \sum_{i=1}^{N} (f_i \cdot s_i)

    where f_i and s_i are per-task indicators of functionality and security (Chen et al., 26 Sep 2025).

  • Category- and context-aware indices: Security Vulnerability Detection Rate (SVDR), Patch Correctness Rate (PCR), Build Success Rate (BSR), and Generation Stability Index (GSI), as instantiated in A.S.E for repository-level assessment (Lian et al., 25 Aug 2025).

Metrics are often extended with specialized scores that capture the trade-off between security and usability, e.g., secure@k vs. vulnerable@k (Siddiq et al., 2023), CASTLE Score with severity bonuses, and density/risk-normalized statistics (Dubniczky et al., 12 Mar 2025). Multi-turn and agent benchmarks explicitly capture transitions across interactive steps using joint test suites and per-turn security tracking (Rawal et al., 13 Oct 2025).
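The sketch below computes a few of these quantities in Python from per-generation outcomes; the function names and the exact aggregation of secure-pass@k are illustrative assumptions layered on the standard unbiased pass@k estimator, not the reference implementation of any cited benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    passes, given n generations of which c pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def secure_pass_at_k(results: list[tuple[bool, bool]], k: int) -> float:
    """results[i] = (functional_ok, security_ok) for the i-th generation of one task."""
    n = len(results)
    c = sum(1 for f, s in results if f and s)   # count generations passing both suites
    return pass_at_k(n, c, k)


def vulnerability_density(num_vulnerable: int, total_loc: int) -> float:
    """Vulnerabilities per thousand lines of generated code (KLOC)."""
    return num_vulnerable / (total_loc / 1000)


def secure_and_correct_rate(per_task: list[tuple[bool, bool]]) -> float:
    """R_cs = (1/N) * sum(f_i * s_i) over tasks, with boolean indicators."""
    return sum(f and s for f, s in per_task) / len(per_task)
```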

4. Empirical Findings: Limits and Patterns in LLM Code Security

Key empirical trends distilled from recent benchmarks:

  • High rates of security failure: Even the best LLMs fail to produce secure code in a majority of scenarios. For example, SafeGenBench observes only 37.4% overall security accuracy in zero-shot (over 60% insecure) (Li et al., 6 Jun 2025); DUALGAUGE secure-pass@1 never exceeds 12% for top models (Pathak et al., 24 Nov 2025); SecRepoBench’s best secure-pass@1 is 28%, a 30-point drop relative to simpler benchmarks (Dilgren et al., 29 Apr 2025).
  • Category disparities: Memory safety is best handled (up to 76% secure in SafeGenBench), while configuration, deserialization, and hard-coded secret classes yield the lowest secure output rates.
  • Model size does not guarantee security: Lighter models sometimes outperform larger ones (Qwen3-8B > CodeLlama-70B on OSS-Bench) (Jiang et al., 18 May 2025).
  • Agentic scaffolds have limited benefit in complex/multi-turn settings: Multi-step agent orchestration (Aider, OpenHands, SWE-agent) provides clear gains in single-turn, but these are diminished or even reversed in repository/multi-turn (MT-Sec) settings due to error accumulation and cross-turn dependency breakdowns (Rawal et al., 13 Oct 2025, Chen et al., 26 Sep 2025, Lee et al., 13 Jun 2025).
  • Prompt engineering offers modest but bounded improvements: Security-focused prefix/suffix prompts and policy reminders raise secure code rates in simple settings (up to +20–30 pp), but have diminishing returns in repository-level benchmarks (Bruni et al., 9 Feb 2025, Dilgren et al., 29 Apr 2025).

Common failure modes are systematic: omission of crucial checks (e.g., missing integrity checks, residual hard-coded credentials), failure to propagate or update security invariants across edits, over- or under-sanitization (leading to broken or insecure programs), and hallucinated or incomplete patches.
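As a concrete, hypothetical illustration of two of these patterns (a residual hard-coded credential, CWE-798, and missing sanitization enabling SQL injection, CWE-89), the snippet below contrasts the kind of output these benchmarks flag with a secure counterpart; it is our own minimal example, not drawn from any cited dataset.

```python
import os
import sqlite3

# Patterns benchmarks repeatedly flag: a residual hard-coded secret (CWE-798)
# and string-built SQL that enables injection (CWE-89).
API_KEY = "sk-test-12345"                        # hard-coded credential


def lookup_user_insecure(conn: sqlite3.Connection, username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"   # injectable
    return conn.execute(query).fetchall()


# Secure counterparts: the secret is pulled from the environment at runtime,
# and a parameterized query keeps user data out of the SQL grammar.
def get_api_key() -> str:
    return os.environ["API_KEY"]


def lookup_user_secure(conn: sqlite3.Connection, username: str):
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()
```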

5. Benchmark Construction Automation and Scalability

The field is progressing from labor-intensive, static challenge sets to automated benchmark generation. Systems such as AutoBaxBuilder synthesize novel tasks, unit tests, and executable exploits via LLM orchestration loops with plausibility checks and self-critique steps (Arx et al., 24 Dec 2025). This reduces the time and cost of benchmark construction by more than an order of magnitude, with empirical agreement (∼81% precision/recall) relative to human-authored baselines, and the ability to continually expand or increase difficulty.
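The generate-and-validate loop at the core of such systems can be sketched as follows; the callable names (propose_task, build_exploit, passes_plausibility_checks, self_critique) and the control flow are our own simplification of the pattern described for AutoBaxBuilder, not its actual implementation.

```python
from typing import Callable, Optional


def generate_benchmark_tasks(
    propose_task: Callable[[], dict],                         # LLM drafts a task + unit tests
    build_exploit: Callable[[dict], Optional[str]],           # LLM drafts an executable exploit
    passes_plausibility_checks: Callable[[dict, str], bool],  # compile/run/oracle validation
    self_critique: Callable[[dict, str], dict],               # LLM revises a weak draft
    target_count: int,
    max_attempts_per_task: int = 3,
) -> list[dict]:
    """Generate-validate loop in the spirit of automated benchmark builders:
    a candidate is kept only if its exploit actually runs and all checks agree."""
    accepted: list[dict] = []
    total_budget = target_count * 10          # guard against endless rejection
    attempts = 0
    while len(accepted) < target_count and attempts < total_budget:
        attempts += 1
        candidate = propose_task()
        for _ in range(max_attempts_per_task):
            exploit = build_exploit(candidate)
            if exploit and passes_plausibility_checks(candidate, exploit):
                candidate["exploit"] = exploit
                accepted.append(candidate)
                break
            candidate = self_critique(candidate, exploit or "")
    return accepted
```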

Benchmark generators built on real codebases (OSS-Bench, SecRepoBench, A.S.E, SecureAgentBench) drive live task creation from daily OSS commits, CVE patches, and fuzzed/crash traces. Exploit and oracle generation, however, remains an open challenge for complex multi-file and configuration-driven vulnerabilities (Jiang et al., 18 May 2025, Dilgren et al., 29 Apr 2025, Lian et al., 25 Aug 2025, Chen et al., 26 Sep 2025, Arx et al., 24 Dec 2025).

6. Best Practices, Integration, and Outlook

Based on recent frameworks and ablation studies, best practices for code security benchmarking include:

  • Combining static analyzers, execution-based oracles, and LLM judges under a fail-closed aggregation rule.
  • Mapping every task to explicit CWE/OWASP categories and reporting per-category results alongside aggregate scores.
  • Scoring functionality and security jointly (e.g., func-sec@k, secure-pass@k) rather than in isolation.
  • Preserving repository, build, and dependency context, and pinning toolchain and analyzer versions in containerized environments for reproducibility.
  • Refreshing benchmarks through automated task generation to continually expand coverage and difficulty.
  • Wiring benchmark runs into continuous integration so that regressions in secure-generation rates surface early (a minimal gate is sketched below).
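A minimal sketch of such a CI gate, assuming a benchmark harness that writes per-task functional/security outcomes to a JSON report; the file name, report format, and threshold are illustrative assumptions, not part of any cited framework.

```python
import json
import sys
from pathlib import Path

THRESHOLD = 0.30                                   # minimum acceptable secure-and-correct rate
REPORT = Path("security_benchmark_report.json")    # [{"functional": true, "secure": false}, ...]


def main() -> int:
    results = json.loads(REPORT.read_text())
    rate = sum(1 for r in results if r["functional"] and r["secure"]) / len(results)
    print(f"secure-and-correct rate: {rate:.1%} (threshold {THRESHOLD:.0%})")
    return 0 if rate >= THRESHOLD else 1           # non-zero exit fails the CI job


if __name__ == "__main__":
    sys.exit(main())
```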

The field’s main open challenges are in expanding benchmarks to new languages, scaling automated oracle/test generation, supporting dynamic security requirement extraction from CVEs, and integrating agentic code reasoning and context retrieval for long-horizon, large-repo settings.


References:

(For comprehensive metadata and methodology, see the cited arXiv documents.)
