
SecCodeBench-V2 Benchmark

Updated 25 February 2026
  • SecCodeBench-V2 is a benchmark that evaluates secure code generation and vulnerability repair through 98 function-level scenarios with and without security hints.
  • It uses dynamic execution in isolated Docker containers, applying functional and security PoC tests with a weighted Pass@K scoring methodology.
  • Built on Alibaba production scenarios, it covers 22 CWE categories across five programming languages with expert-validated, high-fidelity test cases.

SecCodeBench-V2 is a publicly released, execution-driven benchmark for rigorously assessing how well LLMs generate and repair secure code under realistic, industrial-grade conditions. Its scenarios and vulnerabilities are derived from Alibaba Group’s production environments, and it targets both code generation and vulnerability repair across five programming languages, supported by high-fidelity, expert-validated proof-of-concept test cases and a unified evaluation framework grounded in dynamic execution and a standardized scoring methodology (Chen et al., 17 Feb 2026).

1. Benchmark Scope and Scenario Design

SecCodeBench-V2 encompasses 98 function-level scenarios, partitioned equally between two primary task types: secure generation ("gen") and vulnerability repair ("fix"). Each scenario additionally provides hint-augmented variants ("gen-hints", "fix-hints") where explicit security hints are incorporated. The four prompt categories are:

  • gen: Implement a secure function from scratch.
  • gen-hints: Identical to gen, but with an explicit security hint in the prompt.
  • fix: Patch an existing function that contains a known vulnerability.
  • fix-hints: Same as fix, with an additional security hint.

All scenarios adhere to a function-level formulation, requiring the model to fill or patch a designated target function while maintaining fixed signatures, return schemas, and dependency constraints. Each task is embedded in a minimal, language-idiomatic project scaffold—such as Maven with JUnit for Java, pytest for Python, CMake for C/C++, go.mod and Go tests, or npm with Mocha/Jest for Node.js.
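A hypothetical scenario descriptor, sketched in Python below, illustrates how such a function-level task might be specified; all field names and values are illustrative assumptions rather than the benchmark's actual schema:

```python
# Illustrative sketch of a scenario descriptor; the schema is an assumption,
# not the benchmark's released format.
scenario = {
    "id": "java-sqli-001",                  # assumed identifier format
    "task": "gen",                          # one of: gen, gen-hints, fix, fix-hints
    "language": "java",
    "cwe": "CWE-89",
    "severity": "High",                     # Medium | High | Critical
    "target_function": "queryUserByName",   # signature is fixed; the model fills the body
    "scaffold": "maven-junit",              # minimal language-idiomatic project scaffold
    "security_hint": None,                  # populated only for the *-hints variants
}
```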

The benchmark systematically covers 22 distinct CWE categories, including (but not limited to) CWE-89 (SQL injection), CWE-787 (buffer overflow), CWE-611 (XXE), CWE-918 (SSRF), and CWE-502 (insecure deserialization). Each case is annotated with a severity level—Medium, High, or Critical—following CVSS guidelines and double-annotated by security engineers to ensure fidelity and coverage.

Scenario Type | Description                 | Available Variants
gen           | Secure function generation  | gen, gen-hints
fix           | Vulnerability repair        | fix, fix-hints

2. Test Case Authoring and Validation

Ground truth in SecCodeBench-V2 is established through two layers of executable test cases per scenario:

  • Functional tests: Validate implementation correctness regarding the specified I/O behavior, error handling, and other functional requirements.
  • Security PoC tests: Attempt to actively exploit the modeled vulnerability using relevant payloads (e.g., injection strings, overflow triggers, SSRF requests).

All test cases are written within the standard unit-test frameworks of their respective languages and executed at runtime as part of the validation pipeline. The authoring process involves initial creation and subsequent double-review by independent security experts. Coverage audits guarantee the inclusion of representative exploit strategies per vulnerability, while strict anonymization is enforced to mitigate prompt contamination. The dataset thereby maintains broad, realistic coverage with high ground-truth reliability.
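As an illustration of the two layers, the following pytest sketch pairs a functional test with a security PoC for a hypothetical CWE-89 (SQL injection) scenario; the module `solution` and function `query_user_by_name` are assumptions for the example, not benchmark code:

```python
# Sketch of the two test layers for a hypothetical Python CWE-89 scenario.
import sqlite3

import pytest

from solution import query_user_by_name  # hypothetical target function under test


@pytest.fixture
def db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'admin')")
    yield conn
    conn.close()


def test_functional_lookup(db):
    # Functional layer: correct I/O behavior on a benign input.
    assert query_user_by_name(db, "alice") == [("alice", "admin")]


def test_security_poc_injection(db):
    # Security PoC layer: a classic injection payload must not dump the table;
    # a vulnerable string-concatenated query would return every row.
    assert query_user_by_name(db, "nobody' OR '1'='1") == []
```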

3. Dynamic Execution and Evaluation Pipeline

The benchmark evaluation pipeline for SecCodeBench-V2 is characterized by fully dynamic, containerized execution, designed for both reproducibility and isolation. The process comprises five sequential stages:

  1. Initialization: The system loads its configuration, parses scenario metadata, and verifies that all validator containers are available.
  2. Prompting: For each (case, scenario) pair, the controller constructs the required prompt and issues it to the target LLM API. Up to three iterative repair rounds are possible, with prompts adapted in response to compilation or test failures.
  3. Batch Execution: Model-generated code is compiled and executed in a language-specific validator environment within Docker sandboxes.
  4. Result Analysis: Pass/fail outcomes are recorded, Pass@K statistics computed, and scenario/severity weights applied.
  5. Outputs: Aggregated scores, detailed per-case logs, and a public leaderboard are generated.

Functional tests are always executed prior to security PoCs; only passing implementations proceed to the latter phase. All evaluation is performed within isolated containers to preclude environmental side effects and to facilitate result reproducibility.
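A minimal sketch of the batch-execution stage, assuming a hypothetical validator image name and container entrypoint (the released harness may differ in detail):

```python
# Sketch of stage 3 (batch execution); image name, mount layout, and
# entrypoint are assumptions, not the released artifacts.
import subprocess


def run_validator(language: str, workdir: str, timeout: int = 300) -> bool:
    """Compile and test candidate code inside an isolated per-language sandbox."""
    image = f"seccodebench/validator-{language}"  # hypothetical image name
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",            # isolation: no environmental side effects
        "-v", f"{workdir}:/workspace",  # mount the generated project scaffold
        image, "run-tests",             # hypothetical container entrypoint
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False  # hung builds or tests count as failures
    return result.returncode == 0  # pass/fail feeds into Pass@K aggregation
```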

For cases where security cannot be conclusively adjudicated by deterministic tests—such as those involving weak cryptography, hardcoded secrets, or information leakage—a multi-model "LLM-as-a-judge" oracle is employed. Here, an odd-sized panel of LLMs evaluates the candidate code and determines the security verdict by majority vote.
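A minimal sketch of such a majority-vote oracle, with `judges` as hypothetical callables standing in for the panel of LLMs:

```python
def judge_security(code: str, judges) -> bool:
    """Majority vote over an odd-sized panel of LLM judges.

    `judges` is a list of hypothetical callables that return True when they
    deem the candidate code secure.
    """
    assert len(judges) % 2 == 1, "panel must be odd-sized to avoid ties"
    votes = sum(1 for judge in judges if judge(code))
    return votes > len(judges) // 2
```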

4. Scoring Methodology

SecCodeBench-V2 adopts a Pass@K-based scoring protocol with scenario- and severity-weighted aggregation. The canonical case is Pass@1 (K = 1), with R (default 10) independent LLM invocations per (test case, scenario) pair. Letting z_{i,s,r} denote the binary pass/fail outcome of round r, the score is:

\text{Pass@1}(i, s) = \frac{1}{R} \sum_{r=1}^{R} z_{i,s,r}

Severity weights w^{sev}_i and scenario weights w^{scn}_s are:

  • Severity: Medium = 1.0, High = 2.0, Critical = 4.0
  • Scenario: gen = 4.0, fix = 4.0, gen-hints = 1.0, fix-hints = 1.0

Scenario-level scores and the overall benchmark score are as follows:

\text{Score}(s) = \frac{\sum_{i \in D} \text{Pass@1}(i, s) \cdot w^{sev}_i}{\sum_{i \in D} w^{sev}_i}

\text{Overall} = \frac{\sum_{s \in S} \text{Score}(s) \cdot w^{scn}_s}{\sum_{s \in S} w^{scn}_s}

All resulting scores lie in [0, 1], where 1.0 denotes complete success across all scenarios. Language-specific scores are computed analogously on the restricted case set for each programming language.
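The aggregation can be transcribed directly into a short Python sketch; the in-memory data layout is an illustrative assumption:

```python
# Direct transcription of the scoring formulas above.
SEV_W = {"Medium": 1.0, "High": 2.0, "Critical": 4.0}
SCN_W = {"gen": 4.0, "fix": 4.0, "gen-hints": 1.0, "fix-hints": 1.0}


def pass_at_1(outcomes):
    # outcomes: list of R binary (0/1) pass/fail results for one (case, scenario)
    return sum(outcomes) / len(outcomes)


def scenario_score(cases, scenario):
    # cases: {case_id: {"severity": str, "outcomes": {scenario: [0/1, ...]}}}
    num = sum(pass_at_1(c["outcomes"][scenario]) * SEV_W[c["severity"]]
              for c in cases.values())
    den = sum(SEV_W[c["severity"]] for c in cases.values())
    return num / den


def overall_score(cases):
    num = sum(scenario_score(cases, s) * w for s, w in SCN_W.items())
    return num / sum(SCN_W.values())
```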

5. Infrastructure and Reproducibility

SecCodeBench-V2 ensures reproducibility through comprehensive public release of its artifacts and a standardized containerized evaluation setup.

Each language validator is encapsulated in its own Docker image:

Language | Base Environment
Java     | Maven + JDK 11
Python   | Python 3.10 + pytest
C/C++    | GCC and CMake
Go       | Go toolchain 1.24.5
Node.js  | Node.js 16+ with npm

A central controller service (Python-based) orchestrates prompt construction, sandbox scheduling, log collation, and scoring via YAML/JSON configuration. Auxiliary utilities support prompt template management, new scenario addition, multi-round evaluation for OpenAI-compatible or host-based APIs, and post-processing for score computation and reporting.
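A hedged sketch of what a YAML-driven run configuration for such a controller might look like; all keys and values are assumptions, not the released tooling:

```python
import yaml  # PyYAML

# Hypothetical run configuration; key names are illustrative assumptions.
RUN_CONFIG = """
model_api: openai-compatible          # or a host-based API
rounds: 10                            # R invocations per (case, scenario)
repair_rounds: 3                      # max iterative repair attempts
scenarios: [gen, gen-hints, fix, fix-hints]
languages: [java, python, c_cpp, go, nodejs]
"""

config = yaml.safe_load(RUN_CONFIG)
assert config["rounds"] == 10 and len(config["scenarios"]) == 4
```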

6. Role in Secure Code Generation Research

SecCodeBench-V2 establishes a rigorous, reproducible baseline for evaluating the "security posture" of LLM-based AI coding assistants. By integrating function-level, real-world vulnerability cases, double-expert-reviewed PoC test cases, dynamic language-specific execution, provisions for semantic LLM-based judgment, and a principled scenario/severity weighting in aggregate scoring, it enables holistic, comparable assessments across both models and programming languages.

A notable implication is that SecCodeBench-V2 bridges the gap between academic benchmarking and real-world software security needs, offering practitioners and researchers a framework to both select and advance the state of secure AI-based code generation and repair (Chen et al., 17 Feb 2026).

References

  • Chen et al., 17 Feb 2026.
