Unified Security & Functionality Evaluation
- Unified Security + Functionality Evaluation is an assessment framework combining security vulnerability detection and functionality testing to ensure reliable code generation.
- The benchmarks integrate dynamic oracles for security and unit tests for functionality across multiple languages and CWEs, enabling precise and reproducible evaluations.
- This unified approach addresses the limitations of separate evaluations by ensuring that generated code meets both functional correctness and robust security standards.
Unified Security + Functionality Evaluation (CWEval/SafeGenBench) refers to rigorous assessment frameworks that simultaneously evaluate security (freedom from exploitable vulnerabilities) and functionality (specification compliance) in automatically generated or machine-assisted source code. These methodologies, which include prominent examples such as CWEval, SafeGenBench, CASTLE, DualGauge, and SecureAgentBench, are foundational in advancing empirical research into trustworthy code generation by LLMs, code agents, and static/formal verification tools. The unified approach addresses the historical weakness of evaluating functionality and security separately, a separation that frequently permits code artifacts to pass correctness benchmarks while still manifesting exploitable vulnerabilities, or vice versa.
1. Motivation and Conceptual Foundations
A unified evaluation framework for secure code generation is predicated on the necessity of assessing two critical axes: specification compliance and vulnerability freedom, for the same artifact and within the same benchmarking context. Prior benchmarks (e.g., HumanEval, APPS, SecurityEval) exclusively measure either specification-level correctness or vulnerability presence; the consequence is that functionally “correct” code may be insecure, or “secure” patches may be non-functional. Practical development mandates that code achieves both simultaneously (Pathak et al., 24 Nov 2025, Peng et al., 14 Jan 2025). The adoption of outcome-driven, task-aligned benchmarks reflects a systematic effort to uncover and correct security-functionality trade-offs across diverse architectures and programming languages.
2. Benchmark Design and Dataset Construction
Unified frameworks introduce benchmarks where each task is paired with both functional and security evaluation mechanisms. CWEval assembles 119 problems, spanning 31 CWEs and five languages, with each problem bundled with a precise function signature, an unambiguous specification, and full suites of dynamic oracles (unit tests for functionality, behavioral probes for vulnerabilities such as ASAN instrumentation and SQL injection checks) (Peng et al., 14 Jan 2025). CASTLE offers 250 micro-benchmark C programs, each annotated for the presence and location of at most one vulnerability, enabling line-level, ground-truth evaluation across 25 MITRE Top-25 classes and domains underserved by existing FV tools (Dubniczky et al., 12 Mar 2025). SafeGenBench similarly covers 44 CWE types in twelve languages, with test cases crafted around realistic developer prompts refined for functionality-neutral phrasing, then labeled by security experts (Li et al., 6 Jun 2025). DualGauge-Bench curates 154 tasks, each checked by manually validated dual test suites for rigorous joint coverage (Pathak et al., 24 Nov 2025). SecureAgentBench uniquely focuses on repository-level, multi-file vulnerability scenarios drawn from real-world commits with verifiable exploit PoCs (Chen et al., 26 Sep 2025).
| Framework | Security Evaluation | Functionality Evaluation |
|---|---|---|
| CWEval | Dynamic oracles (ASAN, injection probes) | Bundled unit tests, signature compliance |
| CASTLE | Line-level vulnerability ground truth | Attachable correctness tests, coverage metrics |
| SafeGenBench | SAST + LLM judge aggregation | Extendable unit tests (recommended in guidelines) |
| DualGauge | Agentic executor, LLM judge, PoC inputs | Coverage-enforced functional tests |
| SecureAgentBench | PoC exploit run, static analysis (Semgrep) | Differential testing against gold solution |
The rationale behind unified benchmarks includes coverage of the full CWE taxonomy, unbiased task construction, security-spec semantics without explicit hints, and standalone reproducibility. Micro-benchmarks are used for ground-truth clarity; repository-level cases expose long-context and multi-file risks encountered in genuine development scenarios.
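To make this pairing concrete, the sketch below models a unified benchmark task as a single record bundling the specification with both oracle suites. It is a minimal illustration, not the schema of any of the frameworks above; the class, field, and method names are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical schema for a unified benchmark task: every task carries
# BOTH a functional test suite and a set of security probes, so a single
# candidate solution can be scored on the two axes at once.
@dataclass
class UnifiedTask:
    task_id: str                 # e.g. "cwe-078/os-command-exec"
    cwe_ids: List[str]           # CWE classes the security probes target
    language: str                # implementation language of the task
    signature: str               # exact function signature the model must implement
    specification: str           # unambiguous natural-language spec with I/O examples
    functional_tests: List[Callable[[str], bool]] = field(default_factory=list)
    security_probes: List[Callable[[str], bool]] = field(default_factory=list)

    def is_functional(self, candidate_src: str) -> bool:
        """Candidate passes functionality iff every unit test passes."""
        return all(test(candidate_src) for test in self.functional_tests)

    def is_secure(self, candidate_src: str) -> bool:
        """Candidate passes security iff every dynamic probe observes no vulnerability."""
        return all(probe(candidate_src) for probe in self.security_probes)
```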
3. Evaluation Protocols and Metric Formulations
Unified frameworks operationalize outcome-driven or dual-judge evaluation strategies. Each code sample is subjected to both of the following (a minimal harness sketch follows the list):
- Functionality Oracles: Pass/fail unit tests, I/O assertions, differential testing against reference solutions.
- Security Oracles: Dynamic exploit probes (e.g., timeouts for DoS, ASAN for memory safety), SAST scanning, LLM-based vulnerability checkers, and in repository tasks, PoC exploit validation (Peng et al., 14 Jan 2025, Chen et al., 26 Sep 2025, Li et al., 6 Jun 2025).
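The sketch below illustrates this dual-oracle protocol on a toy Python task: the candidate is executed in a fresh interpreter, once against an ordinary I/O assertion (functionality) and once against an adversarial ReDoS-style input, with a wall-clock timeout standing in for a denial-of-service probe (security). The task, helper names, and file layout are illustrative assumptions, not those of any cited framework.

```python
import subprocess
import sys
import tempfile

def run_candidate(candidate_src: str, test_snippet: str, timeout_s: float = 5.0) -> bool:
    """Execute `candidate_src` followed by `test_snippet` in a fresh interpreter.

    Returns True iff the process exits cleanly within the time budget.
    The timeout doubles as a crude dynamic oracle for hangs / DoS behaviour.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_src + "\n" + test_snippet)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Candidate implementation of a hypothetical task: match a user-supplied pattern.
candidate = "import re\ndef matches(pattern, text):\n    return re.search(pattern, text) is not None\n"

# Functionality oracle: ordinary I/O assertions.
functional = "assert matches('ab+c', 'xxabbbcxx')\nassert not matches('ab+c', 'xyz')\n"

# Security oracle: a ReDoS-style probe; catastrophic backtracking must not hang.
security = "matches('(a+)+$', 'a' * 40 + '!')\n"

func_ok = run_candidate(candidate, functional)
sec_ok = run_candidate(candidate, security)
print(f"functional={func_ok} secure={sec_ok}")  # a secure-pass requires both to hold
```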
Formal metrics are unified at either sample, task, or aggregate levels:
- For a candidate solution $c$:
  - $\mathrm{func}(c) = 1$ if all functional tests pass, $0$ otherwise
  - $\mathrm{sec}(c) = 1$ if all security tests pass, $0$ otherwise
- Aggregated metrics:
  - CWEval: functional and security pass rates under a sampling budget of $k$, reported as pass@k-style metrics (e.g., func-pass@k, secure-pass@k) (Peng et al., 14 Jan 2025)
  - CASTLE Score: a single interpretable scale factoring true/false positives and negatives and CWE severity ranking, while penalizing over-reporting (theoretical maximum of 1250) (Dubniczky et al., 12 Mar 2025)
  - SecureAgentBench: pass rates reported over comprehensive solution categories, separating correct-and-secure solutions from functional-but-vulnerable ones (Chen et al., 26 Sep 2025)
  - DualGauge: a joint score combining $\mathrm{func}(c)$ and $\mathrm{sec}(c)$, with adjustable weighting of security versus functionality priority (Pathak et al., 24 Nov 2025)
  - SafeGenBench: a strict regime that multiplies the two binary outcomes, $\mathrm{func}(c) \cdot \mathrm{sec}(c)$, so a sample scores 1 only when it is both functional and secure (Li et al., 6 Jun 2025)
Metrics such as pass@k and secure-pass@k, and multi-objective F₁ aggregations, are employed to report empirical success under budgeted sampling regimes.
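Given the per-sample indicators above, the aggregate metrics can be computed directly. The sketch below implements the standard unbiased pass@k estimator together with a secure-pass@k analogue (counting only samples that pass both oracle suites) and an averaged strict product score; the function names are illustrative rather than taken from any framework's API.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass, is a pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def evaluate_task(results: list, k: int = 1) -> dict:
    """`results` holds (func, sec) boolean verdicts for n generations of one task."""
    n = len(results)
    c_func = sum(f for f, _ in results)            # functionally correct samples
    c_both = sum(f and s for f, s in results)      # correct AND secure samples
    return {
        "func-pass@k": pass_at_k(n, c_func, k),
        "secure-pass@k": pass_at_k(n, c_both, k),
        # strict per-sample product score (both oracles must hold), averaged
        "strict-mean": c_both / n,
    }

# Example: 10 generations, 8 functionally correct, only 3 of those also secure.
verdicts = [(True, True)] * 3 + [(True, False)] * 5 + [(False, False)] * 2
print(evaluate_task(verdicts, k=1))
```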
4. Comparative Findings and Tool Performance
Empirical results consistently show a pronounced gap between raw functional correctness and true security under unified protocols. On CWEval, for example, GPT-4o's functional pass rate substantially exceeds its secure-pass rate, and larger LLMs narrow this delta only slightly (Peng et al., 14 Jan 2025). CASTLE demonstrates LLM recall for vulnerability detection of up to 90% on micro-benchmarks, with CASTLE Scores of up to 980, surpassing static analyzers and FV tools (e.g., ESBMC, CBMC) except in domains outside FV scope (Dubniczky et al., 12 Mar 2025). DualGauge shows that GPT-5's secure-pass@1 falls roughly 77% below its pass@1 (Pathak et al., 24 Nov 2025). SecureAgentBench reveals that only a small fraction of repository-level, multi-file solutions are both correct and secure; the majority of functional fixes remain insecure or introduce new vulnerabilities (Chen et al., 26 Sep 2025). SafeGenBench results indicate that only 37.4% of zero-shot LLM code snippets are secure, with improvements from prompt engineering (Li et al., 6 Jun 2025).
These findings imply that functionality scores alone can obscure real risk, while security-only metrics ignore practical utility. Unified evaluation exposes latent failings in existing LLM code agents and benchmark methodologies.
5. Architecture and Oracles: Design Principles
CWEval and similar frameworks mandate specification clarity (full function signature, precise specification, illustrative I/O pairs), a strict separation of security expectations from prompt wording (no explicit "sanitize" hints), and the packaging of dynamic outcome-driven oracles (Peng et al., 14 Jan 2025). Benchmarks are designed to be modular and language-agnostic, enabling extension to new languages through test harness porting. CASTLE's micro-benchmarks emphasize single-vulnerability and line-level accuracy, facilitating precise tool evaluation. SecureAgentBench's multi-file, repo-scale scenarios enforce realistic developer workflows and context (Chen et al., 26 Sep 2025).
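As an illustration of an outcome-driven (behavioral) security oracle rather than a pattern-matching one, the sketch below probes a hypothetical read_user_file(base_dir, filename) task for path traversal (CWE-22) by observing what an adversarial input actually does; the task and function names are invented for this example.

```python
import os
import tempfile

def path_traversal_probe(read_user_file) -> bool:
    """Outcome-driven security oracle for a hypothetical CWE-22 task.

    Plants a secret file outside the sandbox directory and checks that a
    traversal-style filename cannot reach it. Returns True (secure) iff the
    candidate either raises or refuses to return the secret contents.
    """
    with tempfile.TemporaryDirectory() as root:
        base_dir = os.path.join(root, "public")
        os.mkdir(base_dir)
        with open(os.path.join(root, "secret.txt"), "w") as f:
            f.write("TOP-SECRET")
        try:
            leaked = read_user_file(base_dir, "../secret.txt")
        except Exception:
            return True          # rejecting the input counts as secure behaviour
        return "TOP-SECRET" not in str(leaked)

# A naive candidate that simply joins paths is flagged as insecure:
def naive_read(base_dir, filename):
    with open(os.path.join(base_dir, filename)) as f:
        return f.read()

print(path_traversal_probe(naive_read))   # False -> vulnerability observed
```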
DualGauge automates all phases of the loop (sample generation, agentic execution in sandboxed environments, semantic LLM evaluation, and aggregation with dashboard support), yielding reproducible, scalable, and comprehensive assessments (Pathak et al., 24 Nov 2025). SafeGenBench employs a two-stage dual-judge pipeline and recommends the integration of both SAST and LLM-based security evaluation, alongside functional testing and composite scoring (Li et al., 6 Jun 2025).
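A minimal sketch of the dual-judge aggregation idea follows, assuming the conservative rule that a snippet counts as secure only when both a SAST scan and an LLM-based judge come back clean; the judge interfaces and stubs below are placeholders, not SafeGenBench's actual pipeline.

```python
from typing import Callable, List

def dual_judge_secure(code: str,
                      sast_findings: Callable[[str], List[str]],
                      llm_flags_vuln: Callable[[str], bool]) -> bool:
    """Conservative aggregation: secure only if BOTH judges are clean.

    `sast_findings` returns a (possibly empty) list of static-analysis findings;
    `llm_flags_vuln` returns True when an LLM judge believes the snippet is
    vulnerable. Both callables are placeholders for real judges.
    """
    return len(sast_findings(code)) == 0 and not llm_flags_vuln(code)

def strict_score(functional_pass: bool, secure: bool) -> int:
    """Strict outcome: product of the two binary verdicts (1 only if both hold)."""
    return int(functional_pass) * int(secure)

# Toy usage with stubbed judges (illustrative heuristics only):
stub_sast = lambda code: ["possible SQL string construction"] if "format(" in code else []
stub_llm = lambda code: "execute(" in code and "%s" not in code
snippet = 'query = "SELECT * FROM t WHERE id={}".format(uid)'
print(strict_score(functional_pass=True,
                   secure=dual_judge_secure(snippet, stub_sast, stub_llm)))  # -> 0
```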
A plausible implication is that future frameworks will further refine these principles, incorporating richer side-channel analysis, information leakage quantification (as in McFIL), and diverse vulnerability categories.
6. Limitations, Trade-offs, and Extensibility
Current frameworks identify several limitations: LLMs and code agents exhibit difficulty handling long-context, multi-file edits; simple prompt augmentation fails to significantly improve secure generation; static analyzers may over- or under-report certain vulnerabilities; dynamic or LLM-based judges introduce variability and context dependence (Chen et al., 26 Sep 2025, Li et al., 6 Jun 2025). Most benchmarks still favor single-function, single-vulnerability tasks; expanding toward multi-module, cross-language problems remains an open direction.
Trade-offs documented include the “alignment tax,” where security-hardened code generations lose basic utility or functional accuracy (Peng et al., 14 Jan 2025). Instruction tuning on LLMs may actually degrade secure-pass rates despite minor functional improvements (Pathak et al., 24 Nov 2025). Ensemble approaches (combining tools) can exploit complementary strengths but risk cumulative false positives (Dubniczky et al., 12 Mar 2025).
Extensibility is achieved via modular task and oracle design, support for dual or multi-objective scoring, standardized evaluation pipelines, and open-source artifact availability (CWEval, CASTLE, DualGauge) (Peng et al., 14 Jan 2025, Dubniczky et al., 12 Mar 2025, Pathak et al., 24 Nov 2025). Recommendations consistently urge benchmark expansion to broader vulnerability classes, languages, exploit modalities, and continuous evaluation cycles.
7. Future Directions and Research Outlook
Unified security + functionality evaluation platforms represent the convergence of empirical, formal, and practical standards for trustworthy code generation. Future research will focus on dynamic expansion of benchmarks (multi-file, cross-domain, side-channel and leakage analysis), robust dual-judge ensemble evaluation, agentic repair and behavioral regression understanding, and periodic re-evaluation of evolving LLM versions and code agents. Methodologies such as McFIL offer automated quantification of functionality-inherent leakage, providing complementary metrics for privacy-centric assessment (Zinkus et al., 2023).
A plausible implication is the formation of hybrid toolchains combining outcome-driven oracles, formal verification, dynamic exploit probes, and leakage quantification to deliver fine-grained, interpretable, and standardized assessment of secure code generation capability across LLMs, code agents, and static/formal tools. Researchers and practitioners should adopt modular, reproducible, and continually extensible frameworks to drive empirical progress in secure software development.
Key references: (Peng et al., 14 Jan 2025, Dubniczky et al., 12 Mar 2025, Pathak et al., 24 Nov 2025, Li et al., 6 Jun 2025, Chen et al., 26 Sep 2025, Zinkus et al., 2023, Jensen et al., 2024, Thomborson, 2015)