MLVBench: Cross-Language Security Benchmark
- MLVBench is a curated benchmark that unifies vulnerabilities across Python, Java, C++, and Ruby using semantics-preserving code transformations.
- It comprises 33 original vulnerabilities and 106 transformed variants spanning all 25 categories of the 2024 CWE Top 25, mitigating dataset bias from training–test overlap.
- Integrated within CFCEval, it employs metrics such as Fixing Capability and ELRM to assess LLM performance on both original vulnerabilities and their transformed variants.
MLVBench is a cross-language vulnerability benchmark constructed to mitigate dataset bias and to rigorously evaluate the security-fixing generalization of code-focused LLMs. It is an integral component of CFCEval, a framework that assesses both the quality and the security of LLM-generated code while counteracting the evaluation inflation caused by inadvertent training–test overlap (Cheng et al., 6 Dec 2025).
1. Motivation and Rationale
MLVBench addresses the issue of dataset bias arising from the reuse of public-repository code snippets in both training and evaluation phases of code LLMs, which artificially inflates apparent model performance due to memorization rather than generalization. The primary aim is to build a linguistically diverse, semantically equivalent, but distributionally shifted benchmark for vulnerability repair. This is achieved by applying systematic, semantics-preserving transformations to known vulnerable code, which necessitates true generalization by the model, beyond simple pattern recall. Within CFCEval, MLVBench is the dedicated bias-mitigating dataset supporting the robust measurement of model performance on both familiar and semantically novel variants of security flaws.
2. Composition and Coverage
MLVBench unifies three pre-existing vulnerability resources to create a multi-language, function-level suite:
- PyP4LLMSec (Python vulnerabilities)
- VJBench (Java, based on Vul4J)
- CodeQL Security Examples (C++ and Ruby)
The collected statistics are as follows:
| Language | Vulnerabilities | CWE Coverage | Variants |
|---|---|---|---|
| Python | 7 | 25 | 24 |
| Java | 12 | 25 | 48 |
| C++ | 7 | 25 | 18 |
| Ruby | 7 | 25 | 16 |
| Total | 33 | 25 | 106 |
These vulnerabilities collectively address all 25 categories of the "2024 CWE Top 25 Most Dangerous Weaknesses", ensuring substantial breadth across vulnerability types such as buffer overflows, injection flaws, improper input validation, and format-string errors. The final benchmark comprises 33 original vulnerable functions and 106 transformed variants, totaling 139 evaluation instances.
3. Construction Methodology
The generation of MLVBench instances involves the following steps:
- Selection and Filtering: The union of vulnerabilities from the three source resources is filtered to ensure representation of the CWE Top 25 categories. Gaps are filled by additional sampling to cover all 25 target weakness categories.
- Semantics-Preserving Transformations: Each original vulnerable function undergoes up to seven distributionally shifting but semantics-preserving code transformations, including:
- Identifier renaming
- If-condition flipping
- Loop unrolling or rewriting
- Conditional-statement restructuring
- Function chaining or inlining
- Argument list reordering or modification
- Statement or code-block reordering
Each transformation generates a new prompt and corresponding vulnerability instance, with the guarantee that, after patching, the function exhibits input/output semantics identical to the original.
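To make the transformation step concrete, the following is a minimal sketch, using a hypothetical command-injection-prone Python function that is not drawn from the benchmark, of two of the listed rewrites: identifier renaming and if-condition flipping.

```python
import subprocess

# Hypothetical original function (illustrative only): the untrusted `path`
# argument is interpolated into a shell command (CWE-78, OS command injection).
def run_backup(path, compress):
    if compress:
        cmd = f"tar czf backup.tgz {path}"
    else:
        cmd = f"tar cf backup.tar {path}"
    subprocess.run(cmd, shell=True)  # vulnerable: shell=True with untrusted input


# Transformed variant: identifiers renamed, the if-condition negated and its
# branches swapped. Input/output behaviour is unchanged and the same CWE-78
# flaw is preserved, so memorization of the original surface form does not help.
def archive_target(target_dir, use_gzip):
    if not use_gzip:
        shell_cmd = f"tar cf backup.tar {target_dir}"
    else:
        shell_cmd = f"tar czf backup.tgz {target_dir}"
    subprocess.run(shell_cmd, shell=True)  # the preserved vulnerability
```

A correct patch, for instance passing an argument list instead of a formatted shell string, applies equally well to both versions, which is what makes the variant a fair test of generalized repair capability rather than recall.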
4. Integration into CFCEval and Evaluation Protocol
Within CFCEval, MLVBench provides:
- A pair of evaluation triples for each vulnerability (see the sketch below):
- Original triple: the original vulnerable function, its vulnerability description, and the human reference patch
- Transformed triple: the transformed version of the function, with its corresponding vulnerability description and patch
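As a rough illustration of how these paired triples might be represented when driving an evaluation harness, the following sketch defines two plain dataclasses; the class and field names are assumptions made here for illustration and are not CFCEval's actual data model.

```python
from dataclasses import dataclass

@dataclass
class VulnTriple:
    """One evaluation unit: a vulnerable function, its description, and the reference patch."""
    vulnerable_code: str   # source of the function containing the flaw
    description: str       # natural-language vulnerability description (e.g., CWE summary)
    reference_patch: str   # human-written fixed version of the function

@dataclass
class BenchmarkCase:
    """A paired original/transformed MLVBench instance (illustrative structure)."""
    language: str            # "python", "java", "cpp", or "ruby"
    cwe_id: str              # e.g., "CWE-78"
    original: VulnTriple
    transformed: VulnTriple  # semantics-preserving variant of `original`
```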
CFCEval employs MLVBench to evaluate LLM-generated code using four specific dimensions:
- Programming Language Quality (PLanQul.)
- Fixing Capability (FixCap.): success on original prompt
- Post-Transformation Fixing Capability (PTFixCap.): success on transformed prompt
- Element-Level Relevance (ELeRelv.): assessed with the new ELRM metric
MLVBench thereby benchmarks both direct and robustness-oriented model performance, measuring not only whether LLMs can generate fixes, but also whether that capability generalizes under semantics-preserving changes to surface code structure.
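As a rough sketch of this protocol, the function below (reusing the hypothetical `BenchmarkCase` type from the sketch above) queries a model on both the original and transformed prompts and aggregates the two fixing-capability rates; `generate_fix` and `judge_is_secure` are placeholder hooks standing in for the model under test and for CFCEval's judging step, not functions the framework actually exposes.

```python
from typing import Callable, Iterable

def evaluate_fixing_capability(
    cases: Iterable[BenchmarkCase],                # paired instances, as sketched earlier
    generate_fix: Callable[[str, str], str],       # (vulnerable_code, description) -> candidate fix
    judge_is_secure: Callable[[str, str], bool],   # (candidate, reference_patch) -> fix accepted?
) -> dict:
    """Aggregate FixCap. (original prompts) and PTFixCap. (transformed prompts)."""
    fixed_orig = fixed_trans = total = 0
    for case in cases:
        total += 1
        # Original prompt: can the model repair the vulnerability as published?
        candidate = generate_fix(case.original.vulnerable_code, case.original.description)
        fixed_orig += judge_is_secure(candidate, case.original.reference_patch)
        # Transformed prompt: does the capability survive the semantics-preserving rewrite?
        candidate_t = generate_fix(case.transformed.vulnerable_code, case.transformed.description)
        fixed_trans += judge_is_secure(candidate_t, case.transformed.reference_patch)
    return {
        "FixCap": fixed_orig / total,     # fixing capability rate (FCR)
        "PTFixCap": fixed_trans / total,  # post-transformation fixing rate (PTFCR)
    }
```

A gap between the two rates on the same case set is the signal of interest: a model that scores well on FixCap but drops sharply on PTFixCap is likely relying on memorized surface patterns rather than generalizable repair capability.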
5. Metrics, Empirical Results, and Comparative Analysis
The empirical framework incorporates both classical and novel metrics. The new ELRM metric aligns more closely with human and LLM-based expert judgments than BLEU or CodeBLEU under MLVBench's distributional shifts. In evaluations on a random 20-case subset (80 prompt–response pairs across four models), average scores were as follows (ELRM, BLEU, and CodeBLEU are on a 0–100 scale; the LLM-judge rating is reported on a smaller scale):
| Model | ELRM | BLEU | CodeBLEU | LLM Judge |
|---|---|---|---|---|
| Cursor | 24.72 | 30.33 | 29.98 | 2.38 |
| GitHub Copilot | 22.93 | 29.43 | 29.43 | 2.90 |
| CodeGeeX4 | 29.19 | 36.22 | 30.75 | 2.40 |
| DeepSeekCoder | 18.65 | 21.19 | 24.16 | 1.92 |
A paired t-test on ELRM between CodeGeeX4 and DeepSeekCoder (t = 2.81, p = 0.0056) demonstrates discriminative power under high input diversity. Pearson correlations of ELRM with both LLM-judge and human scores (reported for the Cursor responses) exceed those of BLEU and CodeBLEU, substantiating its effectiveness.
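For readers who want to reproduce this style of analysis on their own evaluation runs, the snippet below shows the standard scipy calls for a paired t-test and a Pearson correlation; the score arrays are synthetic placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholder scores standing in for per-case metric values (20 cases);
# in practice these come from scoring each model's responses on the evaluation subset.
elrm_model_a = rng.normal(29, 8, size=20)  # e.g., CodeGeeX4-like ELRM scores
elrm_model_b = rng.normal(19, 8, size=20)  # e.g., DeepSeekCoder-like ELRM scores
human_scores = rng.normal(3, 1, size=20)   # human expert ratings for model A's responses

# Paired t-test on per-case ELRM differences between the two models.
t_stat, p_value = stats.ttest_rel(elrm_model_a, elrm_model_b)

# Pearson correlation of ELRM with human judgment for the same responses.
r, r_p = stats.pearsonr(elrm_model_a, human_scores)

print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"Pearson r(ELRM, human) = {r:.2f} (p = {r_p:.4f})")
```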
The fixing capability rate (FCR) and post-transformation fixing capability rate (PTFCR) quantify security-fixing robustness; for example, GitHub Copilot achieved FCR = 25% and PTFCR = 30% on 20 tasks (as judged by GPT-Scorer).
6. Strengths and Limitations
Strengths:
- Enables cross-language evaluation spanning Python, Java, C++, and Ruby.
- Controls for dataset overlap via semantics-preserving yet structurally diverse transformations.
- Maps every (original, transformed) vulnerability to precise human reference patches, supporting robust and fine-grained assessment.
- Broad CWE category coverage reflects real-world security risk landscape.
Limitations:
- Small scale: 33 vulnerabilities and 106 variants may not capture the diversity of real-world software.
- Single-function focus: excludes multi-function or inter-procedural vulnerabilities.
- Does not address dynamic or data-flow-based obfuscations.
- Manual curation may introduce subtle selection bias toward more easily teachable bugs.
7. Recommendations and Future Directions
MLVBench should be employed as a bias-mitigation layer within larger code-generation benchmarks for Code LLMs. Recommendations include:
- Extending MLVBench with new languages (e.g., Go, JavaScript) and additional vulnerability types (e.g., race conditions).
- Augmenting static transformations with dynamic analysis and fuzzing to capture vulnerabilities manifesting only during execution.
- Regular adoption of MLVBench’s FCR, PTFCR, and ELRM metrics in model evaluation pipelines to continuously measure generalization and security robustness improvements.
By rigorously quantifying both dataset-overlap bias and true security-fixing capability, MLVBench—when used within CFCEval’s multidimensional framework—offers a principled approach for advancing secure code-generation research and for developing LLMs with genuine generalization beyond memorized training data (Cheng et al., 6 Dec 2025).