MLVBench: Cross-Language Security Benchmark
- MLVBench is a curated benchmark that unifies vulnerabilities across Python, Java, C++, and Ruby using semantics-preserving code transformations.
- It comprises 33 original vulnerabilities and 106 transformed variants spanning all 25 categories of the 2024 CWE Top 25, mitigating dataset bias from training–test overlap.
- Integrated within CFCEval, it employs metrics such as Fixing Capability and ELRM to assess LLM performance on both original vulnerabilities and their transformed variants.
MLVBench is a cross-language vulnerability benchmark constructed to mitigate dataset bias and to rigorously evaluate the security-fixing generalization of code-focused LLMs. It is an integral component of CFCEval, a framework that assesses both the quality and the security of LLM-generated code while counteracting the evaluation inflation caused by inadvertent training–test overlap (Cheng et al., 6 Dec 2025).
1. Motivation and Rationale
MLVBench addresses the issue of dataset bias arising from the reuse of public-repository code snippets in both training and evaluation phases of code LLMs, which artificially inflates apparent model performance due to memorization rather than generalization. The primary aim is to build a linguistically diverse, semantically equivalent, but distributionally shifted benchmark for vulnerability repair. This is achieved by applying systematic, semantics-preserving transformations to known vulnerable code, which necessitates true generalization by the model, beyond simple pattern recall. Within CFCEval, MLVBench is the dedicated bias-mitigating dataset supporting the robust measurement of model performance on both familiar and semantically novel variants of security flaws.
2. Composition and Coverage
MLVBench unifies three pre-existing vulnerability resources to create a multi-language, function-level suite:
- PyP4LLMSec (Python vulnerabilities)
- VJBench (Java, based on Vul4J)
- CodeQL Security Examples (C++ and Ruby)
The collected statistics are as follows:
| Language | Vulnerabilities | CWE Coverage | Variants |
|---|---|---|---|
| Python | 7 | 25 | 24 |
| Java | 12 | 25 | 48 |
| C++ | 7 | 25 | 18 |
| Ruby | 7 | 25 | 16 |
| Total | 33 | 25 | 106 |
These vulnerabilities collectively address all 25 categories of the "2024 CWE Top 25 Most Dangerous Weaknesses", ensuring substantial breadth across vulnerability types such as buffer overflows, injection flaws, improper input validation, and format-string errors. The final benchmark comprises 33 original vulnerable functions and 106 transformed variants, totaling 139 evaluation instances.
3. Construction Methodology
The generation of MLVBench instances involves the following steps:
- Selection and Filtering: The union of vulnerabilities from the three source resources is filtered to ensure representation of the CWE Top 25 categories. Gaps are filled by additional sampling to cover all 25 target weakness categories.
- Semantics-Preserving Transformations: Each original vulnerable function undergoes up to seven distributionally shifting but semantics-preserving code transformations, including:
- Identifier renaming
- If-condition flipping
- Loop unrolling or rewriting
- Conditional-statement restructuring
- Function chaining or inlining
- Argument list reordering or modification
- Statement or code-block reordering
Each transformation generates a new prompt and corresponding vulnerability instance, with the guarantee that, after patching, the function exhibits input/output semantics identical to the original.
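To make the transformation step concrete, the following is a minimal sketch, using a hypothetical command-injection-prone Python function that is not drawn from the benchmark, of two of the listed rewrites: identifier renaming and if-condition flipping.

```python
import subprocess

# Hypothetical original function (illustrative only): the untrusted `path`
# argument is interpolated into a shell command (CWE-78, OS command injection).
def run_backup(path, compress):
    if compress:
        cmd = f"tar czf backup.tgz {path}"
    else:
        cmd = f"tar cf backup.tar {path}"
    subprocess.run(cmd, shell=True)  # vulnerable: shell=True with untrusted input


# Transformed variant: identifiers renamed, the if-condition negated and its
# branches swapped. Input/output behaviour is unchanged and the same CWE-78
# flaw is preserved, so memorization of the original surface form does not help.
def archive_target(target_dir, use_gzip):
    if not use_gzip:
        shell_cmd = f"tar cf backup.tar {target_dir}"
    else:
        shell_cmd = f"tar czf backup.tgz {target_dir}"
    subprocess.run(shell_cmd, shell=True)  # the preserved vulnerability
```

A correct patch, for instance passing an argument list instead of a formatted shell string, applies equally well to both versions, which is what makes the variant a fair test of generalized repair capability rather than recall.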
4. Integration into CFCEval and Evaluation Protocol
Within CFCEval, MLVBench provides:
- A pair of evaluation triples for each vulnerability (see the sketch below):
- Original triple: the original vulnerable function, its vulnerability description, and the human reference patch
- Transformed triple: the transformed version of the function, with its corresponding vulnerability description and patch
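As a rough illustration of how these paired triples might be represented when driving an evaluation harness, the following sketch defines two plain dataclasses; the class and field names are assumptions made here for illustration and are not CFCEval's actual data model.

```python
from dataclasses import dataclass

@dataclass
class VulnTriple:
    """One evaluation unit: a vulnerable function, its description, and the reference patch."""
    vulnerable_code: str   # source of the function containing the flaw
    description: str       # natural-language vulnerability description (e.g., CWE summary)
    reference_patch: str   # human-written fixed version of the function

@dataclass
class BenchmarkCase:
    """A paired original/transformed MLVBench instance (illustrative structure)."""
    language: str            # "python", "java", "cpp", or "ruby"
    cwe_id: str              # e.g., "CWE-78"
    original: VulnTriple
    transformed: VulnTriple  # semantics-preserving variant of `original`
```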
CFCEval employs MLVBench to evaluate LLM-generated code using four specific dimensions:
- Programming Language Quality (PLanQul.)
- Fixing Capability (FixCap.): success on original prompt
- Post-Transformation Fixing Capability (PTFixCap.): success on transformed prompt
- Element-Level Relevance (ELeRelv.): assessed with the new ELRM metric
MLVBench thereby benchmarks both direct and robustness-oriented model performance, measuring not only whether LLMs can generate fixes, but also whether that capability generalizes under semantics-preserving changes to surface code structure.
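As a rough sketch of this protocol, the function below (reusing the hypothetical `BenchmarkCase` type from the sketch above) queries a model on both the original and transformed prompts and aggregates the two fixing-capability rates; `generate_fix` and `judge_is_secure` are placeholder hooks standing in for the model under test and for CFCEval's judging step, not functions the framework actually exposes.

```python
from typing import Callable, Iterable

def evaluate_fixing_capability(
    cases: Iterable[BenchmarkCase],                # paired instances, as sketched earlier
    generate_fix: Callable[[str, str], str],       # (vulnerable_code, description) -> candidate fix
    judge_is_secure: Callable[[str, str], bool],   # (candidate, reference_patch) -> fix accepted?
) -> dict:
    """Aggregate FixCap. (original prompts) and PTFixCap. (transformed prompts)."""
    fixed_orig = fixed_trans = total = 0
    for case in cases:
        total += 1
        # Original prompt: can the model repair the vulnerability as published?
        candidate = generate_fix(case.original.vulnerable_code, case.original.description)
        fixed_orig += judge_is_secure(candidate, case.original.reference_patch)
        # Transformed prompt: does the capability survive the semantics-preserving rewrite?
        candidate_t = generate_fix(case.transformed.vulnerable_code, case.transformed.description)
        fixed_trans += judge_is_secure(candidate_t, case.transformed.reference_patch)
    return {
        "FixCap": fixed_orig / total,     # fixing capability rate (FCR)
        "PTFixCap": fixed_trans / total,  # post-transformation fixing rate (PTFCR)
    }
```

A gap between the two rates on the same case set is the signal of interest: a model that scores well on FixCap but drops sharply on PTFixCap is likely relying on memorized surface patterns rather than generalizable repair capability.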
5. Metrics, Empirical Results, and Comparative Analysis
The empirical framework incorporates both classical and novel metrics. The new ELRM metric aligns more closely with human and LLM-based expert judgments than BLEU or CodeBLEU under MLVBench's distributional shifts. In evaluations on a random 20-case subset (80 prompt–response pairs across four models), average scores were as follows (ELRM, BLEU, and CodeBLEU are on a 0–100 scale; the LLM-judge rating is reported on a smaller scale):
| Model | ELRM | BLEU | CodeBLEU | LLM Judge |
|---|---|---|---|---|
| Cursor | 24.72 | 30.33 | 29.98 | 2.38 |
| GitHub Copilot | 22.93 | 29.43 | 29.43 | 2.90 |
| CodeGeeX4 | 29.19 | 36.22 | 30.75 | 2.40 |
| DeepSeekCoder | 18.65 | 21.19 | 24.16 | 1.92 |
A paired t-test on ELRM between CodeGeeX4 and DeepSeekCoder (t = 2.81, p = 0.0056) demonstrates discriminative power under high input diversity. Pearson correlations of ELRM with both LLM-judge and human scores (reported for the Cursor responses) exceed those of BLEU and CodeBLEU, substantiating its effectiveness.
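For readers who want to reproduce this style of analysis on their own evaluation runs, the snippet below shows the standard scipy calls for a paired t-test and a Pearson correlation; the score arrays are synthetic placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic placeholder scores standing in for per-case metric values (20 cases);
# in practice these come from scoring each model's responses on the evaluation subset.
elrm_model_a = rng.normal(29, 8, size=20)  # e.g., CodeGeeX4-like ELRM scores
elrm_model_b = rng.normal(19, 8, size=20)  # e.g., DeepSeekCoder-like ELRM scores
human_scores = rng.normal(3, 1, size=20)   # human expert ratings for model A's responses

# Paired t-test on per-case ELRM differences between the two models.
t_stat, p_value = stats.ttest_rel(elrm_model_a, elrm_model_b)

# Pearson correlation of ELRM with human judgment for the same responses.
r, r_p = stats.pearsonr(elrm_model_a, human_scores)

print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"Pearson r(ELRM, human) = {r:.2f} (p = {r_p:.4f})")
```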
The fixing capability rate (FCR) and post-transformation fixing capability rate (PTFCR) quantify security-fixing robustness; for example, GitHub Copilot achieved FCR = 25% and PTFCR = 30% on 20 tasks (as judged by GPT-Scorer).
6. Strengths and Limitations
Strengths:
- Enables cross-language evaluation spanning Python, Java, C++, and Ruby.
- Controls for dataset overlap via semantics-preserving yet structurally diverse transformations.
- Maps every (original, transformed) vulnerability to precise human reference patches, supporting robust and fine-grained assessment.
- Broad CWE category coverage reflects real-world security risk landscape.
Limitations:
- Small scale: 33 vulnerabilities and 106 variants may not capture the diversity of real-world software.
- Single-function focus: excludes multi-function or inter-procedural vulnerabilities.
- Does not address dynamic or data-flow-based obfuscations.
- Manual curation may introduce subtle selection bias toward more easily teachable bugs.
7. Recommendations and Future Directions
MLVBench should be employed as a bias-mitigation layer within larger code-generation benchmarks for Code LLMs. Recommendations include:
- Extending MLVBench with new languages (e.g., Go, JavaScript) and additional vulnerability types (e.g., race conditions).
- Augmenting static transformations with dynamic analysis and fuzzing to capture vulnerabilities manifesting only during execution.
- Regular adoption of MLVBench’s FCR, PTFCR, and ELRM metrics in model evaluation pipelines to continuously measure generalization and security robustness improvements.
By rigorously quantifying both dataset-overlap bias and true security-fixing capability, MLVBench—when used within CFCEval’s multidimensional framework—offers a principled approach for advancing secure code-generation research and for developing LLMs with genuine generalization beyond memorized training data (Cheng et al., 6 Dec 2025).