
MLVBench: Cross-Language Security Benchmark

Updated 13 December 2025
  • MLVBench is a curated benchmark that unifies vulnerabilities across Python, Java, C++, and Ruby using semantics-preserving code transformations.
  • It comprises 33 original vulnerabilities and 106 transformed variants covering 25 CWE categories, reducing the risk that training–test overlap inflates evaluation results.
  • Integrated within CFCEval, it employs metrics such as Fixing Capability, Post-Transformation Fixing Capability, and ELRM to assess both direct fixing performance and generalization to transformed variants.

MLVBench is a cross-language vulnerability benchmark specifically constructed to mitigate dataset bias and rigorously evaluate the security-fixing generalization capabilities of code-focused LLMs. It is designed as an integral component of CFCEval, a framework addressing both the quality and security aspects of LLM-generated code by overcoming the common problem of evaluation inflation due to inadvertent training–test overlap (Cheng et al., 6 Dec 2025).

1. Motivation and Rationale

MLVBench addresses the issue of dataset bias arising from the reuse of public-repository code snippets in both training and evaluation phases of code LLMs, which artificially inflates apparent model performance due to memorization rather than generalization. The primary aim is to build a linguistically diverse, semantically equivalent, but distributionally shifted benchmark for vulnerability repair. This is achieved by applying systematic, semantics-preserving transformations to known vulnerable code, which necessitates true generalization by the model, beyond simple pattern recall. Within CFCEval, MLVBench is the dedicated bias-mitigating dataset supporting the robust measurement of model performance on both familiar and semantically novel variants of security flaws.

2. Composition and Coverage

MLVBench unifies three pre-existing vulnerability resources to create a multi-language, function-level suite:

  • PyP4LLMSec (Python vulnerabilities)
  • VJBench (Java, based on Vul4J)
  • CodeQL Security Examples (C++ and Ruby)

The collected statistics are as follows:

| Language | Vulnerabilities | CWE Coverage | Variants |
|----------|-----------------|--------------|----------|
| Python   | 7               | 25           | 24       |
| Java     | 12              | 25           | 48       |
| C++      | 7               | 25           | 18       |
| Ruby     | 7               | 25           | 16       |
| Total    | 33              | 25           | 106      |

These vulnerabilities collectively cover all 25 categories of the "2024 CWE Top 25 Most Dangerous Software Weaknesses", ensuring substantial breadth across vulnerability types such as buffer overflows, injection flaws, improper input validation, and format-string errors. The final benchmark comprises 33 original vulnerable functions and 106 transformed variants, totaling 139 evaluation instances.

3. Construction Methodology

The generation of MLVBench instances involves the following steps:

  • Selection and Filtering: The union of vulnerabilities from the three source resources is filtered to ensure representation of the CWE Top 25 categories. Gaps are filled by additional sampling to cover all 25 target weakness categories.
  • Semantics-Preserving Transformations: Each original vulnerable function F undergoes up to seven distributionally shifting but semantics-preserving code transformations, including:

    1. Identifier renaming
    2. If-condition flipping
    3. Loop unrolling or rewriting
    4. Conditional-statement restructuring
    5. Function chaining or inlining
    6. Argument list reordering or modification
    7. Statement or code-block reordering

Each transformation generates a new prompt F_t and corresponding vulnerability instance C_{v,t}, with the guarantee that, after patching, the function demonstrates identical input/output semantics to the original.
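
To make the idea concrete, the following is a minimal, hypothetical sketch (not an actual MLVBench instance) of how identifier renaming and if-condition flipping might recast a Python function containing an SQL injection flaw (CWE-89). The flaw, and the fix it requires (query parameterization), are unchanged by the transformation.

```python
# Hypothetical original vulnerable function F (CWE-89: SQL injection).
def find_user(db, username):
    if username:
        # Vulnerable: user input is interpolated directly into the query string.
        return db.execute(f"SELECT * FROM users WHERE name = '{username}'")
    return None


# Transformed variant F_t: identifier renaming plus if-condition flipping.
# The flaw and the required fix (a parameterized query) are identical to F.
def lookup_account(conn, account_name):
    if not account_name:
        return None
    return conn.execute(f"SELECT * FROM users WHERE name = '{account_name}'")
```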

4. Integration into CFCEval and Evaluation Protocol

Within CFCEval, MLVBench provides:

  • Pairs of evaluation triples for each vulnerability:
    • (F, C_v, C_r): original vulnerable function, vulnerability description, and human patch
    • (F_t, C_{v,t}, C_{r,t}): transformed version with its corresponding vulnerability description and patch

CFCEval employs MLVBench to evaluate LLM-generated code C_g using four specific dimensions:

  1. Programming Language Quality (PLanQul.)
  2. Fixing Capability (FixCap.): success on original prompt
  3. Post-Transformation Fixing Capability (PTFixCap.): success on transformed prompt
  4. Element-Level Relevance (ELeRelv.): assessed with the new ELRM metric

MLVBench benchmarks both direct and robustness-oriented model performance, measuring not only whether LLMs can generate fixes, but also whether such capabilities generalize under superficial but non-trivial changes of surface code structure.
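
A minimal sketch of how the paired triples could be represented when driving such an evaluation is shown below; the class and field names are illustrative assumptions, not CFCEval's actual schema.

```python
from dataclasses import dataclass


@dataclass
class VulnInstance:
    """One evaluation instance: a vulnerable function with its description and
    reference patch (field names are illustrative, not the benchmark schema)."""
    code: str           # vulnerable function F (or transformed variant F_t)
    description: str    # vulnerability description C_v (or C_{v,t})
    reference_fix: str  # human reference patch C_r (or C_{r,t})
    cwe_id: str         # e.g. "CWE-89"
    language: str       # "python" | "java" | "cpp" | "ruby"


@dataclass
class EvalPair:
    """An original instance paired with one of its semantics-preserving
    variants, enabling direct (FixCap.) vs. post-transformation (PTFixCap.)
    comparison of the same model on the same underlying flaw."""
    original: VulnInstance
    variant: VulnInstance
```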

5. Metrics, Empirical Results, and Comparative Analysis

The empirical framework incorporates both classical and novel metrics. The new ELRM metric yields higher alignment with human and LLM-based expert judgments than BLEU or CodeBLEU under MLVBench’s distributional shifts. In evaluations on a random 20-case subset (80 prompt–response pairs across four models), the average scores were as follows (ELRM, BLEU, and CodeBLEU on a 0–100 scale; the final column is the average LLM-judge score):

| Model          | ELRM  | BLEU  | CodeBLEU | LLM Judge |
|----------------|-------|-------|----------|-----------|
| Cursor         | 24.72 | 30.33 | 29.98    | 2.38      |
| GitHub Copilot | 22.93 | 29.43 | 29.43    | 2.90      |
| CodeGeeX4      | 29.19 | 36.22 | 30.75    | 2.40      |
| DeepSeekCoder  | 18.65 | 21.19 | 24.16    | 1.92      |

A paired t-test on ELRM between CodeGeeX4 and DeepSeekCoder (t = 2.81, p = 0.0056) demonstrates discriminative power under high input diversity. Correlations of ELRM with both LLM-judge and human scores (Pearson ρ = 0.816 and 0.828, respectively, on Cursor outputs) surpass those of BLEU or CodeBLEU, substantiating its effectiveness.
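
The statistical checks above can be reproduced with standard tooling; the sketch below uses illustrative per-case arrays (not the paper's data) to show a paired t-test between two models' ELRM scores and the Pearson correlation of ELRM against an external judge.

```python
# Illustrative statistical comparison; the arrays are made-up per-case scores,
# not MLVBench results.
from scipy.stats import pearsonr, ttest_rel

# Per-case ELRM scores for two models evaluated on the same prompts (paired samples).
elrm_model_a = [31.0, 24.5, 40.2, 18.7, 27.9, 33.4]
elrm_model_b = [22.1, 19.8, 30.4, 15.2, 21.6, 25.0]
t_stat, p_value = ttest_rel(elrm_model_a, elrm_model_b)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# Agreement of ELRM with an external judge (LLM-based or human) per case.
judge_scores = [3.0, 2.5, 4.0, 1.5, 2.5, 3.5]
rho, _ = pearsonr(elrm_model_a, judge_scores)
print(f"Pearson correlation with judge scores: rho = {rho:.3f}")
```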

The fixing capability rate (FCR) and post-transformation fixing capability rate (PTFCR) quantify security robustness; for example, Copilot achieved FCR = 25% and PTFCR = 30% on 20 tasks (as judged by GPT-Scorer).
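
Both rates reduce to the fraction of cases judged fixed, over original and transformed prompts respectively; the helper below is an illustrative sketch assuming one boolean pass/fail judgment per case, not CFCEval's implementation.

```python
def fixing_rates(original_fixed: list[bool], transformed_fixed: list[bool]) -> tuple[float, float]:
    """Illustrative FCR / PTFCR computation: each list holds one pass/fail
    judgment (e.g. from a GPT-based scorer) per evaluation case."""
    fcr = 100.0 * sum(original_fixed) / len(original_fixed)
    ptfcr = 100.0 * sum(transformed_fixed) / len(transformed_fixed)
    return fcr, ptfcr


# Example: 5 of 20 original prompts fixed (FCR = 25%), 6 of 20 variants fixed (PTFCR = 30%).
fcr, ptfcr = fixing_rates([True] * 5 + [False] * 15, [True] * 6 + [False] * 14)
print(f"FCR = {fcr:.0f}%, PTFCR = {ptfcr:.0f}%")
```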

6. Strengths and Limitations

Strengths:

  • Enables cross-language evaluation spanning Python, Java, C++, and Ruby.
  • Controls for dataset overlap via semantics-preserving yet structurally diverse transformations.
  • Maps every (original, transformed) vulnerability to precise human reference patches, supporting robust and fine-grained assessment.
  • Broad CWE category coverage reflects real-world security risk landscape.

Limitations:

  • Small scale: 33 vulnerabilities and 106 variants may not reflect full-scale software diversity.
  • Single-function focus: excludes multi-function or inter-procedural vulnerabilities.
  • Does not address dynamic or data-flow-based obfuscations.
  • Manual curation may introduce subtle selection bias toward more easily teachable bugs.

7. Recommendations and Future Directions

MLVBench should be employed as a bias-mitigation layer within larger code-generation benchmarks for Code LLMs. Recommendations include:

  • Extending MLVBench with new languages (e.g., Go, JavaScript) and additional vulnerability types (e.g., race conditions).
  • Augmenting static transformations with dynamic analysis and fuzzing to capture vulnerabilities manifesting only during execution.
  • Regular adoption of MLVBench’s FCR, PTFCR, and ELRM metrics in model evaluation pipelines to continuously measure generalization and security robustness improvements.

By rigorously quantifying both dataset-overlap bias and true security-fixing capability, MLVBench—when used within CFCEval’s multidimensional framework—offers a principled approach for advancing secure code-generation research and for developing LLMs with genuine generalization beyond memorized training data (Cheng et al., 6 Dec 2025).
