
CFCEval: LLM Code and CFI Security Evaluation

Updated 13 December 2025
  • CFCEval comprises two distinct pillars: one for evaluating vulnerability fixes in LLM-generated code using the MLVBench/ELRM suite, and another for assessing LLVM CFI mechanisms via a detailed taxonomy.
  • The framework’s ELRM metric decomposes code into lexical units, operators, and string literals, achieving superior correlation (up to 0.86) with human judgments compared to standard metrics like CodeBLEU.
  • The LLVM CFI taxonomy guides a phased deployment strategy by matching specific CFI variants to corresponding vulnerability mitigations, validated by empirical testing on real-world CVEs.

CFCEval encompasses two distinct families of research tools and benchmarks. The first is a framework for multidimensional evaluation of vulnerability-fixing and security properties in code generated by LLMs (Cheng et al., 6 Dec 2025). The second is a taxonomy and empirical protocol for assessing practical exploitation mitigation by LLVM's Control Flow Integrity (CFI) mechanisms (Houy et al., 21 Aug 2025). This article presents a comprehensive survey of both pillars, outlining the unique methodologies, core metrics, empirical outcomes, and broader implications for each.

1. Motivation: Evaluating Security and Correctness in Code

LLM-generated code is increasingly integrated into software systems, introducing concerns about both general programming quality and latent security vulnerabilities. Traditional metrics (e.g., CodeBLEU) suffer from dataset bias and lack fine-grained sensitivity to semantic security fixes, while existing compiler-integrated security mechanisms such as CFI need rigorous, real-world deployment guidance and empirical validation. CFCEval frameworks in both domains address these gaps: CFCEval for LLMs provides a multidimensional benchmark and metrics suite, while CFCEval for CFI offers actionable taxonomy and live exploit blocking evaluation (Cheng et al., 6 Dec 2025, Houy et al., 21 Aug 2025).

2. CFCEval for LLM-Generated Code: Benchmark Architecture and Dataset

CFCEval for code generation evaluation comprises two primary components: the MLVBench dataset and the ELRM metric. MLVBench integrates Python, Java, C/C++, and Ruby vulnerable code drawn from PyP4LLMSec, VJBench, and CodeQL, systematically transformed via identifier renaming and control-flow refactoring to induce a distributional shift and minimize training–testing overlap. The benchmark covers 33 Common Weakness Enumeration (CWE)-annotated vulnerability types and includes 139 function-level ground-truth/fixed/transformed function pairs, ensuring broad vulnerability and language coverage (Cheng et al., 6 Dec 2025).

| Language | Vulnerabilities | Original + Variants | CWE Coverage |
|---|---|---|---|
| Python | 7 | 24 | CWE Top-25, select extra |
| Java | 12 | 48 | CWE Top-25, select extra |
| C/C++, Ruby | 14 | 67 | CWE Top-25, select extra |

MLVBench facilitates evaluation with multiple reference patches per function, controlling for reference diversity and enabling element-level relevance measures.
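
As an illustration of this structure, the sketch below models one function-level entry as a vulnerable function with one or more reference fixes and a transformed variant. The class, field names, and example content are hypothetical and do not reflect the released MLVBench schema.

```python
from dataclasses import dataclass, field

@dataclass
class MLVBenchEntry:
    """Hypothetical container for one function-level benchmark item."""
    language: str                      # e.g. "python", "java", "c", "ruby"
    cwe_id: str                        # CWE annotation, e.g. "CWE-89"
    vulnerable_code: str               # original vulnerable function
    reference_fixes: list[str] = field(default_factory=list)  # ground-truth patches
    transformed_code: str = ""         # identifier-renamed / control-flow-refactored variant

# Toy example (not taken from the dataset): an SQL-injection-style item
entry = MLVBenchEntry(
    language="python",
    cwe_id="CWE-89",
    vulnerable_code='cur.execute("SELECT * FROM users WHERE id = " + uid)',
    reference_fixes=['cur.execute("SELECT * FROM users WHERE id = ?", (uid,))'],
    transformed_code='db_cur.execute("SELECT * FROM users WHERE id = " + user_id)',
)
```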

3. Core Metrics: ELRM and Multi-Axis Scoring

The Element-Level Relevance Metric (ELRM) addresses the limitations of CodeBLEU by decomposing outputs into lexical units, operators, and string literals and applying a weighted sum of sub-metrics:

  • Standard BLEU for n-gram coincidence.
  • Weighted BLEU emphasizing language keywords.
  • BLEU restricted to ordered programming keywords/operators.
  • Mean of Levenshtein, SequenceMatcher, and Jaccard string similarity over string literals.

The formal scoring rule is

$$\text{ELRM} = \alpha \cdot \mathrm{BLEU} + \beta \cdot \mathrm{BLEU}_{\text{weight}} + \lambda \cdot \mathrm{BLEU}_{\text{keywords\_ops}} + \mu \cdot \mathrm{Similarity}_{\text{string\_literal}}$$

where α, β, λ, and μ weight the four sub-metrics listed above.
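
The snippet below is a minimal sketch of this scoring rule. The string-literal similarity follows the description above (mean of Levenshtein, SequenceMatcher, and Jaccard similarity), while the three BLEU components and the weights α, β, λ, μ are passed in as placeholders, since the paper's exact implementations and weight values are not reproduced here.

```python
from difflib import SequenceMatcher

def levenshtein_ratio(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1] (simple DP implementation)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character sets (token sets would also work)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def string_literal_similarity(cand: str, ref: str) -> float:
    """Mean of Levenshtein, SequenceMatcher, and Jaccard similarity (Section 3)."""
    return (levenshtein_ratio(cand, ref)
            + SequenceMatcher(None, cand, ref).ratio()
            + jaccard(cand, ref)) / 3.0

def elrm(bleu: float, bleu_weighted: float, bleu_keywords_ops: float,
         sim_string_literal: float,
         alpha: float = 0.25, beta: float = 0.25,
         lam: float = 0.25, mu: float = 0.25) -> float:
    """Weighted sum from the ELRM definition; the equal weights here are
    illustrative placeholders, not the values used in the paper."""
    return (alpha * bleu + beta * bleu_weighted
            + lam * bleu_keywords_ops + mu * sim_string_literal)

# Example: sub-metric scores for a candidate patch against one reference fix
score = elrm(0.42, 0.51, 0.38,
             string_literal_similarity('"id = ?"', '"id=?"'))
print(f"ELRM = {score:.3f}")
```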

Unlike AST- and data-flow-dependent metrics, ELRM is robust for short, unparsable code fragments and supports multiple reference fixes. Empirically, ELRM achieves the highest correlation with human and LLM judge ratings among all standard metrics:

| Metric | Pearson correlation with human ratings |
|---|---|
| BLEU (CodeBLEU) | ≤ 0.398 |
| CodeBLEU | ≤ 0.310 |
| ELRM | 0.66 – 0.86 |

4. Evaluation Regime: Four Complementary Dimensions

CFCEval's protocol scores LLM-generated security repair patches on four orthogonal axes:

  1. Programming Language Quality (PLanQul.): Detects syntactic/semantic ill-formedness; scored by GPT-based and reference-based criteria.
  2. Vulnerability-Fixing Capability (FixCap.): Verifies effective vulnerability remediation; dual GPT and exact-match scoring.
  3. Post-Transformation Fixing Capability (PTFixCap.): Measures fix resilience under input transformation, establishing robustness to codebase style/structure changes.
  4. Element-Level Relevance (EleReLv.): Quantifies semantic closeness to correct patches when full fixes are not realized; uses ELRM, BLEU, and CodeBLEU.
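
A schematic way to hold the four axis scores for one model is sketched below; the field names, value ranges, and the unweighted mean in the report are illustrative only, not part of CFCEval's released tooling.

```python
from dataclasses import dataclass

@dataclass
class CFCEvalScores:
    """Hypothetical per-model record of the four CFCEval axes (assumed in [0, 1])."""
    plan_qual: float    # Programming Language Quality (PLanQul.)
    fix_cap: float      # Vulnerability-Fixing Capability (FixCap.)
    pt_fix_cap: float   # Post-Transformation Fixing Capability (PTFixCap.)
    ele_relv: float     # Element-Level Relevance (EleReLv.), e.g. mean ELRM

def report(models: dict[str, CFCEvalScores]) -> None:
    """Print one line per model; the unweighted mean is illustrative only."""
    for name, s in models.items():
        axes = (s.plan_qual, s.fix_cap, s.pt_fix_cap, s.ele_relv)
        print(f"{name}: PLanQul={axes[0]:.2f} FixCap={axes[1]:.2f} "
              f"PTFixCap={axes[2]:.2f} EleReLv={axes[3]:.2f} "
              f"mean={sum(axes) / 4:.2f}")

report({"model-A": CFCEvalScores(0.91, 0.62, 0.55, 0.74)})
```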

5. Empirical Outcomes and Performance Characterization

CFCEval's experimental results show that:

  • The ELRM metric robustly discriminates between higher- and lower-quality LLMs (e.g., CodeGeeX4 vs. DeepSeek-Coder: t = 2.81, p < 0.01).
  • ELRM outperforms other automated metrics on correlation to both expert and LLM judgments in both fix and relevance settings, with human-aligned correlation coefficients exceeding 0.8 for the top-performing models.
  • When used as an end-to-end benchmark, CFCEval automatically classifies, scores, and compares LLMs such as Copilot and CodeGeeX4 across all four axes, establishing it as a viable multi-dimensional evaluation suite (Cheng et al., 6 Dec 2025).

6. CFCEval for Compiler-Based Security: Control Flow Integrity Taxonomy and Deployment Protocol

Houy et al. provide a comprehensive taxonomy of LLVM's forward-edge CFI mechanisms, establishing the connection between CFI configuration and the classes of memory corruption vulnerabilities mitigated (Houy et al., 21 Aug 2025). LLVM CFI instruments code with runtime checks guaranteeing that every indirect call or cast adheres to a set of legal targets determined statically by the compiler:

$$\forall\ \text{callsite}\ c:\ \mathrm{Sig}(\mathrm{target}(c)) = \mathrm{Sig}(c)$$

The taxonomy specifies the protection efficacy of each CFI variant (cfi-icall, cfi-vcall, cfi-nvcall, cfi-mfcall, cfi-derived-cast, cfi-unrelated-cast, cfi-cast-strict) against heap/stack overflow, use-after-free, and type confusion vulnerabilities. Empirical testing across four real-world CVEs demonstrates that only the icall and vcall variants robustly block heap/stack overflow exploitation, while type confusion and use-after-free (UAF) exploits evade detection when no protected call or cast site lies on the exploit path.
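
The following toy Python model illustrates the invariant above: an indirect call is permitted only if the runtime target's signature matches the signature the call site expects. This is purely conceptual; LLVM enforces the check with compile-time type identifiers and jump tables rather than runtime reflection, and it compares function types, not parameter names as this sketch does.

```python
import inspect
from typing import Callable

def cfi_guarded_call(callsite_sig: inspect.Signature, target: Callable, *args):
    """Conceptual forward-edge check: Sig(target(c)) == Sig(c)."""
    if inspect.signature(target) != callsite_sig:
        raise RuntimeError("CFI violation: target signature does not match call site")
    return target(*args)

def on_event(code: int) -> None:          # legitimate indirect-call target
    print("handled", code)

def gadget(a: int, b: int) -> None:       # hijacked target with a different signature
    print("attacker-controlled code")

site_sig = inspect.signature(on_event)    # signature the call site was "compiled" against
cfi_guarded_call(site_sig, on_event, 7)   # allowed: signatures match

try:
    cfi_guarded_call(site_sig, gadget, 7) # rejected before the call is made
except RuntimeError as err:
    print(err)
```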

| Vulnerability | Example CVE | Mitigated by CFI Variant |
|---|---|---|
| Heap overflow | CVE-2021-3156 | cfi-icall (✔) |
| Stack overflow | CVE-2023-49992 | cfi-icall (✔) |
| Use-after-free | CVE-2022-3666 | None |
| Type confusion | CVE-2024-34391 | None |

The protocol recommends a phased deployment: enable cfi-icall/vcall first, then progressively incorporate cast-focused variants for type-oriented attacks, balancing expanded coverage with compatibility and performance cost.

7. Impact, Limitations, and Future Directions

CFCEval in the LLM-evaluation context increases the granularity and behavioral relevance of code quality and security scoring, exposing failure modes in prior metrics and informing both LLM pretraining and patch-suggestion methodologies. The MLVBench/ELRM suite is extensible to new languages, vulnerability types, and adversarial transformations.

In the compiler security domain, the CFCEval protocol offers pragmatic, vulnerability-specific guidance for incremental CFI deployment, supporting empirical risk reduction without the overhead of maximalist policies. Remaining limitations include incomplete mitigation for type confusion and UAF attacks lacking protected sites, limited backward-edge coverage, and the need for integration with dynamic CFG and hardware techniques.

Proposed extensions for future work include expansion of benchmarks for comprehensive security coverage, integration of dynamic analysis into automated patch assessment, refinement of element-level metrics for collaborative repair, and research into context-sensitive or label-driven CFI schemes for higher-precision enforcement (Cheng et al., 6 Dec 2025, Houy et al., 21 Aug 2025).
