
Element-Level Relevance Metric

Updated 13 December 2025
  • ELRM is a fine-grained metric that evaluates semantic and syntactic relevance between generated code patches and multiple reference variants using refined tokenization.
  • It integrates BLEU submetrics and literal similarity measures to robustly handle code transformations such as renaming, reordering, and refactoring.
  • Empirical validation on datasets like MLVBench shows ELRM achieves significantly higher Pearson correlations with human judgments compared to traditional metrics like BLEU and CodeBLEU.

Element-Level Relevance Metric (ELRM) is a metric introduced to evaluate the fine-grained semantic and syntactic relevance between model-generated code patches and multiple reference code variants, particularly within the context of secure code-generation and vulnerability repair tasks. ELRM aims to address the limitations of prior code similarity metrics, such as BLEU and CodeBLEU, by providing improved sensitivity to short, fragmentary code and by supporting multiple references, refined tokenization, and lexical-semantic matching across several code element types (Cheng et al., 6 Dec 2025).

1. Motivation and Design Principles

The development of ELRM is motivated by several shortcomings in existing code-similarity evaluation metrics:

  • Traditional metrics, such as BLEU and CodeBLEU, are limited by imprecise tokenization (e.g., treating “x==1” as a single token), over-reliance on complete AST/data-flow analysis (which fails on short or synthetic code), and low reference diversity (typically one ground truth).
  • Vulnerability repair tasks often involve short or syntactically altered code fragments, where purely n-gram or tree-based comparators either fail to capture equivalence or over-penalize benign transformations.
  • A central design goal for ELRM is to reward semantically valid variations and penalize irrelevant or redundant code insertions, while supporting code snippets of arbitrary brevity, multiple correct references, and code transformations such as renaming or statement reordering.

2. Formal Definition and Subcomponent Formulation

Let $C_g$ denote a generated code patch, and $\{C_r^{(k)}\}_{k=1}^{K}$ the set of reference (secure) patches. ELRM computes four core sub-metrics:

  • $B$: BLEU on general n-grams.
  • $B_w$: BLEU on n-grams weighted by code keywords (following CodeBLEU).
  • $B_{ko}$: BLEU on the ordered sequence of language keywords/operators.
  • $S_\ell$: average literal similarity, aggregating Levenshtein, SequenceMatcher, and Jaccard similarities between the string literals in $C_g$ and those in each $C_r^{(k)}$.

The final score is a normalized, weighted sum:

$$\mathrm{ELRM}(C_g) = \alpha B + \beta B_w + \lambda B_{ko} + \mu S_\ell, \qquad \alpha + \beta + \lambda + \mu = 1$$

Empirical studies set $\alpha = 0.10$, $\beta = 0.05$, $\lambda = 0.80$, $\mu = 0.05$.
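Given the four sub-metric values, the aggregation is a direct weighted sum. The sketch below is a minimal illustration using the paper's weights; the function and parameter names are our own, not the authors' code.

```python
# Minimal sketch of the ELRM aggregation (function/parameter names are ours;
# the default weights follow the paper's empirical setting).
def elrm_score(b: float, b_w: float, b_ko: float, s_lit: float,
               alpha: float = 0.10, beta: float = 0.05,
               lam: float = 0.80, mu: float = 0.05) -> float:
    """Normalized weighted sum of the four ELRM sub-metrics."""
    assert abs(alpha + beta + lam + mu - 1.0) < 1e-9  # weights must sum to 1
    return alpha * b + beta * b_w + lam * b_ko + mu * s_lit
```

The dominant weight on $B_{ko}$ ($\lambda = 0.80$) is consistent with the ablation finding in Section 4 that the keyword/operator component carries most of the correlation with human judgments.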

The core sub-metrics are defined as follows:

  • Standard n-gram BLEU:

$$p_n = \frac{\sum_{k=1}^{K} \sum_{i=1}^{|C_g|} \mathrm{Count}_{\mathrm{clip}}\big(C_g(i, i+n)\big)}{\sum_{k=1}^{K} \sum_{i=1}^{|C_g|} \mathrm{Count}\big(C_g(i, i+n)\big)}$$

$$\mathrm{BP} = \begin{cases} 1, & c > r \\ \exp(1 - r/c), & c \le r \end{cases}$$

$$B = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad w_n = \tfrac{1}{N}$$

where $c = |C_g|$ and $r$ is the length of the reference closest in length to the candidate.
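For concreteness, the following is a compact multi-reference BLEU matching the formulas above (clipped n-gram precision plus brevity penalty). It is a sketch rather than the authors' implementation; $B$, $B_w$, and $B_{ko}$ can be viewed as this routine applied to different token streams and weightings.

```python
import math
from collections import Counter

def bleu(candidate: list[str], references: list[list[str]], max_n: int = 4) -> float:
    """Multi-reference BLEU: clipped n-gram precision with a brevity penalty."""
    if not candidate:
        return 0.0
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref: Counter = Counter()
        for ref in references:
            ref_counts = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            for ng, cnt in ref_counts.items():
                max_ref[ng] = max(max_ref[ng], cnt)
        clipped = sum(min(cnt, max_ref[ng]) for ng, cnt in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # no overlap at this order (smoothing omitted for brevity)
        log_p_sum += (1.0 / max_n) * math.log(clipped / total)  # w_n = 1/N
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_p_sum)
```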

  • Weighted BLEU ($B_w$) generalizes BLEU by introducing positional keyword-based weights $\mu_n^i$.
  • $B_{ko}$ computes BLEU on the token subsequence retaining only language keywords and operators.
  • $S_\ell$ computes the average maximum similarity over all literal pairs across Levenshtein distance, SequenceMatcher, and Jaccard index, yielding robust detection of literal renames (a combined sketch follows this list).
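One plausible realization of $S_\ell$, assuming the three string similarities are averaged per literal pair and each generated literal is matched to its best reference literal; the paper's exact aggregation may differ. `difflib.SequenceMatcher` is Python's standard-library sequence matcher.

```python
import difflib

def levenshtein_sim(a: str, b: str) -> float:
    """Normalized Levenshtein similarity via the classic DP recurrence."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def jaccard_sim(a: str, b: str) -> float:
    """Jaccard index over character sets (a token-set variant is also possible)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def literal_similarity(gen_literals: list[str], ref_literals: list[str]) -> float:
    """Average over generated literals of the best-matching reference literal,
    each pair scored as the mean of the three similarities (our assumption)."""
    if not gen_literals:
        return 1.0 if not ref_literals else 0.0
    def pair_sim(a: str, b: str) -> float:
        return (levenshtein_sim(a, b)
                + difflib.SequenceMatcher(None, a, b).ratio()
                + jaccard_sim(a, b)) / 3
    return sum(max((pair_sim(g, r) for r in ref_literals), default=0.0)
               for g in gen_literals) / len(gen_literals)
```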

Fine-grained tokenization is a requirement: ELRM splits code more aggressively than CodeBLEU, isolating identifiers, keywords, operators, and delimiters (e.g., “x==1” becomes [“x”, “==”, “1”]).
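A regex tokenizer in this spirit might look as follows; the operator inventory here is illustrative, not the paper's exact rule set.

```python
import re

# Fine-grained tokenization in the spirit of ELRM: identifiers, numeric
# literals, multi-character operators, and single delimiters become separate
# tokens. The operator list is a hypothetical subset for illustration.
TOKEN_RE = re.compile(r"""
    [A-Za-z_]\w*            # identifiers and keywords
  | \d+(?:\.\d+)?           # integer and float literals
  | ==|!=|<=|>=|&&|\|\||->  # common multi-character operators
  | \S                      # any other single non-space character
""", re.VERBOSE)

def tokenize(code: str) -> list[str]:
    return TOKEN_RE.findall(code)

assert tokenize("x==1") == ["x", "==", "1"]
```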

3. Treatment of Code Transformations

ELRM is specifically designed to handle a range of code transformations common in vulnerability repair benchmarks:

  • Identifier renaming: identifiers must match exactly in the general n-gram components, so renaming incurs only a partial penalty there, while the keyword/operator and literal-based sub-metrics are unaffected by identifier changes (demonstrated in the sketch after this list).
  • Reordering: N-gram BLEU and BLEU on keywords/operators support partial matches, so non-critical statement reordering only partially degrades the score, as opposed to structural AST-based metrics.
  • Structural refactoring: Changes such as loop transforms or branching reversals preserve most keywords/operators and string literals; ELRM assigns partial credit as appropriate.
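A toy demonstration of the renaming case, reusing the hypothetical `tokenize()` sketch from Section 2 with a deliberately small keyword/operator set:

```python
# Renaming an identifier leaves the keyword/operator subsequence untouched,
# so B_ko is unaffected even though general n-gram overlap drops.
# KEYWORDS_OPS is a toy set; a real implementation would use the language's
# full keyword and operator vocabulary.
KEYWORDS_OPS = {"if", "return", "==", ">", "(", ")", ":"}

def ko_sequence(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t in KEYWORDS_OPS]

ref = tokenize("if x == 1: return x")
gen = tokenize("if value == 1: return value")   # identifiers renamed
assert ko_sequence(ref) == ko_sequence(gen)     # B_ko sees identical sequences
```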

This design enables ELRM to robustly recognize semantic equivalence in transformed code variants, outperforming metrics with coarser or structurally brittle representations (Cheng et al., 6 Dec 2025).

4. Empirical Validation and Ablation Analysis

Validation of ELRM is performed on MLVBench, a dataset of 33 original vulnerabilities with up to 4 semantic-preserving transformations each (total ~139 variants) across Python, Java, C++, and Ruby. Multiple LLMs are evaluated, including Cursor (Gemini-2.5), GitHub Copilot (GPT-4), DeepSeek-Coder (14B local), and CodeGeeX4.

Correlation with human-annotated judgments is high: per-model Pearson correlation coefficients between ELRM and average human Likert ratings are 0.8281 (Cursor), 0.6681 (Copilot), 0.8601 (CodeGeeX), and 0.7991 (DeepSeek). By contrast, BLEU and CodeBLEU yield correlations no greater than 0.40 for any model.

Ablation studies confirm the critical role of the keyword/operator BLEU component (Bk ⁣oB_{k\!o}): removing it drops human-score correlation by approximately 0.12; removing the literal similarity term (SS_\ell) induces a 0.04 reduction. Tokenization granularity is also validated: reverting to the CodeBLEU tokenizer causes a mean ELRM reduction of ~8 points and a correlation drop of ~0.05 (Cheng et al., 6 Dec 2025).

5. Comparison with CodeBLEU

The fundamental distinction between ELRM and CodeBLEU is the replacement of AST and data-flow matching with (i) a dedicated keyword/operator BLEU, (ii) explicit string literal similarity, and (iii) refined tokenization. CodeBLEU’s AST and data-flow components are less reliable for short or fragmentary code. ELRM’s design thus prioritizes lexical and local syntactic relevance to maintain robustness and discriminative power on partial patches and diverse code transformations.

Quantitatively, ELRM shows superior alignment with both LLM-based and human-based evaluations. For example, the mean ELRM for CodeGeeX4 vs. DeepSeek-Coder is significantly differentiated (t=2.81, p=0.0056). ELRM’s Pearson r with LLM-based scores (0.816) also exceeds CodeBLEU’s (0.461), supporting its empirical effectiveness (Cheng et al., 6 Dec 2025).

6. Contextual Applications

ELRM is employed within the CFCEval evaluation framework for LLM-generated code, with relevance as one of four scoring axes (alongside programming quality, vulnerability-fixing capability, and post-transformation fixing capability). The metric is specifically tailored for settings where generated code may be fragmentary, undergoes semantic-preserving transformations, or requires robust discrimination of superficial vs. substantive differences.

In summary, ELRM constitutes a high-granularity, lexical and semantic relevance metric that supersedes prior code-similarity metrics in both methodological rigor and empirical alignment with expert assessments in code patch generation and vulnerability repair (Cheng et al., 6 Dec 2025).

