CodeBERTScore: NL→Code Evaluation Metric

Updated 20 May 2026
  • CodeBERTScore is an unsupervised evaluation metric that leverages code-aware Transformer models to compute token-level semantic similarity between generated code and references.
  • It aggregates precision and recall using inverse document frequency weighting and rescaling to produce F1/F3 scores that strongly correlate with human judgments and functional correctness.
  • The implementation builds on CodeBERT and supports multiple programming languages, offering a robust, open-source solution integrated with HuggingFace Transformers and PyTorch.

CodeBERTScore (CBS) is an unsupervised, embedding-based automatic evaluation metric for natural-language-to-code (NL→Code) generation tasks. It generalizes BERTScore to the code domain by leveraging code-aware Transformer models (such as CodeBERT) to produce contextualized embeddings for both generated code and references, conditioned on the natural language (NL) prompt. CBS computes pairwise token-level semantic similarity, enabling soft matching of syntactically and lexically divergent—but semantically equivalent—code fragments, while modeling consistency with the originating NL context. Correlations with human assessment and functional correctness demonstrate superior alignment compared to traditional metrics such as BLEU or METEOR (Zhou et al., 2023).

1. Underlying Methodology and Mathematical Formulation

CBS concatenates the NL prompt $x$ with the reference code $y^*$ and with the generated code $\hat{y}$, forming $S_\text{ref} = x \mathbin{\|} y^*$ and $S_\text{gen} = x \mathbin{\|} \hat{y}$, where "$\|$" denotes sequence concatenation. Each sequence is tokenized using the CodeBERT tokenizer $\mathcal{T}_\mathcal{B}$:

$$\mathcal{T}_\mathcal{B}(S_\text{ref}) = \langle x_1,\ldots,x_k, y^*_1,\ldots,y^*_m\rangle$$

$$\mathcal{T}_\mathcal{B}(S_\text{gen}) = \langle x_1,\ldots,x_k, \hat{y}_1,\ldots,\hat{y}_n\rangle$$

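After these tokens pass through a pretrained CodeBERT encoder $\mathcal{B}_\ell$ (with selected layer $\ell$), context (NL) and punctuation tokens are masked. This yields code token embeddings:

  • $\mathbf{y}^*_i$ for each unmasked reference token
  • $\hat{\mathbf{y}}_j$ for each unmasked candidate token

For each pair, the cosine similarity $\operatorname{sim}(\mathbf{y}^*_i, \hat{\mathbf{y}}_j)$ is computed:

$$\operatorname{sim}(\mathbf{y}^*_i, \hat{\mathbf{y}}_j) = \frac{\mathbf{y}^{*\top}_i \hat{\mathbf{y}}_j}{\lVert \mathbf{y}^*_i \rVert \, \lVert \hat{\mathbf{y}}_j \rVert}$$

The similarity matrix $\mathbf{S}$, with entries $S_{ij} = \operatorname{sim}(\mathbf{y}^*_i, \hat{\mathbf{y}}_j)$, forms the basis for subsequent aggregation.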
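The pairwise similarity step can be illustrated with a minimal pure-Python sketch. The toy 2-D vectors and the helper names `cosine` and `similarity_matrix` are hypothetical and chosen only for illustration; real CodeBERT embeddings are high-dimensional and come from the encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(ref_vecs, cand_vecs):
    """Return S with S[i][j] = cosine(ref_vecs[i], cand_vecs[j])."""
    return [[cosine(r, c) for c in cand_vecs] for r in ref_vecs]

# Toy "embeddings" (assumed values, not real encoder output).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

S = similarity_matrix(ref_vecs, cand_vecs)
for row in S:
    print([round(value, 3) for value in row])
```

Rows index reference tokens and columns index candidate tokens, so the matrix has one row per reference token regardless of candidate length.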

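CBS aggregates these similarities into precision and recall with inverse-document-frequency (idf) token weighting, matching each token greedily to its most similar counterpart:

$$P = \frac{\sum_{j} \operatorname{idf}(\hat{y}_j)\, \max_{i} S_{ij}}{\sum_{j} \operatorname{idf}(\hat{y}_j)}, \qquad R = \frac{\sum_{i} \operatorname{idf}(y^*_i)\, \max_{j} S_{ij}}{\sum_{i} \operatorname{idf}(y^*_i)}$$

The F1 score is their harmonic mean; optionally, $F_\beta$ (e.g., $\beta = 3$ for functional emphasis) can be used:

$$F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$$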
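The aggregation can be sketched as follows, assuming the BERTScore-style greedy matching in which each token is paired with its most similar counterpart. The function names, the toy similarity matrix, and the idf weights are hypothetical.

```python
def weighted_precision_recall(S, ref_idf, cand_idf):
    """S[i][j] is the similarity of reference token i and candidate token j."""
    # Precision: every candidate token takes its best-matching reference token.
    best_for_cand = [max(S[i][j] for i in range(len(S)))
                     for j in range(len(cand_idf))]
    precision = (sum(w * s for w, s in zip(cand_idf, best_for_cand))
                 / sum(cand_idf))
    # Recall: every reference token takes its best-matching candidate token.
    best_for_ref = [max(row) for row in S]
    recall = sum(w * s for w, s in zip(ref_idf, best_for_ref)) / sum(ref_idf)
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 gives recall more influence than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

# Toy inputs (assumed values): 2 reference tokens, 3 candidate tokens.
S = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
P, R = weighted_precision_recall(S, ref_idf=[1.0, 1.0], cand_idf=[1.0, 2.0, 1.0])
F1 = f_beta(P, R, beta=1.0)
F3 = f_beta(P, R, beta=3.0)
print(P, R, round(F1, 4), round(F3, 4))
```

In this toy case recall is perfect while precision is 0.75, so F3 lands closer to recall than F1 does, which is the behaviour the functional-correctness setting relies on.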

Scores are linearly rescaled toward the $[0, 1]$ range via baseline subtraction: $s' = (s - b)/(1 - b)$, with $b$ the mean score of random code pairs.
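A minimal sketch of baseline rescaling, assuming the BERTScore-style form $(s - b)/(1 - b)$. The baseline value 0.8 below is purely hypothetical; in practice it would be estimated from random code pairs for the chosen model and layer.

```python
def rescale(score, baseline):
    """Map `baseline` to 0 and a perfect score of 1 to 1 (linear rescaling)."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must lie in [0, 1)")
    return (score - baseline) / (1.0 - baseline)

BASELINE = 0.8  # hypothetical value, for illustration only
for raw in (0.7, 0.8, 0.9, 1.0):
    print(raw, round(rescale(raw, BASELINE), 3))
```

Scores below the baseline become negative after rescaling, so the output is not strictly bounded below by zero.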

2. Model Architecture and Implementation Specifics

CBS builds upon Microsoft's codebert-base model: a 12-layer Transformer with hidden size $768$. Five language-specific variants (Java, Python, C, C++, JavaScript) are trained by continued masked language modeling (MLM) on CodeParrot-filtered corpora ($\approx$115M GitHub files, 1M steps, batch size 32, learning rate 5e-5 decayed to 3e-5).

Embeddings are extracted from mid-to-high layers (roughly layers 7–10), because the layer that correlates best with ground truth varies by language. Token idf weights are estimated on held-out development sets per language. The CBS reference implementation uses HuggingFace Transformers and PyTorch, and is distributed with openly available models and code (Zhou et al., 2023).
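Estimating idf weights from a held-out set can be sketched as below. The plus-one smoothing and the whitespace-level toy tokens are assumptions made for illustration; the actual implementation works on CodeBERT subword tokens, and `estimate_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

def estimate_idf(tokenized_docs):
    """Return (idf_by_token, default_idf) using plus-one smoothing."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    idf = {tok: math.log((n_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}
    default_idf = math.log(n_docs + 1)  # weight for tokens never seen (df = 0)
    return idf, default_idf

# Toy held-out "development set" (assumed data).
dev_docs = [
    ["x", "=", "1"],
    ["y", "=", "x", "+", "2"],
    ["return", "y"],
]
idf, default_idf = estimate_idf(dev_docs)
print(round(idf["="], 4), round(idf["return"], 4), round(default_idf, 4))
```

Frequent tokens such as `=` receive a smaller weight than rarer tokens such as `return`, which is how idf weighting reduces the influence of trivial tokens.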

3. Evaluation Protocols and Comparative Results

Evaluation benchmarks:

  • Human preference: CoNaLa (472 NL→Python pairs, 5 model outputs per prompt, human rating 0–4).
  • Functional correctness: HumanEval (164 Python prompts, reference and test cases), translated to Java, C++, JavaScript, using Codex outputs with pass/fail test case supervision.

Correlation analysis employs:

  • Kendall's $\tau$ (within each prompt's 5 outputs)
  • Spearman $r_s$ and Pearson $r_p$ (global across all generations)
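The within-prompt ranking correlation can be sketched in pure Python. This is the simple tau-a variant, which does not correct for ties; a library routine such as `scipy.stats.kendalltau` (tau-b by default) would be the more usual choice when human ratings tie. Averaging per-prompt values is an assumption here, and the scores below are invented for illustration.

```python
from itertools import combinations

def kendall_tau_a(metric_scores, human_scores):
    """Kendall tau-a: (concordant - discordant) / total pairs, ignoring ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        sign = ((metric_scores[i] - metric_scores[j])
                * (human_scores[i] - human_scores[j]))
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / n_pairs

def mean_tau(per_prompt):
    """Average tau over prompts; each item is (metric_scores, human_scores)."""
    taus = [kendall_tau_a(m, h) for m, h in per_prompt]
    return sum(taus) / len(taus)

# One hypothetical prompt with 5 model outputs.
metric = [0.9, 0.7, 0.8, 0.4, 0.5]
human = [4, 3, 2, 1, 0]
tau = kendall_tau_a(metric, human)
print(tau)
```

Of the 10 output pairs in this toy prompt, 8 are ordered the same way by the metric and by the human ratings and 2 are not.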

Key outcomes (summarized for human preference and functional correctness):

Human preference (CoNaLa):

| Metric | CoNaLa Kendall $\tau$ | CoNaLa Spearman $r_s$ |
|---|---|---|
| BLEU | 0.374 | 0.543 |
| chrF | 0.470 | 0.623 |
| METEOR | 0.366 | 0.540 |
| CodeBERTScore | 0.517 | 0.662 |

Functional correctness (HumanEval), by language (Kendall $\tau$, Spearman $r_s$):

| Lang | BLEU $\tau$ | chrF $\tau$ | METEOR $\tau$ | CBS $\tau$ | CBS $r_s$ |
|---|---|---|---|---|---|
| Java | 0.481 | 0.532 | 0.558 | 0.553 | 0.369 |
| C++ | 0.112 | 0.319 | 0.301 | 0.327 | 0.393 |
| Python | 0.393 | 0.394 | 0.418 | 0.422 | 0.415 |
| JavaScript | 0.248 | 0.302 | 0.324 | 0.319 | 0.402 |

CBS shows consistent improvements in Kendall $\tau$ and Spearman $r_s$, exceeding BLEU by roughly 0.05–0.14 in correlation on human preference. On functional correctness it matches or outperforms the baselines: it has the highest $\tau$ on C++ and Python and is within 0.005 of METEOR on Java and JavaScript (Zhou et al., 2023).

4. Practical Usage and Software Integration

The CBS software is implemented in Python atop HuggingFace Transformers and PyTorch. Language-specific models are available on the HuggingFace Hub. A typical invocation passes the candidate and reference code, together with the target language, to the package's scoring function.

The package handles model loading, optimal layer selection, (NL∥code) tokenization, masking, pairwise similarity, idf weighting, and final aggregation. The authors recommend using language-specific pretrained models, extracting from intermediate Transformer layers (preferably 7–10), employing $F_1$ for human preference and $F_3$ for functional correctness, applying idf weighting, and enabling baseline rescaling for normalized outputs (Zhou et al., 2023).
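A hedged usage sketch follows. It assumes the open-source `code_bert_score` package is installed and that its `score(cands=..., refs=..., lang=...)` function returns a 4-tuple of (precision, recall, F1, F3), as described in the project's README; check the installed version before relying on this. The wrapper `score_snippets` and its `scorer` parameter are hypothetical, added so the scoring backend can be swapped out (for example, in tests without a GPU).

```python
def score_snippets(candidates, references, lang="python", scorer=None):
    """Score candidate snippets against references; returns a dict of lists."""
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    if scorer is None:
        # Assumption: the `code_bert_score` package is installed and exposes
        # `score(cands=..., refs=..., lang=...)`; the first call downloads
        # the language-specific model from the HuggingFace Hub.
        import code_bert_score
        scorer = code_bert_score.score
    precision, recall, f1, f3 = scorer(cands=candidates, refs=references, lang=lang)

    def to_list(values):
        # Tensors expose .tolist(); plain sequences are copied as lists.
        return values.tolist() if hasattr(values, "tolist") else list(values)

    return {
        "precision": to_list(precision),
        "recall": to_list(recall),
        "f1": to_list(f1),
        "f3": to_list(f3),
    }

# Example call (requires the real package and model download):
# scores = score_snippets(["return a + b"], ["return b + a"], lang="python")
```

Following the recommendations above, one would read the `f1` entry when comparing against human preference and the `f3` entry when functional correctness matters most.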

5. Limitations and Failure Modes

CBS requires GPU acceleration for efficient computation; pairwise similarity over BERT embeddings is more computationally expensive than traditional n-gram counting. The quality of CBS is tied to that of its encoder model; newer code LMs (such as CodeGen or SantaCoder) could further improve score calibration. Layer selection, idf weighting, the $F_\beta$ scoring variant, and the $\beta$ parameter must be tuned per language and task.

Failure modes include:

  • Very short predictions (1–2 tokens): susceptible to idf or similarity noise.
  • Obfuscated or anomalous variable names: may yield unpredictable soft matches if substantially out of training distribution.

This suggests careful dataset preprocessing and per-language hyperparameter tuning are necessary for optimal results.

6. Recommendations and Comparative Advantages

Empirical findings indicate that using language-specific models confers a gain of roughly 1–2 points in correlation over the generic CodeBERT-base. Extracting embeddings from mid-to-high layers achieves stronger correlation with reference-based and functional assessments than final-layer features. Idf weighting downweights trivial tokens (such as assignment or punctuation), and $F_3$ is advised where recall alignment (e.g., for correctness tests) is paramount.

In sum, CBS provides an effective drop-in replacement for BLEU/ROUGE/METEOR in code generation evaluation, delivering improved agreement with human judgments and test-case correctness. Its open-source implementation, together with pretrained language-targeted encoders, underpins a readily adoptable evaluation solution for NL→code systems (Zhou et al., 2023).

References (1)