CodeBERTScore: NL→Code Evaluation Metric
- CodeBERTScore is an unsupervised evaluation metric that leverages code-aware Transformer models to compute token-level semantic similarity between generated code and references.
- It aggregates precision and recall using inverse document frequency weighting and rescaling to produce F1/F3 scores that strongly correlate with human judgments and functional correctness.
- The implementation builds on CodeBERT and supports multiple programming languages, offering a robust, open-source solution integrated with HuggingFace Transformers and PyTorch.
CodeBERTScore (CBS) is an unsupervised, embedding-based automatic evaluation metric for natural-language-to-code (NL→Code) generation tasks. It generalizes BERTScore to the code domain by leveraging code-aware Transformer models (such as CodeBERT) to produce contextualized embeddings for both generated code and references, conditioned on the natural language (NL) prompt. CBS computes pairwise token-level semantic similarity, enabling soft matching of syntactically and lexically divergent—but semantically equivalent—code fragments, while modeling consistency with the originating NL context. Correlations with human assessment and functional correctness demonstrate superior alignment compared to traditional metrics such as BLEU or METEOR (Zhou et al., 2023).
1. Underlying Methodology and Mathematical Formulation
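CBS operates by concatenating the NL prompt $x$ with the reference code $y^*$ and the generated code $\hat{y}$, forming $x \cdot y^*$ and $x \cdot \hat{y}$, where “$\cdot$” denotes sequence concatenation. Each sequence is tokenized using the CodeBERT tokenizer $T_{\mathcal{B}}$:
$$T_{\mathcal{B}}(x \cdot y^*) = \langle x_1, \dots, x_k, y^*_1, \dots, y^*_m \rangle, \qquad T_{\mathcal{B}}(x \cdot \hat{y}) = \langle x_1, \dots, x_k, \hat{y}_1, \dots, \hat{y}_n \rangle$$
After passing these tokens through a pretrained CodeBERT encoder $\mathcal{B}$ (at a selected layer $\ell$), context (NL) and punctuation tokens are masked. This yields code token embeddings:
- $\{\mathbf{y}^*_i\}_{i \in I^*}$ from reference
- $\{\hat{\mathbf{y}}_j\}_{j \in \hat{I}}$ from candidate

Here $I^*$ and $\hat{I}$ index the unmasked code tokens of the reference and the candidate.
For each pair $(i, j)$, cosine similarity $S_{ij}$ is computed:
$$S_{ij} = \mathrm{sim}(\mathbf{y}^*_i, \hat{\mathbf{y}}_j) = \frac{{\mathbf{y}^*_i}^{\top} \hat{\mathbf{y}}_j}{\lVert \mathbf{y}^*_i \rVert \, \lVert \hat{\mathbf{y}}_j \rVert}$$
The similarity matrix $S$ forms the basis for subsequent aggregation.
CBS aggregates these similarities into precision and recall with inverse-document-frequency (idf) token weighting:
$$P = \frac{\sum_{j \in \hat{I}} \mathrm{idf}(\hat{y}_j) \, \max_{i \in I^*} S_{ij}}{\sum_{j \in \hat{I}} \mathrm{idf}(\hat{y}_j)}, \qquad R = \frac{\sum_{i \in I^*} \mathrm{idf}(y^*_i) \, \max_{j \in \hat{I}} S_{ij}}{\sum_{i \in I^*} \mathrm{idf}(y^*_i)}$$
The F1 score is their harmonic mean, $F_1 = 2PR / (P + R)$; optionally, $F_\beta$ (e.g., $\beta = 3$ for functional emphasis) can be used:
$$F_\beta = \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R}$$
Scores are linearly rescaled into roughly $[0, 1]$ via baseline subtraction: $\hat{s} = (s - b) / (1 - b)$, with $b$ the mean score of random code pairs.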
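The matching and aggregation steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names (`cosine`, `cbs_scores`, `f_beta`, `rescale`) are hypothetical, the 2-d vectors are toy stand-ins for real CodeBERT embeddings, and the inputs are assumed to already exclude masked NL and punctuation tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cbs_scores(ref_vecs, cand_vecs, ref_idf=None, cand_idf=None):
    """Greedy-matching precision and recall over unmasked code-token embeddings.

    ref_idf / cand_idf are optional per-token weights (uniform if omitted).
    """
    ref_idf = ref_idf or [1.0] * len(ref_vecs)
    cand_idf = cand_idf or [1.0] * len(cand_vecs)
    # sim[i][j]: similarity between reference token i and candidate token j.
    sim = [[cosine(r, c) for c in cand_vecs] for r in ref_vecs]
    # Precision: every candidate token is matched to its best reference token.
    precision = sum(
        w * max(row[j] for row in sim) for j, w in enumerate(cand_idf)
    ) / sum(cand_idf)
    # Recall: every reference token is matched to its best candidate token.
    recall = sum(w * max(sim[i]) for i, w in enumerate(ref_idf)) / sum(ref_idf)
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favors recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def rescale(score, baseline):
    """Linear baseline rescaling: (score - baseline) / (1 - baseline)."""
    return (score - baseline) / (1.0 - baseline)


# Toy example (hypothetical values): the candidate covers one of two reference tokens.
ref = [[1.0, 0.0], [0.0, 1.0]]
cand = [[1.0, 0.0]]
p, r = cbs_scores(ref, cand)
f1, f3 = f_beta(p, r, beta=1.0), f_beta(p, r, beta=3.0)
```

In this toy case precision is perfect while recall is halved, so the recall-weighted score with `beta=3.0` falls below the balanced one, which is the behavior a recall-emphasizing variant is meant to have.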
2. Model Architecture and Implementation Specifics
CBS builds upon Microsoft's CodeBERT-base model: a 12-layer Transformer with hidden size 768. Five language-specific variants (Java, Python, C, C++, JavaScript) are trained by continued masked language modeling (MLM) on CodeParrot-filtered corpora (~115M GitHub files, 1M steps, batch size 32, learning rate from 5e-5 to 3e-5).
Embedding extraction is conducted at mid-to-high layers (e.g., layers 7–10), as the layer with the best correlation with ground truth varies by language. Token idf weights are estimated on held-out development sets per language. The CBS reference implementation uses HuggingFace Transformers and PyTorch, and is distributed with openly available models and code (Zhou et al., 2023).
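To make the idf estimation step concrete, the sketch below computes token weights from a small tokenized development set. It assumes the plus-one smoothed form $\mathrm{idf}(w) = \log\frac{M + 1}{\mathrm{df}(w) + 1}$, which is one common choice and not necessarily the exact formula used by CBS; the function names and the three-snippet corpus are hypothetical.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Plus-one smoothed idf per token: log((M + 1) / (df + 1)).

    tokenized_docs: list of token lists, one per development-set snippet.
    The smoothing choice is an assumption made for this sketch.
    """
    num_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per snippet
    return {
        tok: math.log((num_docs + 1) / (count + 1))
        for tok, count in doc_freq.items()
    }


def idf_lookup(weights, token, num_docs):
    """Tokens unseen in the development set get the maximum weight."""
    return weights.get(token, math.log(num_docs + 1))


# Hypothetical, already-tokenized development snippets.
dev_set = [
    ["x", "=", "1"],
    ["y", "=", "foo", "(", ")"],
    ["return", "x"],
]
weights = idf_weights(dev_set)
```

Frequent tokens such as `=` receive a smaller weight than rarer identifiers such as `foo`, so matches on rare tokens contribute more to the weighted precision and recall.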
3. Evaluation Protocols and Comparative Results
Evaluation benchmarks:
- Human preference: CoNaLa (472 NL→Python pairs, 5 model outputs per prompt, human rating 0–4).
- Functional correctness: HumanEval (164 Python prompts, reference and test cases), translated to Java, C++, JavaScript, using Codex outputs with pass/fail test case supervision.
Correlation analysis employs:
- Kendall’s $\tau$ (within each prompt’s 5 outputs)
- Spearman $r_s$ and Pearson $r_p$ (global across all generations)
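The correlation protocol above can be illustrated with a small standard-library sketch: Kendall's coefficient is computed within each prompt's group of outputs and then averaged, while Spearman and Pearson are computed over all outputs pooled together. The tau-b variant (which handles tied ratings) and the averaging over prompts are assumptions of this sketch, as are all function names and the toy scores.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_b(xs, ys):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def mean_kendall_per_prompt(groups):
    """Average tau-b over prompts; groups is a list of (metric, human) pairs."""
    taus = [kendall_tau_b(metric, human) for metric, human in groups]
    return sum(taus) / len(taus)


# Hypothetical metric scores and 0-4 human ratings for two prompts.
groups = [
    ([0.91, 0.72, 0.65], [4, 3, 1]),
    ([0.80, 0.85, 0.40], [2, 3, 0]),
]
all_metric = [s for metric, _ in groups for s in metric]
all_human = [h for _, human in groups for h in human]
tau = mean_kendall_per_prompt(groups)
rs, rp = spearman(all_metric, all_human), pearson(all_metric, all_human)
```

Computing Kendall's coefficient within prompts compares only outputs that answer the same question, whereas the pooled Spearman and Pearson values also reflect differences in difficulty between prompts.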
Key outcomes (summarized for human preference and functional correctness):
| Metric | CoNaLa $\tau$ | CoNaLa $r_s$ |
|---|---|---|
| BLEU | 0.374 | 0.543 |
| chrF | 0.470 | 0.623 |
| METEOR | 0.366 | 0.540 |
| CodeBERTScore | 0.517 | 0.662 |
Functional correctness (HumanEval):
| Lang | BLEU $\tau$ | chrF $\tau$ | METEOR $\tau$ | CBS $\tau$ | CBS $r_s$ |
|---|---|---|---|---|---|
| Java | 0.481 | 0.532 | 0.558 | 0.553 | 0.369 |
| C++ | 0.112 | 0.319 | 0.301 | 0.327 | 0.393 |
| Python | 0.393 | 0.394 | 0.418 | 0.422 | 0.415 |
| JavaScript | 0.248 | 0.302 | 0.324 | 0.319 | 0.402 |
CBS shows consistent improvements in $\tau$ and $r_s$, exceeding BLEU by roughly 0.05–0.14 on human preference and closely matching or outperforming the listed baselines on functional correctness (Zhou et al., 2023).
4. Practical Usage and Software Integration
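The CBS software is implemented in Python atop HuggingFace Transformers and PyTorch. Language-specific models are available on the HuggingFace Hub. A typical invocation passes the candidate code, the reference code, and the target language to the package's scoring function.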
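As a hedged illustration of such an invocation, the wrapper below defers to the released `code_bert_score` package. The call shape (`score` with `cands`, `refs`, `lang`, and an optional `sources` argument for the NL prompts, returning precision, recall, F1, and F3 values) is an assumption based on the package's public README and should be checked against the installed version; the wrapper name `score_snippets` is hypothetical.

```python
def score_snippets(candidates, references, lang="python", sources=None):
    """Score candidate code against references with the code_bert_score package.

    Assumed interface (verify against the package README):
    code_bert_score.score(cands=..., refs=..., lang=..., sources=...)
    returning a tuple of (precision, recall, F1, F3) scores.
    """
    # Imported lazily so this sketch loads even where the package is absent.
    import code_bert_score

    kwargs = {"cands": candidates, "refs": references, "lang": lang}
    if sources is not None:
        # NL prompts, encoded as context but excluded from the matching.
        kwargs["sources"] = sources
    return code_bert_score.score(**kwargs)
```

A call such as `score_snippets(["x = x + 1"], ["x += 1"])` would download the language-specific model on first use, so it is best run on a machine with network access and, ideally, a GPU.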
The package handles model loading, optimal layer selection, (NL∥code) tokenization, masking, pairwise similarity, idf weighting, and final aggregation. Recommended settings are to use language-specific pretrained models, extract embeddings from intermediate Transformer layers (preferably 7–10), employ $F_1$ for human preference and $F_3$ for functional correctness, apply idf weighting, and enable baseline rescaling for normalized outputs (Zhou et al., 2023).
5. Limitations and Failure Modes
CBS requires GPU acceleration for efficient computation; pairwise BERT embedding similarity is more computationally expensive than traditional n-gram counting. The quality of CBS is tied to that of its encoder model; other code LMs (such as CodeGen or SantaCoder) could further improve score calibration. Layer selection, idf weighting, the scoring variant ($F_1$ vs. $F_\beta$), and the $\beta$ parameter must be tuned per language and task.
Failure modes include:
- Very short predictions (1–2 tokens): susceptible to idf or similarity noise.
- Obfuscated or anomalous variable names: may yield unpredictable soft matches if substantially out of training distribution.
These failure modes suggest that careful dataset preprocessing and per-language hyperparameter tuning are necessary for optimal results.
6. Recommendations and Comparative Advantages
Empirical findings indicate that using language-specific models confers a roughly 1–2 point gain in correlation over the generic CodeBERT-base. Extracting embeddings from mid-to-high layers achieves stronger correlation with reference-based and functional assessments than final-layer features. Idf weighting downweights trivial tokens (such as assignment or punctuation), and $F_3$ is advised where recall alignment (e.g., for correctness tests) is paramount.
In sum, CBS provides an effective drop-in replacement for BLEU/ROUGE/METEOR in code generation evaluation, delivering improved agreement with human judgments and test-case correctness. Its open-source implementation, together with pretrained language-targeted encoders, underpins a readily adoptable evaluation solution for NL→code systems (Zhou et al., 2023).