
BERTScore: Semantic Text Evaluation

Updated 31 October 2025
  • BERTScore is an evaluation metric that computes semantic similarity between candidate and reference texts using context-aware embeddings from transformer models.
  • It outperforms traditional metrics like BLEU and ROUGE by aligning deep semantic representations and achieving strong correlation with human judgments.
  • Practical deployment requires careful model and layer selection, with extensions available for multilingual tasks and audio/speech evaluation.

BERTScore is an automatic reference-based evaluation metric for text generation that computes semantic similarity by aligning candidate and reference sentences in contextual embedding space. Unlike traditional surface-form metrics such as BLEU, ROUGE, or METEOR, BERTScore leverages pretrained transformer-based models (e.g., BERT, RoBERTa, XLNet, XLM) to produce context-aware token embeddings, enabling measurement of deep semantic equivalence between generated text and human references.

1. Mathematical Definition and Computational Workflow

Let the reference sentence be x = \langle x_1, \dots, x_k \rangle and the candidate sentence be \hat{x} = \langle \hat{x}_1, \dots, \hat{x}_l \rangle. BERTScore operates as follows (a minimal implementation sketch appears after the numbered steps):

  1. Token Embedding Extraction: Each token in both sentences is passed through a pretrained contextual embedding model, yielding vectors \mathbf{x}_i and \hat{\mathbf{x}}_j for i = 1, \dots, k and j = 1, \dots, l, respectively. These vectors are context-sensitive and reflect the semantics of a token in its sentential context.
  2. Similarity Computation: Cosine similarity is computed for every reference-candidate token pair; with embeddings pre-normalized to unit length, this reduces to the inner product:

\text{sim}_{i,j} = \mathbf{x}_i^\top \hat{\mathbf{x}}_j

  3. Greedy Matching:

    • Recall:

    R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j

    • Precision:

    P = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^\top \hat{\mathbf{x}}_j

    • F1 Score:

    F = 2 \frac{P \cdot R}{P + R}

  4. (Optional) Importance Weighting: Incorporates inverse document frequency (idf) so that rarer words contribute more to the score:

\text{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^M \mathbb{I}[w \in x^{(i)}]

where M is the number of reference sentences x^{(1)}, \dots, x^{(M)} in the evaluation corpus and \mathbb{I}[\cdot] is the indicator function.

Weighted recall, for example:

R = \frac{\sum_{x_i \in x} \text{idf}(x_i) \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j}{\sum_{x_i \in x} \text{idf}(x_i)}

  5. Baseline Rescaling: Empirical baseline scores b are computed on random sentence pairs and used for linear normalization, e.g. \hat{R} = \frac{R - b}{1 - b}, mapping typical scores to the [0, 1] range for interpretability.
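The sketch below is a minimal, from-scratch illustration of steps 1-3, not the reference bert-score implementation. It assumes the Hugging Face transformers and torch packages; the model name, layer index, and function names are illustrative choices, and special-token handling, idf weighting, and baseline rescaling are omitted for brevity.

```python
# Minimal sketch of BERTScore's core computation (steps 1-3 above).
# Assumptions: Hugging Face `transformers` + `torch`; roberta-large with an
# intermediate layer as the embedding model; special tokens are not stripped
# and no idf weighting or baseline rescaling is applied.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "roberta-large"  # assumed checkpoint; any contextual encoder works
LAYER = 17                    # assumed intermediate layer; tune on validation data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def embed(sentence: str) -> torch.Tensor:
    """Contextual token embeddings, L2-normalized so dot product = cosine similarity."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[LAYER][0]   # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore(candidate: str, reference: str):
    """Greedy-matching precision, recall, and F1 in embedding space."""
    c, r = embed(candidate), embed(reference)
    sim = r @ c.T                                # (ref_tokens, cand_tokens)
    recall = sim.max(dim=1).values.mean()        # best candidate match per reference token
    precision = sim.max(dim=0).values.mean()     # best reference match per candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()

print(bertscore("the weather was chilly the day before", "it was cold outside yesterday"))
```

A full implementation would additionally exclude special tokens and support the idf weighting of step 4 and the baseline rescaling of step 5.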

2. Rationale: Semantic, Contextual, and LLM Foundations

Legacy metrics (BLEU, ROUGE) focus on n-gram overlap, directly penalizing surface-form variation. They miss semantic equivalence in paraphrasing (e.g., “like foreign cars” versus “prefer imported cars”) and are sensitive to word order. BERTScore circumvents these limitations by comparing context-sensitive embeddings: each token's vector represents its meaning conditioned on sentence context, allowing for detection of paraphrases, synonyms, and reordered constructions. The metric is grounded in transformer architectures pretrained on large corpora, enabling cross-lingual generality and semantic robustness.
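As a concrete illustration of this contrast, the short sketch below compares raw token overlap with BERTScore F1 on the paraphrase example above, lightly extended into full sentences. It assumes the pip-installable bert_score package; variable names are illustrative.

```python
# Paraphrase example from the text: almost no exact token overlap,
# yet the meaning is preserved. Assumes `pip install bert-score`.
from bert_score import score

reference = ["people like foreign cars"]
candidate = ["people prefer imported cars"]

# Surface overlap: only "people" and "cars" match exactly.
shared = set(reference[0].split()) & set(candidate[0].split())
print("exact-match tokens:", shared)

# Contextual-embedding similarity recognizes the paraphrase.
P, R, F1 = score(candidate, reference, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```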

3. Empirical Validation and Tasks

BERTScore was evaluated on outputs from 363 machine translation and image captioning systems, including WMT16/17/18 datasets and COCO 2015. Results demonstrated:

  • Strong correlation with human judgments: At both system and segment levels, BERTScore F1 exhibited Pearson and Kendall correlations higher than or comparable to those of BLEU, METEOR, CHRF, and supervised RUSE across languages.
  • Model selection: Achieved higher Hits@1 (correctly ranking the best system) than competing metrics.
  • Image captioning: Outperformed BLEU, CIDEr, and METEOR; matched or exceeded SPICE in human correlation.
  • Paraphrase robustness: Maintained discriminative capacity in adversarial paraphrase detection tasks on PAWS and QQP datasets, exhibiting minimal degradation versus the marked drop in n-gram metrics.
  • Processing efficiency: Evaluated typical test sets (roughly 3,000 sentence pairs) in under 16 seconds on a GPU.

4. Distinctions from Conventional Metrics

  • Semantic Matching: Relies entirely on embedding similarity rather than exact token overlap.
  • Context-Sensitivity: Uses representations derived from full sentence context.
  • No External Lexical Resources: Avoids dependence on curated synonym lists.
  • Differentiable and Compatible with ML Objectives: The scoring mechanism is differentiable and amenable to integration within training regimes.
  • Applicability and Generality: Pretrained transformer models are available for over 100 languages.
  • Layer Selection: Empirically, intermediate transformer layers yield optimal semantic similarity; cross-validation or validation-based selection is recommended.
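As a practical illustration of model and layer selection, the following sketch assumes the pip-installable bert_score package (keyword arguments should be checked against the installed release); the multilingual checkpoint and layer index shown are illustrative, not prescriptive.

```python
# Usage sketch: choosing the embedding model and layer.
# Assumes `pip install bert-score`; arguments reflect that package's score() API.
from bert_score import score

cands = ["the cat sat on the mat"]
refs  = ["a cat was sitting on the mat"]

# Default English configuration (24-layer RoBERTa, intermediate layer).
P, R, F1 = score(cands, refs, lang="en")

# Explicit choice, e.g. a multilingual encoder with a validation-selected layer.
P, R, F1 = score(
    cands, refs,
    model_type="bert-base-multilingual-cased",  # assumed example checkpoint
    num_layers=9,                               # assumed; select via validation
)
print(F1.mean().item())
```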

5. Limitations and Considerations

  • No singular optimal configuration: Performance depends on selection of model architecture, pretraining language, and embedding layer.
  • Named entities, factual errors: Like other automatic metrics, BERTScore is less sensitive to factual discrepancies, especially those the surrounding context does not expose (e.g., “German” vs. “English”, numeric errors).
  • Low-resource language instability: Multilingual BERT models exhibit less stability for low-resource languages.
  • Model dependence: Metric results are sensitive to the specific pretrained transformer used.
  • Surface-form subtlety: Some minimal semantic mismatches, named entity confusions, or unit conversion errors may elude detection.
  • Domain adaptation: For English, 24-layer RoBERTa is recommended; for other languages, use an appropriate multilingual transformer or domain-specific variant.
  • BERTScore variants: Importance weighting, metric adaptation, and fine-tuning can be effective for task-specific requirements (a usage sketch follows the table below).
  • Recent extensions: Metrics such as KG-BERTScore (Wu et al., 2023), SpeechBERTScore (Saeki et al., 30 Jan 2024), and AudioBERTScore (Kishi et al., 1 Jul 2025) have generalized the approach to reference-free MT evaluation, speech, and audio domains, respectively, incorporating knowledge graphs, dense speech features, and non-local similarity aggregation.
  • Summary formula (editor's collation):
Variant | Recall | Precision | F1
BERTScore (core) | R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j | P = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^\top \hat{\mathbf{x}}_j | F = 2 \frac{P \cdot R}{P + R}
Weighted (idf) | R = \frac{\sum_{x_i \in x} \text{idf}(x_i) \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^\top \hat{\mathbf{x}}_j}{\sum_{x_i \in x} \text{idf}(x_i)} | defined analogously over candidate tokens | F = 2 \frac{P \cdot R}{P + R}
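A brief sketch of the importance-weighting and baseline-rescaling variants referenced above, again assuming the bert_score package's score() interface (flags should be verified against the installed version):

```python
# Sketch of the idf-weighted and baseline-rescaled variants.
# Assumes `pip install bert-score`; idf statistics are computed from `refs`.
from bert_score import score

refs  = ["the report was published in October", "sales rose by ten percent"]
cands = ["the report came out in October", "sales increased ten percent"]

P, R, F1 = score(
    cands, refs,
    lang="en",
    idf=True,                    # rarer tokens contribute more, as in step 4
    rescale_with_baseline=True,  # linear rescaling against random-pair baselines (step 5)
)
for c, f in zip(cands, F1.tolist()):
    print(f"{f:.3f}  {c}")
```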

6. Significance and Prospective Directions

BERTScore offers a robust, semantically informed measure for text generation evaluation, demonstrating consistent alignment with human judgment across tasks, languages, and domains. Its architecture and mathematical foundation support extension to audio, speech, and knowledge-graph modalities. Its rich, context-sensitive representations make it robust to paraphrasing and lexical diversity, marking a shift from form-centric to meaning-centric evaluation paradigms. Selecting appropriate model layers and configurations is essential for optimal deployment; ongoing research continues to address factuality, rare-token matching, and cross-domain generalization (Zhang et al., 2019).
