COMET: Crosslingual Translation Evaluation
- COMET is a family of neural, reference-based metrics designed for robust evaluation of machine translation quality across diverse language pairs.
- It combines state-of-the-art crosslingual Transformer encoders with human judgment data, using regression and ranking heads to achieve high agreement with human assessments.
- Extensions include uncertainty quantification, hybrid lexical–neural scoring, enhanced support for under-resourced languages, and efficient model compression techniques for scalable deployment.
Crosslingual Optimized Metric for Evaluation of Translation (COMET) is a family of neural, reference-based evaluation metrics for machine translation (MT), designed to robustly assess translation quality across many language pairs. Built on top of state-of-the-art cross-lingual Transformer encoders and trained on a variety of human assessment data, COMET achieves high correlation with segment- and system-level human judgements, outperforming traditional string-based metrics and untrained embedding-based metrics in both overall accuracy and robustness. The metric family continues to evolve, with variants for uncertainty quantification, enhanced support for under-resourced languages, hybrid lexical–neural scoring, and model compression for efficient deployment.
1. Model Architecture and Scoring Workflow
The original COMET framework utilizes a pretrained cross-lingual Transformer encoder, e.g., XLM-RoBERTa (Rei et al., 2020). The key pipeline components are as follows:
- Inputs: Up to three segments—the source sentence $s$, the MT hypothesis $h$, and a human reference $r$.
- Encoding: Each segment is independently processed through the encoder. Layer-wise pooling is applied with trainable weights, aggregating token representations across layers through a softmax-weighted sum, followed by average pooling over tokens to yield $d$-dimensional sentence embeddings $\mathbf{s}$, $\mathbf{h}$, and $\mathbf{r}$ (a minimal pooling sketch follows this list).
- Estimator Head (Regression): The hypothesis and reference embeddings, together with element-wise products and absolute differences ($\mathbf{h}$, $\mathbf{r}$, $\mathbf{h}\odot\mathbf{s}$, $\mathbf{h}\odot\mathbf{r}$, $|\mathbf{h}-\mathbf{s}|$, $|\mathbf{h}-\mathbf{r}|$), are concatenated into a $6d$-dimensional feature vector. This vector passes through a two-layer feed-forward network with tanh activations and dropout, outputting a scalar segment score $\hat{y}$.
- Ranking Head (Triplet Margin): For paired comparisons using Direct Assessment Relative Ranking (DARR), embeddings are computed for the source $s$, a higher-ranked hypothesis $h^+$, a lower-ranked hypothesis $h^-$, and the reference $r$. A triplet margin loss encourages $\mathbf{h}^+$ (the better-judged translation) to be closer than $\mathbf{h}^-$ to both $\mathbf{s}$ and $\mathbf{r}$ in embedding space.
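The layer-wise pooling step can be illustrated with a minimal PyTorch sketch. The class name, tensor layout, and simple softmax-weighted mixing below are illustrative assumptions, not the exact implementation in the COMET codebase:

```python
import torch
import torch.nn as nn

class LayerwisePooling(nn.Module):
    """Softmax-weighted sum over encoder layers, then masked average pooling over tokens.

    A minimal sketch of the pooling described above; the actual COMET implementation
    (built on XLM-RoBERTa) differs in details such as scaling and layer selection.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One trainable scalar weight per encoder layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, d); attention_mask: (batch, seq_len)
        weights = torch.softmax(self.layer_weights, dim=0)                 # (num_layers,)
        mixed = (weights[:, None, None, None] * hidden_states).sum(0)     # (batch, seq_len, d)
        mask = attention_mask.unsqueeze(-1).float()                        # (batch, seq_len, 1)
        # Masked average pooling over tokens -> (batch, d) sentence embedding.
        return (mixed * mask).sum(1) / mask.sum(1).clamp(min=1.0)
```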
Concrete architecture and scoring differences are indicated in the table below:
| Component | Estimator Head | Ranking Head |
|---|---|---|
| Inputs | $s$, $h$, $r$ | $s$, $h^+$, $h^-$, $r$ |
| Output | Scalar score $\hat{y}$ | Margin between pairs |
| Loss | MSE | Triplet margin |
| Inference | Score in $[0,1]$ | Harmonic-mean distance-based score |
The core COMET scoring function for the estimator head combines the pooled embeddings and passes them through the feed-forward regressor:

$$\mathbf{x} = \left[\mathbf{h};\; \mathbf{r};\; \mathbf{h}\odot\mathbf{s};\; \mathbf{h}\odot\mathbf{r};\; |\mathbf{h}-\mathbf{s}|;\; |\mathbf{h}-\mathbf{r}|\right], \qquad \hat{y} = \mathrm{FFN}(\mathbf{x}).$$
For ranking head inference (single hypothesis $\hat{h}$ with embedding $\mathbf{h}$), the score is derived from the harmonic mean of the distances to the source and reference embeddings:

$$\hat{f}(s, \hat{h}, r) = \frac{1}{1 + \dfrac{2\, d(\mathbf{s},\mathbf{h})\, d(\mathbf{r},\mathbf{h})}{d(\mathbf{s},\mathbf{h}) + d(\mathbf{r},\mathbf{h})}},$$

where $d(\cdot,\cdot)$ denotes Euclidean distance.
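As a concrete illustration, the following sketch computes both scores from already-pooled sentence embeddings; the hidden sizes and helper names are assumptions chosen for readability, not the published hyperparameters:

```python
import torch
import torch.nn as nn

class EstimatorHead(nn.Module):
    """Regression head: combine pooled embeddings and predict a scalar segment score."""

    def __init__(self, d: int, hidden: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(6 * d, hidden), nn.Tanh(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden // 2), nn.Tanh(), nn.Dropout(dropout),
            nn.Linear(hidden // 2, 1),
        )

    def forward(self, s: torch.Tensor, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # Concatenate [h; r; h*s; h*r; |h-s|; |h-r|] -> (batch, 6d) feature vector.
        x = torch.cat([h, r, h * s, h * r, (h - s).abs(), (h - r).abs()], dim=-1)
        return self.ffn(x).squeeze(-1)  # (batch,) scalar scores

def ranking_inference_score(s: torch.Tensor, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Harmonic-mean distance score for the ranking head, bounded to (0, 1]."""
    d_src = torch.linalg.vector_norm(h - s, dim=-1)   # Euclidean distance to source
    d_ref = torch.linalg.vector_norm(h - r, dim=-1)   # Euclidean distance to reference
    harmonic = 2 * d_src * d_ref / (d_src + d_ref + 1e-8)
    return 1.0 / (1.0 + harmonic)
```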
2. Training Data, Objectives, and Optimization
COMET trains on large, heterogeneous datasets corresponding to different human judgement schemes, with separate models for each:
- QT21 HTER (≈173K tuples): Regression to Human-mediated Translation Edit Rate.
- WMT DARR (2017–2019, ≈24 language pairs): Ranking loss for Direct Assessment-based relative ordering.
- Proprietary MQM (≈12K): Regression to normalized Multidimensional Quality Metrics.
Corresponding losses are:
- HTER/MQM: Mean squared error against ground-truth scalar quality score.
- DA/DARR: Triplet margin loss with margin $\epsilon$ on relative rankings.
The models use the Adam optimizer with discriminative learning rates. Encoder parameters may be frozen initially, followed by joint fine-tuning. Each language pair or judgement type receives a dedicated model; no multi-task or auxiliary losses are used in the original framework.
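The two objectives and the discriminative learning-rate setup can be written compactly in PyTorch; the margin value and the split into encoder vs. head parameter groups below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def hter_mqm_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Regression objective for HTER/MQM models: mean squared error."""
    return F.mse_loss(pred, target)

def darr_triplet_loss(s, r, h_better, h_worse, margin: float = 1.0) -> torch.Tensor:
    """DARR objective: the better hypothesis must be closer than the worse one
    to both the source and the reference embeddings (triplet margin loss)."""
    loss_src = F.triplet_margin_loss(s, h_better, h_worse, margin=margin)
    loss_ref = F.triplet_margin_loss(r, h_better, h_worse, margin=margin)
    return loss_src + loss_ref

def make_optimizer(encoder, head):
    """Discriminative learning rates: a smaller rate for the pretrained encoder,
    a larger one for the randomly initialized head (values are illustrative)."""
    return torch.optim.Adam([
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 3e-4},
    ])
```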
3. Performance, Robustness, and Evaluation Practice
COMET delivers consistently superior segment-level correlation with human judgments, as measured by Pearson's $r$, Spearman's $\rho$, and shared-task Kendall's $\tau$-like statistics. Highlights (WMT19):
- en→X pairs: COMET-RANK achieves higher Kendall's $\tau$ than BLEU, chrF, and BERTScore.
- X→en pairs: COMET-RANK outperforms both BLEU and BERTScore.
- Non-English (X→Y) pairs: COMET-RANK surpasses the strongest baseline metric.
COMET’s relative performance is maintained in robustness tests focusing on discrimination among high-performing systems, where classic metrics tend to saturate.
Ablation studies reveal strong contributions from including the source segment, learned layer-wise pooling, and leveraging both source and reference signals for adequacy. Regression-based models also demonstrate high sample efficiency; e.g., the en→ru MQM-trained COMET model, trained on only ≈12K annotations, outperforms string-based metrics despite using a small fraction of the data available to the HTER- and DA-trained variants (Rei et al., 2020).
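For reference, segment-level shared-task scores of this kind are typically computed with the WMT Kendall's $\tau$-like formulation over DARR pairs; the sketch below follows that definition, with function and variable names chosen here purely for illustration:

```python
def wmt_kendall_tau_like(metric_better, metric_worse):
    """WMT DARR Kendall's tau-like statistic.

    Each position i holds the metric scores of two hypotheses where humans judged
    the first to be better. Concordant pairs are those where the metric agrees
    with the human ranking (ties are counted as disagreements here).
    """
    concordant = sum(b > w for b, w in zip(metric_better, metric_worse))
    discordant = sum(b <= w for b, w in zip(metric_better, metric_worse))
    return (concordant - discordant) / (concordant + discordant)

# Example: the metric agrees on 3 of 4 human-ranked pairs -> tau = 0.5
print(wmt_kendall_tau_like([0.9, 0.8, 0.7, 0.4], [0.5, 0.6, 0.75, 0.2]))
```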
4. Known Pitfalls and Reproducibility Recommendations
The flexibility of COMET as a learned metric introduces several reproducibility and reliability challenges (Zouhar et al., 2024):
- Software environment drift: Scores can vary by up to 0.05 solely due to differences in the unbabel-comet version or Python runtime. Explicit versioning is recommended for reproducibility.
- Quantization precision: FP16 on GPU is safe; int8 quantization on CPU produces unreliable scores.
- Data/usage artifacts: Empty hypotheses are mapped to nonzero scores (≈0.3–0.4); hypotheses in the wrong target language can still receive inflated scores, and estimates are skewed by source difficulty and domain tokens. No internal language identification (LID) or malformed-input rejection is implemented.
- Multi-reference support: COMET natively accepts only one reference per call. Multiple references must be handled with post hoc aggregation strategies (maximum, mean, or pooled scores), none of which yields consistent improvements.
- Model/reporting practices: Identifiers for specific COMET checkpoints/models are inconsistently reported, undermining comparability. The sacreCOMET tool standardizes citation and environment reporting.
Recommended best practices include explicit model/environment reporting, language identification filters, scoring empty hypotheses as zero, and complementing COMET with string-based metrics like BLEU or chrF.
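A defensive scoring wrapper along these lines might look as follows. It assumes the unbabel-comet Python API (download_model / load_from_checkpoint / predict); the checkpoint name is only an example, and the exact return type of predict() depends on the installed version:

```python
from importlib.metadata import version
from comet import download_model, load_from_checkpoint  # pip install unbabel-comet (pin the version)

MODEL_NAME = "Unbabel/wmt22-comet-da"  # example checkpoint; always report the exact identifier

def score_segments(sources, hypotheses, references):
    """Score MT output with COMET while guarding against known pitfalls."""
    model = load_from_checkpoint(download_model(MODEL_NAME))

    data, kept_idx = [], []
    scores = [0.0] * len(hypotheses)          # empty hypotheses are scored as zero, not ~0.3-0.4
    for i, (src, hyp, ref) in enumerate(zip(sources, hypotheses, references)):
        if not hyp.strip():
            continue
        # A target-language identification check could be inserted here, since
        # COMET itself does not reject wrong-language hypotheses.
        data.append({"src": src, "mt": hyp, "ref": ref})
        kept_idx.append(i)

    # Recent unbabel-comet versions return an object with .scores; older versions
    # return a (segment_scores, system_score) tuple -- check your installed version.
    output = model.predict(data, batch_size=16, gpus=0)
    for i, seg_score in zip(kept_idx, output.scores):
        scores[i] = seg_score

    print(f"Scored with {MODEL_NAME} using unbabel-comet {version('unbabel-comet')}")
    return scores
```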
5. Extensions: Uncertainty Quantification
To address fragile confidence in individual segment predictions, several approaches to uncertainty-aware COMET variants have been developed (Glushkova et al., 2021, Zerva et al., 2022):
- Monte Carlo dropout: Compute predictive mean and variance by running multiple stochastic forward passes with dropout enabled.
- Deep ensembles: Train several COMET instances with different initializations, aggregate predictions to estimate uncertainty.
- Heteroscedastic regression: Directly predict a per-instance variance parameter, optimizing log-likelihood; accommodates aleatoric (data) uncertainty.
- KL divergence minimization: Train to match mean and variance from multiple annotator labels.
- Direct error/uncertainty prediction: Train a secondary network to predict expected absolute error, capturing epistemic (model) uncertainty.
These approaches yield well-calibrated confidence intervals (ECE < 0.025), positive Uncertainty Pearson Scores (UPS ≈ 0.2–0.5), and maintain predictive accuracy. Efficient, single-pass objectives (e.g., heteroscedastic regression) show minimal overhead relative to costly MC dropout or ensembles. Uncertainty-aware scores enable risk- and confidence-based system filtering, targeted human review, and more trustworthy metric deployment (Glushkova et al., 2021, Zerva et al., 2022).
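Two of these strategies are straightforward to sketch; the forward-pass interface and the use of GaussianNLLLoss below are assumptions chosen for brevity rather than the published training setups:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_score(model: nn.Module, batch, n_passes: int = 30):
    """Monte Carlo dropout: keep dropout active at inference and aggregate
    several stochastic forward passes into a mean score and a variance."""
    model.train()  # enables dropout layers
    preds = torch.stack([model(batch) for _ in range(n_passes)])  # (n_passes, batch_size)
    return preds.mean(dim=0), preds.var(dim=0)

def heteroscedastic_step(model_mu_sigma: nn.Module, batch, target: torch.Tensor):
    """Heteroscedastic regression: the head predicts a mean and a per-instance
    log-variance, trained with a Gaussian negative log-likelihood."""
    mu, log_var = model_mu_sigma(batch)      # two outputs per segment
    nll = nn.GaussianNLLLoss()
    return nll(mu, target, log_var.exp())    # variance must be positive
```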
6. Adaptations and Advances: Hybridization, Language Coverage, and Compression
Hybrid Lexical-Neural Metrics
Integrating COMET with surface metrics such as BLEU and chrF compensates for occasional neural metric failures on critical phenomena (e.g., entities/numbers) (Glushkova et al., 2023). Explicit fusion includes:
- Sentence-level features: Feeding normalized BLEU/chrF as additional features to COMET’s prediction head.
- Word-level tags: Using TER-based alignment tags ("ok"/"bad") as token-level auxiliary inputs for subword error highlighting.
Such augmentations improve detection of catastrophic errors and overall segment discrimination in robustness challenge sets, with statistically significant gains over vanilla COMET and string-only metrics.
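The sentence-level variant can be prototyped with sacrebleu's sentence-level scorers; appending the normalized scores to COMET's feature vector, as below, is a simplified stand-in for the fusion described in the paper:

```python
import torch
import sacrebleu

def lexical_features(hyp: str, ref: str) -> torch.Tensor:
    """Sentence-level BLEU and chrF, normalized to [0, 1], as extra features."""
    bleu = sacrebleu.sentence_bleu(hyp, [ref]).score / 100.0
    chrf = sacrebleu.sentence_chrf(hyp, [ref]).score / 100.0
    return torch.tensor([bleu, chrf])

def fused_features(comet_features: torch.Tensor, hyp: str, ref: str) -> torch.Tensor:
    """Concatenate the neural feature vector with the lexical features before
    the prediction head (whose input size grows by two)."""
    return torch.cat([comet_features, lexical_features(hyp, ref)], dim=-1)
```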
Under-Resourced Language Generalization
AfriCOMET adapts the COMET estimator architecture with an African-centric encoder (AfroXLM-R), trained on simplified MQM and DA-annotated datasets covering 13 African languages (Wang et al., 2023). A multi-task learning (MTL) head jointly predicts adequacy, source–MT, and MT–reference scores, yielding Spearman correlations across African language pairs that substantially exceed those of string-based metrics and the original COMET models. Training only on high-quality, multi-annotator "WMT Others" data and leveraging encoder adaptation enables strong zero/few-shot transfer to unseen, under-resourced languages.
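A multi-task head of this kind can be sketched as a shared trunk with three regression outputs; the layer sizes and output names here are assumptions, not the AfriCOMET implementation:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared trunk over the combined embedding features, with one regression
    output per task (overall adequacy, source-MT score, MT-reference score)."""

    def __init__(self, in_dim: int, hidden: int = 512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.heads = nn.ModuleDict({
            "adequacy": nn.Linear(hidden, 1),
            "src_mt": nn.Linear(hidden, 1),
            "mt_ref": nn.Linear(hidden, 1),
        })

    def forward(self, features: torch.Tensor) -> dict:
        shared = self.trunk(features)
        return {name: head(shared).squeeze(-1) for name, head in self.heads.items()}

# Training would sum per-task MSE losses, weighted equally or by task importance.
```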
Efficient Model Compression
xCOMET-lite demonstrates empirically validated model distillation, integer quantization, and structured pruning for high-throughput, low-resource deployment (2406.14553). Knowledge distillation into smaller Transformer encoders (e.g., mDeBERTa-v3, 278M params) retains over 92% of macro-level quality at 1/40th the memory footprint of xCOMET-XXL. Quantization to as low as 3–4 bits maintains quality within 1–2% of the original models, enabling batch throughput above 140 sentences/sec on consumer GPUs. Structured pruning alone yields more modest savings, and combining pruning with distillation leads to quality collapse. Trade-offs are summarized below:
| Method | RAM Saving | Throughput | Quality Loss (Kendall's $\tau$) |
|---|---|---|---|
| 8-bit Quantization | 33–64% | negligible | ≤0.002 |
| Distillation | 50% | 4–15× | –0.019 to –0.044 |
| Pruning (≤25%) | 13–18% | 1.2–1.3× | –0.016 to –0.032 |
Empirically, xCOMET-lite (distilled + quantized) outperforms earlier compact neural metrics (e.g., BLEURT-20) in the strong-small regime (2406.14553).
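At its core, the distillation step amounts to regressing a compact student encoder onto the teacher's segment scores; the sketch below illustrates this with a simple MSE objective and assumed model interfaces (the actual xCOMET-lite recipe involves further details such as data selection):

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer):
    """One knowledge-distillation step: the compact student regresses onto the
    segment-level scores produced by the large frozen teacher."""
    with torch.no_grad():
        teacher_scores = teacher(batch)      # (batch_size,) pseudo-labels from the teacher

    student_scores = student(batch)          # same shape, predicted by the student
    loss = F.mse_loss(student_scores, teacher_scores)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```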
7. Impact, Limitations, and Research Outlook
COMET’s architecture enables accurate, context- and adequacy-sensitive MT evaluation across diverse languages and domains, setting the standard for segment-level metrics throughout major shared tasks and in enterprise-scale deployment. However, as with all machine learning-based evaluation paradigms, inherent challenges persist:
- Pitfalls: Sensitivity to implementation/configuration, data domain mismatch, and linguistic anomalies.
- Comparability: Cross-paper and cross-setup COMET scores are not absolute; explicit configuration reporting is essential.
- Multi-reference/Multi-domain Gaps: Native support for multiple references and robust domain adaptation remain imperfect.
- Uncertainty Use: Calibration and interpretability of uncertainty-aware outputs, while advanced, require further validation for critical tasks such as high-stakes MT deployment.
- Resource requirements: While compression advances are significant, full-scale xCOMET remains demanding for some users.
Ongoing research addresses these limitations via environment signature tracking tools (e.g., sacreCOMET), encoder adaptation, hybrid lexical–neural approaches, and uncertainty calibration. The COMET family continues to be a focal point for both applied and theoretical progress in neural MT evaluation (Rei et al., 2020, Zouhar et al., 2024, Glushkova et al., 2021, Zerva et al., 2022, Wang et al., 2023, Glushkova et al., 2023, 2406.14553).