
COMET: A Neural Framework for MT Evaluation (2009.09025v2)

Published 18 Sep 2020 in cs.CL

Abstract: We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality. To showcase our framework, we train three models with different types of human judgements: Direct Assessments, Human-mediated Translation Edit Rate and Multidimensional Quality Metrics. Our models achieve new state-of-the-art performance on the WMT 2019 Metrics shared task and demonstrate robustness to high-performing systems.

An Expert Overview of "COMET: A Neural Framework for MT Evaluation"

The paper "COMET: A Neural Framework for MT Evaluation," authored by Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie, introduces COMET, a novel neural framework designed for the multilingual evaluation of Machine Translation (MT) systems. This research aims to advance the state of the art in MT evaluation by leveraging recent developments in cross-lingual pretrained language models to better correlate with human judgments.

Introduction

Traditional MT evaluation metrics, such as BLEU and METEOR, rely heavily on lexical similarities between the MT output and a human reference translation. These metrics, while computationally efficient, underperform with modern neural MT systems, which often generate translations that differ lexically from the reference yet are contextually accurate. Recognizing the inadequacy of these traditional metrics, the authors propose COMET, a framework that more effectively captures the semantic and contextual quality of MT outputs.

Model Architectures

The core contribution of this paper is the COMET framework, which supports two types of model architectures: the Estimator model and the Translation Ranking model.

Estimator Model

The Estimator model regresses on segment-level human judgment scores such as Direct Assessments (DA), Human-mediated Translation Edit Rate (HTER), and Multidimensional Quality Metrics (MQM). The architecture comprises the following components (a minimal code sketch follows the list):

  • Cross-lingual Encoder: Utilizes pretrained models like XLM-RoBERTa to generate embeddings for source, hypothesis, and reference sentences.
  • Pooling Layer: Employs a layer-wise attention mechanism to pool information from the encoder's layers into a single embedding per token, which is then combined into a segment-level embedding.
  • Feed-Forward Regressor: The combined features from the embeddings are fed into a feed-forward neural network to predict quality scores.
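
As a concrete illustration, the following is a minimal, hypothetical PyTorch sketch of the layer-wise pooling and the feed-forward regressor. It assumes pooled embeddings for the source (s), hypothesis (h), and reference (r), and assumes the combined feature vector concatenates the hypothesis and reference embeddings with their element-wise products and absolute differences against the source and reference. Class and variable names are illustrative, not taken from the authors' released code.

```python
# Hypothetical sketch of the Estimator components (illustrative names; not the
# authors' released implementation).
import torch
import torch.nn as nn


class LayerwiseAttention(nn.Module):
    """Scalar-mix pooling: a softmax-weighted sum over encoder layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, emb_dim) per-layer pooled embeddings
        alphas = torch.softmax(self.layer_weights, dim=0)
        return torch.einsum("l,lbd->bd", alphas, layer_states)


class EstimatorHead(nn.Module):
    """Feed-forward regressor over combined source/hypothesis/reference features."""

    def __init__(self, emb_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        # Assumed feature set: [h; r; h*s; h*r; |h-s|; |h-r|] -> 6 * emb_dim
        self.regressor = nn.Sequential(
            nn.Linear(6 * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),  # single scalar quality score
        )

    def forward(self, s, h, r):
        feats = torch.cat([h, r, h * s, h * r, (h - s).abs(), (h - r).abs()], dim=-1)
        return self.regressor(feats).squeeze(-1)


# Toy usage with random tensors standing in for XLM-RoBERTa outputs.
num_layers, batch, emb_dim = 24, 2, 1024
pool = LayerwiseAttention(num_layers)
s, h, r = (pool(torch.randn(num_layers, batch, emb_dim)) for _ in range(3))
print(EstimatorHead(emb_dim)(s, h, r).shape)  # torch.Size([2])
```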

Translation Ranking Model

The Translation Ranking model is trained to pull the embedding of the "better" hypothesis closer to the source and reference than that of the "worse" hypothesis. The architecture involves the following (a sketch of the training loss follows the list):

  • Cross-lingual Encoder and Pooling: Similar to the Estimator model, this component generates embeddings for input sequences.
  • Triplet Margin Loss: During training, it uses a triplet margin loss function to ensure the "better" hypothesis is closer to the source and reference compared to the "worse" hypothesis.
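
A minimal sketch of this objective is shown below, assuming pooled embeddings for the source, reference, and the two hypotheses are already available; the function name and the use of PyTorch's built-in triplet margin loss are illustrative choices, not necessarily those of the released implementation.

```python
# Illustrative sketch of the ranking objective (not the authors' code).
import torch
import torch.nn.functional as F


def ranking_loss(src, ref, hyp_better, hyp_worse, margin: float = 1.0):
    """Sum of two triplet margin losses: the 'better' hypothesis must end up
    closer than the 'worse' one to the source anchor and to the reference
    anchor, each by at least `margin`."""
    loss_src = F.triplet_margin_loss(src, hyp_better, hyp_worse, margin=margin)
    loss_ref = F.triplet_margin_loss(ref, hyp_better, hyp_worse, margin=margin)
    return loss_src + loss_ref


# Toy usage with random embeddings standing in for encoder output.
emb_dim = 1024
src, ref, better, worse = (torch.randn(4, emb_dim) for _ in range(4))
print(ranking_loss(src, ref, better, worse))
```

At inference time this model has no regression head; a segment-level score is instead derived from the embedding distances between the hypothesis and the source and reference anchors.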

Experimental Setup

The authors train three models using the COMET framework:

  • COMET-HTER: Trained on the QT21 corpus, focusing on HTER scores.
  • COMET-MQM: Trained on an internal MQM-annotated corpus of customer support chat messages.
  • COMET-RANK: Trained on WMT DARR judgments (relative rankings derived from Direct Assessments) from multiple years to rank translation hypotheses.

Each model is evaluated under the WMT 2019 Metrics Shared Task setup, using Kendall's Tau-like correlation to measure agreement with human judgments. Additionally, the robustness of the models is tested on the outputs of high-performing MT systems to determine their ability to distinguish among top-quality translations.
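
The Tau-like statistic used with DARR judgments simply compares, for each human-annotated (better, worse) hypothesis pair, whether the metric agrees with the human ranking; a small illustrative helper (hypothetical name) is shown below.

```python
# Illustrative helper for the Kendall's Tau-like statistic over DARR pairs.
def kendall_tau_like(scores_better, scores_worse):
    """Aligned pairs of metric scores for the hypotheses humans judged better
    vs. worse; ties are counted as discordant."""
    concordant = sum(b > w for b, w in zip(scores_better, scores_worse))
    discordant = sum(b <= w for b, w in zip(scores_better, scores_worse))
    return (concordant - discordant) / (concordant + discordant)


# Toy example: the metric agrees with 3 of 4 human pairwise judgments -> 0.5
print(kendall_tau_like([0.9, 0.7, 0.8, 0.4], [0.5, 0.6, 0.9, 0.2]))
```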

Numerical Performance and Findings

The numerical results reported for the English-to-X and X-to-English language pairs of WMT 2019 demonstrate that the COMET models significantly outperform traditional metrics and more recent embedding-based metrics like BERTScore and BLEURT across multiple language pairs. Notably, COMET-RANK consistently shows the highest correlation with human judgments, indicating its effectiveness in capturing MT quality.

Furthermore, the robustness analysis on subsets of top-performing systems reveals that COMET models maintain strong performance even when evaluated on only the top 10, 8, 6, and 4 MT systems. This robustness underscores COMET's capability to discern finer differences among high-quality translations, addressing a critical shortfall in existing metrics.

Theoretical and Practical Implications

Theoretically, the COMET framework offers a significant step towards bridging the gap between human judgment and automated MT quality evaluation by integrating source, hypothesis, and reference embeddings. This multi-dimensional approach provides a deeper understanding of translation quality beyond surface-level lexical matching.

Practically, the strong performance of COMET across diverse language pairs and especially its competitive results on non-English languages showcase its versatility and potential for widespread adoption in the MT research and development community.

Future Developments

Future work may explore optimizing COMET for efficiency, potentially by adopting more compact models like DistilBERT. Additionally, the relative contribution of source and reference inputs during training and inference warrants further exploration to refine how the model balances these inputs.

Conclusion

In summary, the paper convincingly presents COMET as a robust and adaptable framework for MT evaluation, addressing key challenges with existing metrics. It provides evidence of superior performance in correlating with human evaluations and demonstrates promise for handling high-quality translations, thus marking an important advancement in the field of MT quality assessment.
