Extrinsic Evaluation of Machine Translation Metrics

Published 20 Dec 2022 in cs.CL and cs.AI | (2212.10297v2)

Abstract: Automatic machine translation (MT) metrics are widely used to distinguish the translation qualities of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation with the success/failure on the final task for the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable mostly because of undefined ranges. We synthesise our analysis into recommendations for future MT metrics to produce labels rather than scores for more informative interaction between machine translation and multilingual language understanding.

Abstract PDF Upgrade to Chat

Authors (4)

Citations (17)

View on Semantic Scholar

Summary

The paper demonstrates that standard MT metrics do not correlate strongly with extrinsic task success at the segment level.
The study applies a Translate-Test setup across semantic parsing, question answering, and dialogue state tracking to robustly assess metric efficacy.
The paper recommends transitioning from numerical scores to error-focused labeling and diverse training data to enhance MT evaluation.

Extrinsic Evaluation of Machine Translation Metrics

Introduction

The paper "Extrinsic Evaluation of Machine Translation Metrics" investigates the efficacy of automatic machine translation (MT) metrics in detecting the segment-level quality of translations. These metrics have traditionally been used to evaluate translation quality at the system level, but their ability to assess individual sentence translations remains ambiguous. This study correlates MT metrics with outcomes from downstream tasks, including dialogue state tracking, question answering, and semantic parsing, to measure their utility in practical applications.

Methodology

The methodology involves the Translate-Test setup, where translations of test language sentences to the task language are evaluated for downstream tasks.

Figure 1: The meta-evaluation pipeline. The predictions for the extrinsic task in the test language (Chinese, ZH) are obtained using the Translate-Test setup --- the test language is translated into the task language (English, EN) before passing to the task-specific model.

Tasks Evaluated

Semantic Parsing (SP): Uses logical forms to transform natural language into machine-readable formats, evaluated using exact-match denotation accuracy.
Extractive Question Answering (QA): Predicts word spans from a paragraph to answer questions, with performance measured by Exact-Match metrics.
Dialogue State Tracking (DST): Maps user intents in conversations to predefined slots, tested with Joint Goal Accuracy.

After translations are processed by task-specific models, MT metrics assess whether translations contribute to downstream task success or failure.

Results

The study finds minimal correlation between MT metrics scores and success in extrinsic tasks, indicating that current metrics are not reliable at the segment level.

Case Study

A focused analysis on the semantic parsing task with Chinese inputs illustrates the limitations of MT metrics, using COMET-DA scores to categorize correct versus incorrect parses.

Figure 2: Graph of predictions by COMET-DA (threshold: -0.028), categorized by the metric scores in ten intervals. Task: Semantic Parsing with English parser and test language is Chinese.

Analysis and Recommendations

The paper proposes recommendations for improving MT metrics:

MQM for Human Evaluation: Suggests leveraging Multidimensional Quality Metrics for marking explicit translation errors to refine metric accuracy.
Labels vs Scores: Advocates for segment-level evaluation metrics that predict error types rather than relying on ambiguous numerical scores.
Diverse Training Data: Recommends including varied references during training to improve the robustness of metrics against paraphrases.

Conclusion

The study provides insight into the limitations of current MT metrics in assessing translation quality at the segment level. Recommendations focus on enhancing MT evaluation by transitioning from score-based assessments to error-focused labeling. Future developments should address the diverse error sensitivities of extrinsic tasks to improve MT systems' efficacy in real-world applications.

Figure 3: Examples of errors made by COMET-DA.

The findings emphasize a need for more sophisticated approaches to evaluate MT translations, intending to bridge the gap between intrinsic metric performance and extrinsic task success.

Markdown Report Issue