BlonDe: A Document-Level MT Evaluation Metric
The paper presents BlonDe, an automatic metric designed to address the inadequacies of standard sentence-level metrics, such as BLEU, for evaluating document-level machine translation (MT). Such metrics score each sentence in isolation and therefore cannot account for inter-sentential context or discourse phenomena. BlonDe fills this gap by evaluating translation quality from a document-level perspective.
BlonDe takes a holistic approach: it categorizes discourse-related spans and computes a similarity-based F1 measure over each category, extending the unit of evaluation from isolated sentences to whole documents so that discourse coherence enters the assessment. The document-level phenomena it targets, including inconsistency, ellipsis, and ambiguity, are rarely captured by traditional metrics yet are critical for a thorough assessment of MT quality at the document level.
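To make the category-wise scoring concrete, here is a minimal Python sketch of the idea: count discourse-related items per category over the whole document, compute precision, recall, and F1 for each category, and average the per-category scores. The category lexicons, the purely lexical span extraction, and the plain averaging are simplifications for illustration; the actual metric defines its own span extractors (for pronouns, entities, tense, and discourse markers) and a weighted combination.

```python
from collections import Counter

# Illustrative category lexicons; the actual metric defines its own span
# extractors for pronouns, entities, tense, and discourse markers.
CATEGORIES = {
    "pronoun": {"he", "she", "it", "they", "him", "her", "them"},
    "discourse_marker": {"however", "therefore", "meanwhile", "moreover"},
}

def extract_items(document, vocab):
    """Collect category items over all sentences of a document (simple lexical match)."""
    return Counter(
        tok.strip(".,!?")
        for sent in document
        for tok in sent.lower().split()
        if tok.strip(".,!?") in vocab
    )

def category_f1(hyp_doc, ref_doc, vocab):
    """Document-wide precision/recall/F1 over the multiset overlap of category items."""
    hyp, ref = extract_items(hyp_doc, vocab), extract_items(ref_doc, vocab)
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def blonde_like_score(hyp_doc, ref_doc):
    """Average the per-category F1 scores (the real metric uses a weighted combination)."""
    scores = [category_f1(hyp_doc, ref_doc, vocab) for vocab in CATEGORIES.values()]
    return sum(scores) / len(scores)

hyp = ["However, she went home.", "They met her later."]
ref = ["However, she went home.", "They met him later."]
print(f"{blonde_like_score(hyp, ref):.3f}")  # pronoun F1 ~0.67, marker F1 = 1.0 -> ~0.83
```

Because the counts are accumulated over the whole document rather than sentence by sentence, a wrong pronoun in one sentence lowers the score even when every individual sentence looks plausible in isolation.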
Through experiments on a newly constructed document-level dataset, Bilingual Web Books (BWB), the researchers demonstrate BlonDe's effectiveness. The corpus spans multiple genres and contains over 9 million sentence pairs. An error analysis on it shows that a substantial proportion of translation errors are document-level: inconsistency (64.4%), ellipsis (20.3%), and ambiguity (7.3%) account for a significant share of translation mistakes.
BlonDe compares favorably with existing metrics, exhibiting better selectivity and interpretability. In human studies, it achieves higher Pearson correlation with human judgments than prior metrics, supporting its validity as a reliable tool for document-level MT evaluation. Furthermore, its ability to evaluate pronouns, tense, named entities, and discourse markers offers a perspective on translation quality that goes beyond the sentence level.
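For reference, system-level agreement with human judgments is typically measured as below: each MT system gets one metric score and one averaged human rating, and the two lists are correlated. The numbers here are made up for illustration and are not results from the paper.

```python
from scipy.stats import pearsonr

# Illustrative, hypothetical scores for five MT systems (not values from the paper).
metric_scores = [0.52, 0.61, 0.58, 0.70, 0.66]   # e.g. BlonDe score per system
human_scores = [3.1, 3.6, 3.4, 4.2, 3.9]          # e.g. mean human rating per system

r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson's r = {r:.3f} (p = {p_value:.3f})")
```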
The paper also introduces dBlonDe and BlonDe+, variants of BlonDe that isolate document-specific translation quality and incorporate human annotations, respectively. The latter lets users integrate human-evaluated discourse features into BlonDe's framework, offering greater flexibility and precision in translation evaluation.
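As a rough illustration of how human annotations could be folded into the same category-wise scheme, the sketch below scores only tokens an annotator has flagged as discourse-critical and treats them as an extra category. The function name, inputs, and scoring are hypothetical and do not reflect the actual BlonDe+ implementation.

```python
from collections import Counter

def annotated_f1(hyp_doc, ref_doc, annotated_tokens):
    """Hypothetical: F1 restricted to tokens a human annotator marked as discourse-critical."""
    vocab = {t.lower() for t in annotated_tokens}
    def count(doc):
        return Counter(
            tok.strip(".,!?")
            for sent in doc
            for tok in sent.lower().split()
            if tok.strip(".,!?") in vocab
        )
    hyp, ref = count(hyp_doc), count(ref_doc)
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Annotator-provided list of context-dependent items for this document (hypothetical input).
print(annotated_f1(["They met her later."], ["They met him later."], ["her", "him", "they"]))
```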
The implications of this work are significant for the MT community. BlonDe provides a robust framework for evaluating MT systems in a manner that is more aligned with human judgment, particularly for document-level tasks, thereby encouraging the development of translation systems that better handle contextual dependencies. As MT approaches continue to evolve, metrics like BlonDe will be essential for accurately gauging progress toward producing translations that are coherent, cohesive, and contextually appropriate at the document level. Future work may involve further expansion of BlonDe to support additional languages and discourse phenomena, enhancing its applicability across diverse MT scenarios.