- The paper introduces MQM-APE, a novel framework that integrates automatic post-editing to filter non-impactful errors in LLM translation evaluations.
- It uses a three-module approach—error analysis, post-editing, and pairwise verification—to enhance alignment with human MQM scores.
- Experiments show consistent gains in pairwise accuracy and error span precision over GEMBA-MQM, and the post-edited translations score higher on automatic quality metrics.
Overview of MQM-APE: Enhancing LLM Translation Evaluations with Automatic Post-Editing
Introduction
The paper "MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators" proposes an innovative framework designed to improve the reliability and interpretability of error annotations in Machine Translation (MT) evaluations. By incorporating Automatic Post-Editing (APE) into LLM evaluators, the authors introduce a method that filters out non-impactful errors, thereby enhancing the overall quality of error annotations and translation quality assessments. This essay will provide an expert overview of the paper, discussing its methodologies, results, and implications for future AI developments.
Methodology
The authors propose MQM-APE, a universal, training-free framework that integrates APE into the error annotation process. The framework operates through three sequential modules (a minimal code sketch follows the list):
- Error Analysis Evaluator: This module leverages GEMBA-MQM for initial error detection. It identifies and categorizes errors in the translation, producing a set of raw error annotations.
- Automatic Post-Editor: For each identified error, this module post-edits the original translation to correct it; the resulting post-edit is later used to gauge that error's actual impact on translation quality.
- Pairwise Quality Verifier: This module compares each post-edited translation against the original. Errors whose correction does not yield a verified quality improvement are discarded.
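To make the pipeline concrete, here is a minimal Python sketch of the three-module loop. The `llm` callable, the prompt wording, the `parse_errors` helper, and the per-error editing granularity are illustrative assumptions, not the paper's exact prompts or implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: a prompt string in, the model's text reply out.
LLM = Callable[[str], str]

@dataclass
class Error:
    span: str       # offending text in the translation hypothesis
    category: str   # MQM-style label, e.g. "accuracy/mistranslation"
    severity: str   # "major" or "minor"

def parse_errors(reply: str) -> List[Error]:
    """Placeholder parser: assumes one 'severity|category|span' error per line."""
    errors = []
    for line in reply.strip().splitlines():
        severity, category, span = line.split("|", 2)
        errors.append(Error(span=span, category=category, severity=severity))
    return errors

def error_analysis(llm: LLM, src: str, hyp: str) -> List[Error]:
    """Module 1: GEMBA-MQM-style prompt asking the LLM to annotate errors."""
    reply = llm(f"List MQM errors (severity|category|span per line).\n"
                f"Source: {src}\nTranslation: {hyp}")
    return parse_errors(reply)

def post_edit(llm: LLM, src: str, hyp: str, err: Error) -> str:
    """Module 2: ask the LLM to correct the translation with respect to one error."""
    return llm(f"Post-edit the translation so that this error is fixed.\n"
               f"Source: {src}\nTranslation: {hyp}\n"
               f"Error: {err.severity} {err.category}: '{err.span}'")

def verify(llm: LLM, src: str, original: str, edited: str) -> bool:
    """Module 3: pairwise verification; True if the post-edit is judged better."""
    reply = llm(f"Which translation of the source is better, A or B? Answer A or B.\n"
                f"Source: {src}\nA: {original}\nB: {edited}")
    return reply.strip().upper().startswith("B")

def mqm_ape(llm: LLM, src: str, hyp: str) -> List[Error]:
    """Keep only errors whose correction yields a verified quality improvement."""
    kept = []
    for err in error_analysis(llm, src, hyp):
        edited = post_edit(llm, src, hyp, err)
        if verify(llm, src, hyp, edited):
            kept.append(err)
    return kept
```

The key design point is that the verifier acts as a filter: an annotated error survives only if correcting it produces a verifiably better translation.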
Experiments were conducted on various LLMs, including general-purpose and translation-specific models, across high- and low-resource languages to assess the framework's performance.
Experimental Results
Reliability and Interpretability
MQM-APE consistently outperformed GEMBA-MQM at both the system and segment levels. The paper reports improvements in pairwise accuracy, reflecting better alignment with human-annotated MQM scores. For instance, system-level accuracy saw increases ranging from +0.4 to +5.8 percentage points across different LLMs. Additionally, error span precision (SP) and major error span precision (MP) metrics showed notable enhancements, with SP improvements ranging from +0.2 to +4.6 percentage points and MP improvements up to +0.7 points in some models. These results indicate that MQM-APE successfully enhances the quality and interpretability of error annotations.
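For readers unfamiliar with the headline metric: system-level pairwise accuracy is the fraction of system pairs that the evaluator ranks in the same order as human MQM scores. The sketch below is a simplified illustration with invented scores; tie handling and the exact aggregation used in the paper may differ.

```python
from itertools import combinations
from typing import Dict

def pairwise_accuracy(metric: Dict[str, float], human: Dict[str, float]) -> float:
    """Fraction of system pairs ordered the same way by the metric and by human MQM.
    Both dicts map system name -> aggregate score (higher = better);
    ties count as disagreement in this simplified version."""
    pairs = list(combinations(metric, 2))
    agree = sum(1 for a, b in pairs
                if (metric[a] - metric[b]) * (human[a] - human[b]) > 0)
    return agree / len(pairs)

# Toy example with invented scores for three MT systems (MQM penalties, higher = better):
print(pairwise_accuracy(
    metric={"sysA": -3.0, "sysB": -5.5, "sysC": -1.2},
    human={"sysA": -4.0, "sysB": -6.0, "sysC": -2.5},
))  # 1.0: all three pairs are ranked consistently
```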
Practical Implications
The APE step consistently improved translation quality. Metrics such as CometKiwi22QE and BLEURT20 showed gains of up to +6.12 points on the post-edited translations. Furthermore, segment-level comparisons showed the post-edits beating the original translations in a clear majority of cases, with win-lose ratios exceeding 3.64 for most models, demonstrating APE's efficacy in enhancing translation quality.
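The win-lose ratio reported above can be read as a simple per-segment comparison under an automatic quality metric: count how often the post-edit scores higher than the original versus lower. A minimal sketch, with tie handling as an assumption:

```python
from typing import Sequence, Tuple

def win_lose(original: Sequence[float], edited: Sequence[float]) -> Tuple[int, int, float]:
    """Compare per-segment quality scores (e.g. from a QE metric) for the original
    and post-edited translations; ties are ignored. Returns (wins, losses, ratio)."""
    wins = sum(1 for o, e in zip(original, edited) if e > o)
    losses = sum(1 for o, e in zip(original, edited) if e < o)
    return wins, losses, wins / max(losses, 1)

# Toy usage: 4 segments, post-edit scores better on 3 and worse on 1 -> ratio 3.0
print(win_lose([0.70, 0.62, 0.81, 0.55], [0.78, 0.60, 0.86, 0.59]))
```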
Implications and Future Directions
The incorporation of APE offers several practical and theoretical implications. Practically, it provides a mechanism for refining error annotations and improving translation quality, which benefits MT system developers and researchers focused on quality assurance. Theoretically, the findings suggest that APE-based verification can reliably separate impactful errors from non-impactful ones, offering a robust means of improving LLM-based evaluation without any model retraining.
Future research could explore several avenues:
- Collaborative Model Evaluation: Investigating the synergy between different LLMs for even more robust evaluations.
- Adjusting Error Categories: Focusing on better alignment of error categories with human evaluations, particularly for less frequent error types.
- Cost-Effective Alternatives: Further refining cost-reducing measures, such as replacing the pairwise verifier with an effective numerical metric, to balance cost and performance (one possible variant is sketched below).
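As one illustration of the cost-reduction idea in the last bullet, the LLM-based pairwise verifier could be replaced by a score delta from a reference-free quality-estimation model. The sketch below assumes the `unbabel-comet` package and the gated `Unbabel/wmt22-cometkiwi-da` checkpoint; the margin threshold and overall wiring are assumptions, not the paper's configuration.

```python
# pip install unbabel-comet  (the CometKiwi checkpoint is gated on Hugging Face
# and requires accepting its license before download)
from comet import download_model, load_from_checkpoint

ckpt = download_model("Unbabel/wmt22-cometkiwi-da")   # reference-free QE model
qe = load_from_checkpoint(ckpt)

def improves(src: str, original: str, edited: str, margin: float = 0.0) -> bool:
    """Cheaper verifier: keep an error only if correcting it raises the QE score
    by more than `margin` (threshold chosen here purely for illustration)."""
    out = qe.predict([{"src": src, "mt": original},
                      {"src": src, "mt": edited}],
                     batch_size=2, gpus=0)  # gpus=0 -> run on CPU
    return out.scores[1] - out.scores[0] > margin
```

Whether such a numerical proxy matches the LLM verifier's filtering quality is exactly the cost-performance trade-off this direction would need to quantify.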
Conclusion
The MQM-APE framework represents a significant step forward in MT quality assessment, blending the strengths of GEMBA-MQM and APE for enhanced error annotation quality. The extensive evaluations across multiple LLMs and languages substantiate its effectiveness, offering valuable insights for both the research community and practical applications in MT systems. While there are areas for further improvement and exploration, the current findings lay a strong foundation for future advancements in AI-driven translation evaluations.
MQM-APE thus exemplifies the ongoing evolution of LLM-based MT evaluation toward greater accuracy and interpretability, and underscores the value of integrating error correction and verification mechanisms such as APE into this domain.