
Calibrating Translation Decoding with Quality Estimation on LLMs

Published 26 Apr 2025 in cs.CL (arXiv:2504.19044v3)

Abstract: Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on LLMs improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: https://github.com/moore3930/calibrating-LLM-mt.

Authors (3)

Summary

Overview of Calibrating Translation Decoding with Quality Estimation in LLMs

The paper "Calibrating Translation Decoding with Quality Estimation on LLMs" addresses a significant challenge in neural machine translation (NMT): maximum a posteriori (MAP) decoding selects the hypothesis the model scores highest, but that score often fails to track real-world translation quality. The authors propose a calibration approach that directly optimizes the Pearson correlation between hypothesis likelihood and translation quality, aiming to close this gap and improve translation outcomes. This summary covers the methodology, results, and implications of the research.

Calibration Approach and Methodology

Traditional NMT systems use MAP decoding to select the highest-scoring translation, which often yields low-quality output because hypothesis likelihood correlates poorly with actual translation quality. The paper introduces a calibration method that uses Pearson correlation as a training objective for LLMs: for each translation prompt, the system samples multiple hypotheses, scores each with an external quality metric (e.g., COMET), and then minimizes the negative Pearson correlation between the hypotheses' likelihoods and their quality scores using gradient-based optimization.

The approach is notable for its simplicity and effectiveness: a standard gradient-based optimizer minimizes the Pearson-based loss, aligning the model's likelihood distribution with quality metrics across diverse inputs.
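The loss described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: in the actual method the log-likelihoods come from the LLM and the loss is minimized with a gradient-based optimizer (the function name and example values here are purely illustrative).

```python
import numpy as np

def pearson_calibration_loss(log_likelihoods, quality_scores):
    """Negative Pearson correlation between hypothesis log-likelihoods
    and external quality scores (e.g., COMET), computed over one batch
    of sampled hypotheses for the same source. Minimizing this value
    pushes the model to assign higher likelihood to better translations."""
    l = np.asarray(log_likelihoods, dtype=float)
    q = np.asarray(quality_scores, dtype=float)
    l_c = l - l.mean()          # center both variables
    q_c = q - q.mean()
    denom = np.sqrt((l_c ** 2).sum()) * np.sqrt((q_c ** 2).sum())
    r = (l_c * q_c).sum() / denom
    return -r                   # loss is minimal (-1) at perfect correlation

# Toy batch: likelihoods increase linearly with quality, so the
# correlation is 1 and the loss sits at its minimum of -1.
loss = pearson_calibration_loss([-10.0, -6.0, -2.0], [0.60, 0.75, 0.90])
```

Because Pearson correlation is invariant to the scale and offset of either variable, the objective only constrains the *ranking-like* relationship between likelihood and quality, not the absolute likelihood values.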

Experimental Findings

The authors conducted extensive experiments using translation-specialized LLMs like Tower and ALMA, demonstrating significant performance enhancements across various metrics. When applied to state-of-the-art models such as Tower, the calibration method resulted in substantial improvements, rivaling systems employing computation-intensive test-time optimizations like Minimum Bayes Risk decoding.

For instance, calibrated models improved translation quality by up to 3.6 points on KIWI-XXL and 1.2 points on COMET relative to baseline systems. Notably, the calibrated models matched or exceeded much larger models such as Tower-70B, highlighting the method's efficiency and scalability.

Implications and Future Directions

The implications of this research are multifaceted. Practically, the calibration method enhances the effectiveness of MAP decoding, allowing for substantial quality improvements without the need for extensive computational resources typically associated with methods like Best-of-N sampling. Theoretically, the shared objective between translation quality optimization and estimation suggests a unified perspective, where well-performing models inherently learn to discern high-quality translations.
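The practical consequence of calibration is that the model's own likelihood can stand in for an external quality-estimation model at deployment time. A minimal sketch of that idea, assuming candidates arrive as (translation, token log-probs) pairs from the decoder (the helper name and data layout are hypothetical, not from the paper):

```python
def select_by_likelihood(candidates):
    """Rank candidate translations by length-normalized log-likelihood.
    For a calibrated model this score doubles as a reference-free quality
    proxy, so no external QE model or Best-of-N reranking is required.

    `candidates`: list of (translation, token_log_probs) pairs, where the
    token-level log-probs would come from the decoding model."""
    def score(token_log_probs):
        # Average per-token log-prob, so long hypotheses are not penalized
        # simply for having more tokens.
        return sum(token_log_probs) / len(token_log_probs)

    best = max(candidates, key=lambda c: score(c[1]))
    return best[0], score(best[1])

hyps = [
    ("translation A", [-0.9, -1.1, -1.0]),
    ("translation B", [-0.2, -0.3, -0.1]),
]
best, qe_score = select_by_likelihood(hyps)  # picks "translation B"
```

This is what makes MAP decoding attractive again after calibration: a single likelihood-ranked pass replaces the sample-then-rescore loop of Best-of-N or MBR decoding.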

Future research could explore extending this calibration methodology to other generative tasks within NLP, leveraging the framework to improve real-world alignment of generated outputs with task-specific quality metrics. Additionally, the exploration of alternative correlation metrics or probabilistic models could further enhance calibration effectiveness and generalizability.

Conclusion

In summary, this study highlights the critical importance of calibration in NMT systems, offering a robust framework for improving translation quality in LLMs. By optimizing the Pearson correlation between hypothesis likelihood and quality, the proposed method paves the way for more effective and efficient translation systems, with implications for broader applications in AI-driven language processing technologies.
