- The paper finds that reference information boosts LLM evaluation accuracy, while source input can sometimes mislead the model.
- The study evaluates multiple input modes (T, S-T, R-T, S-R-T) using system accuracy and segment-level Kendall’s τ correlation.
- The results highlight a need to enhance LLMs' cross-lingual capabilities for more reliable and nuanced machine translation evaluation.
Introduction to LLMs in Machine Translation Evaluation
LLMs like GPT-3 and LaMDA have been at the forefront of a wide range of linguistic and translation tasks. Among their many capabilities, evaluating machine-translated text is an area that merits closer exploration. Although these models show promising results, understanding how they operate when evaluating translations is crucial for progress in this domain.
Evaluation of Machine Translation
Evaluating machine translation typically involves comparing the machine-generated text against a "gold standard" set by human translations. Traditional metrics like BLEU have been in use for many years, but their reliability is questioned, especially for high-quality translations whose assessment requires a nuanced understanding of language. In contrast, neural metrics and LLMs offer a more sophisticated approach and often correlate better with human judgment.
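To make the contrast with learned metrics concrete, here is a minimal sketch of the string-matching idea behind BLEU: clipped n-gram precision combined with a brevity penalty. This is a simplified single-sentence version for illustration, not the full corpus-level metric (which aggregates counts over a test set and supports multiple references).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one sentence: geometric mean of clipped
    n-gram precisions, scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # one empty overlap zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(simple_bleu("the cat sat on a mat", "the cat sat on the mat"))    # < 1.0
```

The surface-level counting visible here is exactly why BLEU struggles with high-quality translations: a fluent paraphrase with little n-gram overlap scores poorly, whereas an LLM judge can credit it.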
Methodology of the Study
Researchers designed experiments to test how different input modes influence an LLM's ability to evaluate translations. These modes provided the LLM with the translation only (T), source and translation (S-T), reference and translation (R-T), or source, reference, and translation combined (S-R-T). System-level accuracy and segment-level Kendall's τ correlation served as the primary evaluation metrics. Interestingly, while reference information significantly improved evaluation accuracy, source information sometimes had a counterproductive effect, highlighting a potential lack of cross-lingual capability in current models.
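The setup above can be sketched in two pieces: building the four input modes as prompts, and scoring a metric's segment-level agreement with human judgments via Kendall's τ. The prompt wording and function names below are illustrative assumptions, not the paper's exact templates; the τ variant shown is the plain pairwise (tau-a) form.

```python
# Hypothetical prompt templates for the four input modes (T, S-T, R-T, S-R-T);
# the paper's actual wording is not reproduced here.
PROMPTS = {
    "T":     "Rate this translation from 0 to 100.\nTranslation: {t}",
    "S-T":   "Rate this translation from 0 to 100.\nSource: {s}\nTranslation: {t}",
    "R-T":   "Rate this translation from 0 to 100.\nReference: {r}\nTranslation: {t}",
    "S-R-T": "Rate this translation from 0 to 100.\nSource: {s}\nReference: {r}\nTranslation: {t}",
}

def build_prompt(mode, s="", r="", t=""):
    return PROMPTS[mode].format(s=s, r=r, t=t)

def kendall_tau(metric_scores, human_scores):
    """Segment-level Kendall's tau (tau-a): the fraction of segment
    pairs that metric and humans rank the same way, minus the fraction
    they rank oppositely. Ties count toward neither."""
    concordant = discordant = 0
    n = len(metric_scores)
    for i in range(n):
        for j in range(i + 1, n):
            m = metric_scores[i] - metric_scores[j]
            h = human_scores[i] - human_scores[j]
            if m * h > 0:
                concordant += 1
            elif m * h < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(build_prompt("R-T", r="The cat sat.", t="A cat was sitting."))
print(kendall_tau([70, 85, 60], [65, 90, 55]))  # perfect agreement -> 1.0
```

Comparing the τ obtained under each input mode is what reveals the paper's central asymmetry: adding the reference (R-T) helps, while adding the source (S-T) can hurt.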
Key Findings and Implications
One of the critical findings is that LLMs leverage reference information remarkably effectively, which enhances the accuracy of translation evaluation. The inclusion of source information, on the other hand, appears to confuse LLMs, pointing to their cross-lingual capabilities as a potential area for future research. Moreover, when LLMs were tasked with detecting translation errors, reference information again had a positive impact, whereas source information did not contribute as effectively.
These results underscore the current limitations of LLMs in machine translation evaluation tasks and point towards the necessity of further research focused on their ability to process and utilize source language information accurately.