Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models (2303.13809v4)
Abstract: Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that using LLMs to assess the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigate several prompting designs and propose a new prompting method called Error Analysis Prompting (EAPrompt) by combining Chain-of-Thought (Wei et al., 2022) and Error Analysis (Lu et al., 2023). This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM; Freitag et al., 2021), and produces explainable and reliable MT evaluations at both the system and segment levels. Experimental results from the WMT22 metrics shared task validate the effectiveness of EAPrompt on various LLMs with different architectures. Further analysis confirms that EAPrompt effectively distinguishes major errors from minor ones, and that its distribution of error counts closely matches that of MQM. These findings highlight the potential of EAPrompt as a human-like evaluator prompting technique for MT evaluation.
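To make the two-step evaluation the abstract describes concrete, below is a minimal Python sketch: prompt the model to enumerate and classify errors, then derive an MQM-style segment score from the counts. The prompt wording, the `query_llm` helper, and the 5/1 severity weights are illustrative assumptions, not the paper's exact template or weighting.

```python
import re

# Illustrative EAPrompt-style template: ask the LLM to first list errors
# (the chain-of-thought / error-analysis step), then report counts by
# severity, mirroring MQM's major/minor distinction. The exact wording
# is an assumption, not the paper's verbatim prompt.
EAPROMPT_TEMPLATE = """You are evaluating a machine translation.
Source: {source}
Reference: {reference}
Translation: {hypothesis}

Step 1: List each translation error on its own line and label it
"major" (meaning is distorted) or "minor" (small imperfection).
Step 2: Report totals as "Major errors: <n>" and "Minor errors: <n>"."""


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("plug in your LLM provider here")


def _count(response: str, severity: str) -> int:
    """Parse 'Major errors: 2' / 'Minor errors: 1' lines from the response."""
    match = re.search(rf"{severity} errors:\s*(\d+)", response, re.IGNORECASE)
    return int(match.group(1)) if match else 0


def eaprompt_score(source: str, reference: str, hypothesis: str) -> float:
    """Score one segment from LLM-identified errors.

    Uses an MQM-style weighting (assumed here): each major error costs
    5 points, each minor error 1 point; 0 means no errors found.
    """
    response = query_llm(EAPROMPT_TEMPLATE.format(
        source=source, reference=reference, hypothesis=hypothesis))
    return -(5 * _count(response, "major") + _count(response, "minor"))
```

The key design point is that the model is asked to count and classify errors rather than emit a score directly; the severity-weighted count is what lets such a method separate major from minor errors and yields finer-grained segment-level judgments.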
- Findings of the IWSLT 2021 evaluation campaign. In IWSLT.
- Findings of the 2019 conference on machine translation (WMT19). In WMT.
- Language models are few-shot learners. NeurIPS.
- Scaling instruction-finetuned language models. arXiv preprint.
- Training verifiers to solve math word problems. arXiv preprint.
- Experts, errors, and context: A large-scale study of human evaluation for machine translation. TACL.
- Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In WMT.
- Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering.
- How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint.
- Is ChatGPT a good translator? A preliminary study. arXiv preprint.
- Findings of the 2022 conference on machine translation (WMT22). In WMT.
- Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. arXiv preprint.
- To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In WMT.
- Toward human-like evaluation for natural language generation with error analysis. arXiv preprint.
- Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In ACL.
- Results of the WMT20 metrics shared task. In WMT.
- Training language models to follow instructions with human feedback. arXiv preprint.
- BLEU: a method for automatic evaluation of machine translation. In ACL.
- Towards making the most of ChatGPT for machine translation. arXiv preprint.
- Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint.
- Original or translated? On the use of parallel data for translation quality estimation. arXiv preprint.
- Language models are unsupervised multitask learners. OpenAI blog.
- Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In WMT.
- COMET: A neural framework for MT evaluation. In EMNLP.
- BLEURT: Learning robust metrics for text generation. In ACL.
- Machine translation evaluation versus quality estimation. Machine Translation.
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint.
- Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In EMNLP.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint.
- Automatic post-editing of MT output using large language models. In AMTA.
- UniTE: Unified translation evaluation. In ACL.
- Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint.
- ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark. arXiv preprint.
- Vega-MT: The JD explore academy machine translation system for WMT22. In WMT.
- Findings of the WMT 2022 shared task on quality estimation. In WMT.
- Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.
Authors: Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, Dacheng Tao