Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation (2401.05176v3)
Abstract: Large language models (LLMs) have demonstrated translation performance comparable and even superior to that of neural machine translation (NMT) systems. However, existing comparative studies between them rely mainly on automated metrics, raising questions about the reliability of these metrics and their alignment with human judgment. The present study investigates the convergences and divergences between automated metrics and human evaluation in assessing the quality of machine translation from ChatGPT and three NMT systems. The automatic assessment employs four automated metrics, while the human evaluation incorporates the DQF-MQM error typology and six rubrics. Notably, automatic assessment and human evaluation converge in measuring formal fidelity (e.g., error rates) but diverge in evaluating semantic and pragmatic fidelity, with automated metrics failing to capture the improvement in ChatGPT's translations brought about by prompt engineering. These results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools at the current stage.
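As a minimal illustration of the automatic-assessment side, the sketch below scores hypothetical system outputs against references with the sacrebleu library. The paper does not release its scoring code, and the exact metric set shown here (BLEU, chrF, TER) is an assumption based on the metric papers it cites; the sentences are invented examples.

```python
# Minimal sketch of reference-based automatic MT evaluation with sacrebleu.
# The sentences are invented examples, not data from the study.
import sacrebleu

# Candidate translations, e.g., from ChatGPT or an NMT system.
hypotheses = [
    "The cat sits on the mat.",
    "He finished the report yesterday.",
]

# sacrebleu expects a list of reference *streams*: references[i][j] is
# the i-th reference translation of the j-th source sentence.
references = [[
    "The cat is sitting on the mat.",
    "He completed the report yesterday.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # n-gram precision
chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-gram F-score
ter = sacrebleu.corpus_ter(hypotheses, references)    # edit rate (lower is better)
print(f"BLEU {bleu.score:.1f} | chrF {chrf.score:.1f} | TER {ter.score:.1f}")
```

Scores of this kind measure surface overlap with a reference, which is consistent with the paper's finding that they track formal fidelity yet miss prompt-driven gains in semantic and pragmatic adequacy that human raters notice.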
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, pages 65–72. Association for Computational Linguistics.
- Machine translation human evaluation: an investigation of evaluation based on post-editing and its relation with direct assessment. In Proceedings of the 15th International Conference on Spoken Language Translation, IWSLT 2018, Bruges, Belgium, October 29-30, 2018, pages 62–69. International Conference on Spoken Language Translation.
- Ten years of WMT evaluation campaigns: Lessons learnt.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Eirini Chatzikoumi. 2020. How to evaluate machine translation: A review of automated and human metrics. Natural Language Engineering, 26(2):137–161.
- Iterative translation refinement with large language models. CoRR, abs/2306.03856.
- PaLM: Scaling language modeling with pathways. CoRR, abs/2204.02311.
- Joanna Drugan. 2013. Quality in Professional Translation: Assessment and Improvement. Bloomsbury Academic.
- A survey of machine translation competences: Insights for translation technology educators and practitioners. Perspectives: Studies in Translatology, 23:1–26.
- Robert Godwin-Jones. 2022. Partnering with AI: Intelligent writing assistance and instructed language learning. Language Learning & Technology, 26(2):5–24.
- Achieving human parity on automatic Chinese to English news translation. CoRR, abs/1803.05567.
- Exploring human-like translation strategy with large language models. CoRR, abs/2305.04118.
- Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- How good are GPT models at machine translation? A comprehensive evaluation. CoRR, abs/2302.09210.
- Kaibao Hu and Xiaoqian Li. 2023. The creativity and limitations of AI neural machine translation: A corpus-based study of DeepL's English-to-Chinese translation of Shakespeare's plays. Babel, 69(4):546–563.
- Distinguishing translations by human, NMT, and ChatGPT: A linguistic and statistical approach. CoRR, abs/2312.10750.
- Is ChatGPT a good translator? Yes with GPT-4 as the engine. CoRR, abs/2301.08745.
- Marzena Karpinska and Mohit Iyyer. 2023. Large language models effectively leverage document-level context for literary translation, but critical errors persist. In Proceedings of the Eighth Conference on Machine Translation, WMT 2023, Singapore, December 6-7, 2023, pages 419–451. Association for Computational Linguistics.
- Defining translation quality. Tradumàtica, 12:413–420.
- Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023, Tampere, Finland, 12-15 June 2023, pages 193–203. European Association for Machine Translation.
- ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 13171–13189. Association for Computational Linguistics.
- Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the parallel corpus filtering task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 908–916. Association for Computational Linguistics.
- Multidimensional quality metrics: a flexible system for assessing translation quality. In Proceedings of Translating and the Computer 35, London, UK. Aslib.
- Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT. CoRR, abs/2303.13809.
- Xiaolei Lu and Chao Han. 2023. Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics: A multi-scenario exploratory study. Interpreting, 25(1):109–143.
- Linguistically motivated evaluation of the 2023 state-of-the-art machine translation: Can ChatGPT outperform NMT? In Proceedings of the Eighth Conference on Machine Translation, WMT 2023, Singapore, December 6-7, 2023, pages 224–245. Association for Computational Linguistics.
- MT post-editing guidelines.
- Augmenting large language model translators via translation memories. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 10287–10299. Association for Computational Linguistics.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318. ACL.
- Towards making the most of ChatGPT for machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 5622–5633. Association for Computational Linguistics.
- Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, WMT@EMNLP 2015, 17-18 September 2015, Lisbon, Portugal, pages 392–395. The Association for Computer Linguistics.
- Maja Popović. 2018. Error Classification and Analysis for Machine Translation Quality Assessment, pages 129–158. Springer International Publishing, Cham.
- Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, pages 1030–1040. Association for Computational Linguistics.
- QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6594–6604. Association for Computational Linguistics.
- A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, AMTA 2006, Cambridge, Massachusetts, USA, August 8-12, 2006, pages 223–231. Association for Machine Translation in the Americas.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
- Document-level machine translation with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 16646–16661. Association for Computational Linguistics.
- What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 22964–22984. PMLR.
- Emergent abilities of large language models. Transactions on Machine Learning Research. Survey Certification.
- Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35, NeurIPS 2022.
- Empowering LLM-based machine translation with cultural awareness. CoRR, abs/2305.14328.
- Unifying the perspectives of NLP and software engineering: A survey on language models for code. CoRR, abs/2311.07989.
- A survey of large language models. CoRR, abs/2303.18223.
- Zhaokun Jiang
- Ziyin Zhang
- Qianxi Lv
- Lei Lei