Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Languages (2404.02512v1)
Abstract: Focusing on the effectiveness of LLMs for automatic reference-less translation assessment, this work presents our experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task in which we performed zero-shot learning, in-context example-driven learning, and fine-tuning of LLMs to produce a score out of 100, where 100 represents a perfect translation and 1 a poor one. We compared the performance of our trained systems with existing methods such as COMET, BERTScore, and LaBSE, and found that the LLM-based evaluator (LLaMA-2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs.
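The scoring setup described in the abstract (prompting an LLM to return a 1–100 direct-assessment style score from only the source sentence and its translation, with no reference) can be illustrated with a minimal sketch. The model checkpoint, prompt wording, and decoding settings below are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of reference-less (quality-estimation style) translation scoring
# with a causal LLM. The checkpoint, prompt text, and decoding parameters are
# assumptions for illustration, not the paper's exact setup.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# device_map="auto" needs the `accelerate` package; drop it for a CPU-only run.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def score_translation(source, translation, src_lang, tgt_lang):
    """Ask the LLM for a direct-assessment style score in [1, 100] without a reference."""
    prompt = (
        f"Rate the following {src_lang}-to-{tgt_lang} translation on a scale of 1 to 100, "
        "where 100 is a perfect translation and 1 is a poor translation. "
        "Reply with the number only.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Score:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Decode only the newly generated tokens and pull out the first integer.
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"\d{1,3}", completion)
    return min(int(match.group()), 100) if match else None

print(score_translation("The weather is pleasant today.", "आज मौसम सुहावना है।", "English", "Hindi"))
```

Segment-level scores produced this way would then be correlated (e.g., Pearson or Spearman) with human direct-assessment judgments, which is how the LLM-based evaluator is compared against metrics such as COMET, BERTScore, and LaBSE.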
- Findings of the 2021 Conference on Machine Translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.
- Characterizing attribution and fluency tradeoffs for retrieval-augmented large language models. arXiv preprint arXiv:2302.05578.
- Khetam Al Sharou and Lucia Specia. 2022. A taxonomy and study of critical errors in machine translation. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 171–180.
- Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.
- Jerome R Bellegarda. 2004. Statistical language model adaptation: Review and perspectives. Speech Communication, 42(1):93–108.
- A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
- Everlyn Chimoto and Bruce Bassett. 2022. COMET-QE and active learning for low-resource machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4735–4740, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
- Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
- IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. arXiv preprint arXiv:2305.16307.
- Meteor-Hindi: Automatic MT evaluation metric for Hindi as a target. In Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, Macmillan Publishers, India.
- How do humans evaluate machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 457–466, Lisbon, Portugal. Association for Computational Linguistics.
- Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects.
- Takeshi Hayakawa and Yuki Arase. 2020. Fine-grained error analysis on English-to-Japanese machine translation in the medical domain. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 155–164, Lisboa, Portugal. European Association for Machine Translation.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- MS-COMET: More and better human judgements improve metric performance. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 541–548, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
- Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.
- Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40.
- BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
- Maja Popović. 2015. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.
- Improving language understanding by generative pre-training.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Neural machine translation for low-resource languages: A survey. ACM Computing Surveys, 55(11):1–37.
- COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
- CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Irene Rivera-Trigueros. 2022. Machine translation systems and quality assessment: a systematic review. Language Resources and Evaluation, 56(2):593–619.
- Rajeev Sangal. 2022. Evaluating MT Systems: A Theoretical Framework. arXiv preprint arXiv:2202.05806.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
- TER-Plus: Paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23:117–127.
- Nigar M Shafiq Surameery and Mohammed Y Shakor. 2023. Use ChatGPT to solve programming bugs. International Journal of Information Technology & Computer Engineering (IJITC), 3(01):17–22.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Challenges of Neural Machine Translation for Short Texts. Computational Linguistics, 48(2):321–342.
- A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432.
- Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464, Toronto, Canada. Association for Computational Linguistics.
- Findings of the WMT 2022 shared task on quality estimation. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 69–99, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
Authors: Vandan Mujadia, Pruthwik Mishra, Arafat Ahsan, Dipti Misra Sharma