Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs (2410.13394v1)

Published 17 Oct 2024 in cs.CL

Abstract: Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.

Authors (6)
  1. Sumanth Doddapaneni (16 papers)
  2. Mohammed Safi Ur Rahman Khan (8 papers)
  3. Dilip Venkatesh (1 paper)
  4. Raj Dabre (65 papers)
  5. Anoop Kunchukuttan (45 papers)
  6. Mitesh M. Khapra (79 papers)

Summary

  • The paper introduces the CIA Suite, a novel framework that employs the cross-lingual evaluator Hercule and a multilingual Recon test set to achieve human-aligned LLM evaluations.
  • Experimental results show that Hercule aligns more closely with human judgments than proprietary models and remains effective in zero-shot evaluation on unseen languages.
  • The study paves the way for scalable, cost-effective multilingual benchmarks that reduce reliance on expensive human assessments.

Cross-Lingual Auto Evaluation for Multilingual LLMs

The evaluation of machine-generated text, particularly for non-English languages, poses a persistent challenge within the field of NLP. While substantial advancements have been made in evaluating English-language text through various automated metrics, human assessments, and LLM-based approaches, a comprehensive framework for multilingual evaluations remains elusive. The paper "Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs" addresses this gap by introducing the Cross Lingual Auto Evaluation (CIA) Suite, a flexible framework designed to evaluate multilingual LLMs.

CIA Suite Framework

The CIA Suite comprises two pivotal components: evaluator LLMs named "Hercule" and the "Recon" test set. Recon is a novel multilingual benchmark, featuring 500 human-annotated instructions and judgment scores across six languages, specifically curated to evaluate multilingual LLMs. Hercule, on the other hand, is a cross-lingual evaluation model that learns to assign scores to responses based on English reference answers, effectively overcoming the scarcity of target language references.
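To make the cross-lingual setup concrete, here is a minimal sketch of how such an evaluation prompt could be assembled: the instruction and candidate response are in the target language, while the reference answer and scoring rubric are in English. The template text, the rubric wording, and the `query_evaluator` call are illustrative assumptions, not the paper's actual prompt or API.

```python
# Minimal sketch of a cross-lingual evaluation prompt in the spirit of Hercule.
# The template, rubric wording, and query_evaluator() are hypothetical
# placeholders, not the paper's actual prompt format or interface.

EVAL_TEMPLATE = """You are an impartial evaluator.
Instruction (target language):
{instruction}

Candidate response (target language):
{response}

Reference answer (English):
{reference}

Scoring rubric (1 = poor, 5 = excellent):
{rubric}

Return a short feedback paragraph followed by "Score: <1-5>"."""

def build_eval_prompt(instruction: str, response: str,
                      reference_en: str, rubric: str) -> str:
    """Assemble one evaluation prompt: a target-language instruction and
    response are scored against an easily available English reference."""
    return EVAL_TEMPLATE.format(
        instruction=instruction,
        response=response,
        reference=reference_en,
        rubric=rubric,
    )

# Example usage (Hindi instruction and response, English reference):
prompt = build_eval_prompt(
    instruction="भारत की राजधानी क्या है?",
    response="भारत की राजधानी नई दिल्ली है।",
    reference_en="The capital of India is New Delhi.",
    rubric="Factual correctness and completeness of the answer.",
)
# score_text = query_evaluator(prompt)  # hypothetical call to a fine-tuned evaluator LLM
```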

Methodology and Experiments

The paper's experiments demonstrate that Hercule models align more closely with human judgments than proprietary models, showing effectiveness in low-resource scenarios. These models are particularly proficient in zero-shot evaluation of unseen languages. This research represents the first comprehensive examination of cross-lingual evaluation using LLMs, offering a scalable and effective approach to multilingual assessment.

The CIA Suite's Recon test set provides comprehensive human-annotated benchmarks that enable meta-evaluation of Evaluator LLMs. This capability allows general-purpose multilingual LLMs to be benchmarked beyond simpler, closed tasks, addressing the need for robust multilingual benchmarks highlighted in the paper.
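To illustrate what meta-evaluating an evaluator LLM involves, the sketch below measures agreement between an evaluator's scores and human judgment scores on the same set of responses. The choice of linear-weighted Cohen's kappa and Kendall's tau, as well as the toy scores, is an assumption made for illustration; the paper's exact agreement metrics and data format may differ.

```python
# Minimal sketch of meta-evaluating an evaluator LLM against human judgments.
# The metrics chosen here are illustrative; the paper's exact agreement
# measures may differ.
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 scores assigned to the same responses.
human_scores =     [5, 3, 4, 2, 1, 4, 3, 5]
evaluator_scores = [5, 3, 3, 2, 2, 4, 3, 4]

# Linear-weighted kappa treats adjacent scores (e.g., 3 vs. 4) as closer
# agreement than distant ones (e.g., 1 vs. 5).
kappa = cohen_kappa_score(human_scores, evaluator_scores, weights="linear")
tau, _ = kendalltau(human_scores, evaluator_scores)

print(f"Linear-weighted kappa: {kappa:.3f}")
print(f"Kendall's tau:         {tau:.3f}")
```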

Implications and Future Directions

The CIA Suite has both practical and theoretical implications. Practically, it provides a standardized method for evaluating multilingual LLMs, reducing reliance on costly and time-consuming human evaluations. Theoretically, it advances the understanding of cross-lingual transfer and the limitations of LLMs in non-English contexts.

Future refinements to the CIA Suite could involve expanding the languages covered within Recon and adapting Hercule to incorporate more diverse multilingual corpora. Moreover, further exploration into the cross-lingual transfer mechanisms could provide deeper insights into enhancing multilingual models.

Conclusion

In conclusion, the "Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs" paper takes a significant step toward addressing the challenge of multilingual text evaluation in NLP. Through the introduction of the CIA Suite, it lays the groundwork for more accurate and scalable assessments, paving the way for future advancements in the development and evaluation of multilingual LLMs. By addressing current gaps and offering a cost-effective alternative to extensive human evaluation, the CIA Suite presents a notable contribution to the ongoing evolution of NLP research.