MEGA: Multilingual Evaluation of Generative AI
The paper "MEGA: Multilingual Evaluation of Generative AI" addresses the challenge of evaluating LLMs like ChatGPT, GPT-4, and others in a multilingual context. The paper is predicated on the notion that while generative AI has demonstrated remarkable capabilities in several tasks within NLP, its performance is predominantly evaluated in English, thereby leaving its efficacy in other languages largely unexplored.
The authors introduce MEGA, a comprehensive benchmarking framework for evaluating generative LLMs on 16 NLP datasets covering 70 typologically diverse languages. The evaluation compares these generative models against state-of-the-art (SOTA) non-autoregressive models, with the goal of assessing performance not only across languages but also across varied NLP tasks such as classification, question answering, sequence labeling, and natural language generation.
Key Findings
- Performance Disparity: MEGA reveals a clear gap between LLM performance in English and in non-English languages, a gap that widens for low-resource languages and those written in non-Latin scripts. Even state-of-the-art generative models perform significantly better on Latin-script languages, reflecting a bias toward the languages most prevalent in their training data.
- Impact of Pre-Training Data: The multilingual capabilities of these models correlate with the pre-training data's linguistic diversity. While models like GPT-4 mitigate the performance gap to a certain extent, non-English languages, especially those underrepresented in pre-training corpora, see a marked drop in performance.
- Tokenization and Prompt Strategy: The paper examines how tokenizer quality and prompting strategy affect multilingual performance. Poor tokenization of under-represented languages inflates token counts, which hurts model performance (a rough illustration of this effect follows the list). Prompt adaptations such as the 'translate-test' strategy offer significant improvements for low-resource languages.
- Evaluation Strategies: The analysis covers multiple prompting strategies, including monolingual, zero-shot cross-lingual, and translate-test prompts, which vary in effectiveness across tasks and languages; a sketch of the translate-test setup also appears below.
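To make the token-count effect concrete, here is a minimal sketch (not from the paper) that compares how many tokens a GPT-style tokenizer spends on parallel sentences in different scripts. The example sentences and the choice of the `cl100k_base` encoding are illustrative assumptions.

```python
# Illustrative sketch (not from the MEGA paper): compare how many tokens a
# GPT-style tokenizer spends on roughly parallel sentences in different scripts.
# Requires `pip install tiktoken`; the sentences below are hypothetical examples.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

parallel_sentences = {
    "English": "The weather is nice today.",
    "Spanish": "El clima está agradable hoy.",
    "Hindi": "आज मौसम अच्छा है।",
    "Tamil": "இன்று வானிலை நன்றாக உள்ளது.",
}

for language, sentence in parallel_sentences.items():
    tokens = enc.encode(sentence)
    # "Fertility": tokens per whitespace-separated word. Higher values mean the
    # tokenizer fragments the language more heavily, inflating sequence length.
    fertility = len(tokens) / len(sentence.split())
    print(f"{language:>8}: {len(tokens):3d} tokens, fertility ≈ {fertility:.1f}")
```

Languages whose scripts are split into many byte-level pieces typically show higher fertility, which raises cost and shrinks the effective context available to the model.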
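The translate-test strategy mentioned above can be sketched as a simple pipeline: translate the non-English input into English, prompt the model in English, and translate the output back when the task requires generation. The `translate` and `ask_llm` helpers below are hypothetical placeholders for an MT system and an LLM API, not the paper's implementation.

```python
# Minimal sketch of the "translate-test" strategy (hypothetical helpers; not the
# paper's code). The idea: route a non-English input through English, where the
# generative model is strongest.

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Placeholder for any MT system (commercial API or open model)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the generative LLM being evaluated."""
    raise NotImplementedError

def translate_test_classify(text: str, source_lang: str, labels: list[str]) -> str:
    # 1. Translate the test example into English.
    english_text = translate(text, source_lang=source_lang, target_lang="en")
    # 2. Prompt the model entirely in English, with English instructions and labels.
    prompt = (
        "Classify the sentiment of the following sentence as one of "
        f"{', '.join(labels)}.\n\nSentence: {english_text}\nLabel:"
    )
    # 3. Classification labels need no back-translation; generation tasks would
    #    translate the model's output back into source_lang.
    return ask_llm(prompt).strip()
```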
Implications and Future Directions
The paper suggests that advancing multilingual generative AI requires a multifaceted approach: enhancing the linguistic diversity of pre-training data, improving tokenization, and developing better multilingual prompting strategies. Future directions include assessing the specific needs of low-resource languages in NLP applications, a step toward more equitable AI systems.
A significant practical implication of this work is that it pushes the AI research agenda toward inclusive technology that serves linguistically diverse users well. The MEGA framework not only establishes a benchmark but also opens lines of inquiry into better methods for evaluating multilingual generative models, particularly for languages that lack extensive resources.
In conclusion, the MEGA benchmark sets a precedent for comprehensive multilingual evaluation, highlighting both the current strengths and the limitations of generative AI across languages. It also encourages further research toward models that perform equitably across the linguistic spectrum, closing capability gaps for under-represented languages.