mGPT: Few-Shot Learners Go Multilingual (2204.07580v2)

Published 15 Apr 2022 in cs.CL and cs.AI

Abstract: Recent studies report that autoregressive language models can successfully solve many NLP tasks via zero- and few-shot learning paradigms, which opens up new possibilities for using the pre-trained language models. This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and the Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism; the DeepSpeed and Megatron frameworks allow us to parallelize the training and inference steps effectively. The resulting models show performance on par with the recently released XGLM models by Facebook, covering more languages and enhancing NLP possibilities for low-resource languages of CIS countries and Russian small nations. We detail the motivation for the choices of the architecture design, thoroughly describe the data preparation pipeline, and train five small versions of the model to choose the most optimal multilingual tokenization strategy. We measure the model perplexity in all covered languages and evaluate it on a wide spectrum of multilingual tasks, including classification, generative, sequence labeling, and knowledge probing. The models were evaluated with the zero-shot and few-shot methods. Furthermore, we compared the classification tasks with the state-of-the-art multilingual model XGLM. The source code and the mGPT XL model are publicly released.

Overview of mGPT: Few-Shot Learners Go Multilingual

The paper "mGPT: Few-Shot Learners Go Multilingual" introduces an ambitious effort to extend the capabilities of GPT-like models beyond the monolingual confines primarily focusing on English, into a multilingual context encompassing 61 languages across 25 distinct language families. This endeavor is particularly noteworthy as it attempts to address linguistic communities, including economically endangered and underrepresented languages of the Commonwealth of Independent States (CIS) and the smaller peoples in Russia.

Key Technical Contributions

At its core, the paper studies few-shot learning with multilingual autoregressive language models through two mGPT variants: mGPT1.3B with 1.3 billion parameters and mGPT13B with 13 billion parameters. Following the design principles of GPT-3, both models are trained on a typologically diverse corpus drawn from Wikipedia and the Colossal Clean Crawled Corpus (C4). Pretraining on such a corpus is notable for its attempt to balance data between high-resource and low-resource languages, an equitable distribution that has historically been overlooked.
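As a concrete illustration of how the publicly released checkpoint could be used for generation, the following is a minimal sketch built on Hugging Face Transformers. The model identifier shown and the sampling settings are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: autoregressive generation with a released mGPT checkpoint.
# The Hub model ID below is an assumption; substitute the identifier under
# which the checkpoint is actually published.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-forever/mGPT"  # assumed identifier for the released mGPT XL model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Искусственный интеллект — это"  # any of the covered languages works
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```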

Evaluation Framework

The evaluation proceeds along intrinsic and extrinsic dimensions. The intrinsic evaluation gauges language modeling ability via perplexity across all pretraining languages, with mGPT13B attaining markedly lower perplexity than its smaller counterpart. The extrinsic evaluation draws on cross-lingual natural language understanding (NLU) datasets and benchmarks in 33 languages.
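For reference, perplexity for an autoregressive model is the exponential of the average per-token negative log-likelihood. The sketch below shows one standard way to estimate it with Transformers and PyTorch; the helper name and truncation length are illustrative, not the paper's exact protocol.

```python
import math
import torch

def perplexity(model, tokenizer, texts, max_length=512):
    """Token-level perplexity of a causal LM over a list of texts (illustrative)."""
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1) - 1   # predicted positions (labels shift internally)
        total_nll += out.loss.item() * n   # loss is the mean NLL per predicted token
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```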

The analysis highlights:

  • mGPT1.3B's comparable performance to XGLM1.7B despite its broader multilingual coverage.
  • mGPT's strong handling of Austronesian, Austro-Asiatic, Japonic, Germanic, and Romance languages across multiple tasks, although performance degrades as more demonstration examples are added.
  • The ineffectiveness of zero-shot and few-shot setups for hate speech detection, pointing to limited generalization in that domain (a minimal prompting sketch follows this list).
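To make the zero-/few-shot classification setup concrete, one common scheme is to build a prompt from a handful of in-context demonstrations and pick the candidate label whose completion the model scores highest. The template and likelihood-based scoring below are a hedged sketch of that scheme, not the paper's exact prompts.

```python
import torch

def few_shot_classify(model, tokenizer, demos, query, labels):
    """Pick the label whose prompt completion the LM assigns the highest likelihood.

    demos: list of (text, label) in-context examples; query: text to classify.
    The prompt template and scoring scheme are illustrative assumptions.
    """
    context = "".join(f"Text: {text}\nLabel: {label}\n\n" for text, label in demos)
    scores = {}
    for label in labels:
        enc = tokenizer(f"{context}Text: {query}\nLabel: {label}", return_tensors="pt")
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL per token
        scores[label] = -loss.item()
    return max(scores, key=scores.get)
```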

Implications and Future Prospects

The paper provides insight into how multilingual autoregressive models can facilitate cross-lingual knowledge transfer and mitigate the annotation and compute shortages typical of low-resource languages. Practically, mGPT stands to improve linguistic inclusivity and diversity in NLP research, enabling applications for languages often sidelined by mainstream ML and NLP advances, and helping to reduce carbon footprints through more efficient model reuse.

Theoretically, the findings raise questions about how best to balance linguistic diversity against model capacity, inviting further exploration of tokenization strategies and prompt design. The paper also opens a discussion of the ethical facets of deploying such powerful multilingual models, including potential biases and misuse.

Conclusion

Overall, mGPT represents a significant stride in the landscape of multilingual NLP, marrying the complexities of linguistic diversity with cutting-edge machine learning paradigms. The research marks an initial but crucial step towards achieving greater equity in language representation within AI systems, presenting an open invitation for further communal efforts in enhancing the performance and applicability of multilingual generative LLMs.

Authors (6)
  1. Oleh Shliazhko (4 papers)
  2. Alena Fenogenova (17 papers)
  3. Maria Tikhonova (10 papers)
  4. Vladislav Mikhailov (31 papers)
  5. Anastasia Kozlova (3 papers)
  6. Tatiana Shavrina (18 papers)
Citations (133)