BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting (2212.09535v3)

Published 19 Dec 2022 in cs.CL, cs.AI, and cs.LG

Abstract: The BLOOM model is a large publicly available multilingual LLM, but its pretraining was limited to 46 languages. To extend the benefits of BLOOM to other languages without incurring prohibitively large costs, it is desirable to adapt BLOOM to new languages not seen during pretraining. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages in a resource-constrained setting. We find language adaptation to be effective at improving zero-shot performance in new languages. Surprisingly, we find that adapter-based finetuning is more effective than continued pretraining for large models. In addition, we discover that prompting performance is not significantly affected by language specifics, such as the writing system. It is primarily determined by the size of the language adaptation data. We also add new languages to BLOOMZ, which is a multitask finetuned version of BLOOM capable of following task instructions zero-shot. We find including a new language in the multitask fine-tuning mixture to be the most effective method to teach BLOOMZ a new language. We conclude that with sufficient training data language adaptation can generalize well to diverse languages. Our code is available at https://github.com/bigscience-workshop/multilingual-modeling.


Overview of the BLOOM+1 Paper

The paper "BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting" addresses the challenge of extending the BLOOM multilingual LLM to languages beyond the 46 covered during its original pretraining. The authors apply existing language adaptation techniques to BLOOM and benchmark zero-shot prompting performance on eight new languages in a resource-constrained setting.

Key Insights

Language Adaptation Strategies: The research evaluates the effectiveness of language adaptation strategies like continued pretraining, MAD-X adapters, and (IA)³ adapters on BLOOM across different scales, ranging from 560 million to 7.1 billion parameters. It highlights that adapter-based fine-tuning outperforms continued pretraining for larger models in resource-constrained settings.
Model Performance: The paper finds that while smaller models benefit more from continued pretraining, larger models (3B parameters and above) achieve better performance when adapted with adapter strategies such as MAD-X or (IA)³. Performance after adaptation also improves as the number of parameters grows, a trend consistent with scaling laws.
Data Utilization: The research underscores the importance of having sufficient adaptation data. It shows that approximately 100 million tokens of good-quality data are required for effective language adaptation in zero-shot prompting scenarios.
Adaptation Outcomes on New Languages: Performance gains were observed for the new languages regardless of their script or language family. Notably, adapted BLOOM matched or outperformed baseline models such as mGPT and XGLM on several tasks and languages.
Instruction-Tuning with New Languages: The paper also adds new languages to BLOOMZ, the multitask finetuned version of BLOOM that follows task instructions zero-shot. Including the new language in the multitask fine-tuning mixture proved the most effective way to teach BLOOMZ that language.
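The two adapter families named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, dimensions, and initialization choices are assumptions made for illustration, and the bottleneck adapter shown here omits other MAD-X components. The sketch shows a residual bottleneck adapter (the building block behind MAD-X-style language adapters) and an (IA)³-style learned rescaling vector, both initialized so that the adapted layer initially behaves like the frozen one.

```python
import numpy as np

def bottleneck_adapter(h, w_down, w_up):
    """Residual bottleneck adapter: h + relu(h @ w_down) @ w_up."""
    return h + np.maximum(h @ w_down, 0.0) @ w_up

def ia3_scale(activations, scale):
    """(IA)^3-style adaptation: elementwise rescaling by a learned vector."""
    return activations * scale

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 4          # illustrative sizes, not from the paper
h = rng.normal(size=(3, d_model))      # stand-in for frozen hidden states

# Assumed init: zero up-projection, so the adapter starts as the identity.
w_down = rng.normal(size=(d_model, d_bottleneck)) * 0.02
w_up = np.zeros((d_bottleneck, d_model))
adapter_out = bottleneck_adapter(h, w_down, w_up)

# Assumed init: vector of ones, so the rescaling starts as the identity.
scale = np.ones(d_model)
ia3_out = ia3_scale(h, scale)

# Only these small tensors would be trained; the base model stays frozen.
adapter_params = w_down.size + w_up.size
ia3_params = scale.size
```

In this sketch only `w_down`, `w_up`, and `scale` would receive gradient updates during language adaptation, which is what makes both approaches cheaper than continued pretraining of every weight.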

Implications and Future Speculations

Scalability and Efficiency: Adapter-based methods like MAD-X and (IA)³ could offer a scalable and efficient path forward for adapting very large models (>100B parameters) to new languages without significant computational burdens, promoting broader accessibility with reduced resource requirements.
Cross-Lingual Generalization: The research presents insights on cross-lingual generalization capabilities of large-scale LLMs and suggests that multilingual adaptability can be achieved through selective data augmentation and parameter-efficient methods.
Applicability to Low-Resource Languages: The findings advocate for further exploration of data-efficient strategies to extend LLMs to truly low-resource languages, which often lack sufficient unlabeled data.
Future Directions in Multilingual Models: The results point toward more inclusive models that support languages traditionally underrepresented in pretraining corpora.
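The efficiency argument for adapters can be made concrete with a rough parameter count. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the layer count, hidden width, bottleneck size, and one-adapter-per-layer layout are hypothetical values chosen for illustration, not the configuration reported in the paper.

```python
def adapter_param_count(d_model: int, d_bottleneck: int, n_layers: int) -> int:
    """Parameters in one bottleneck adapter per layer (weights plus biases)."""
    down = d_model * d_bottleneck + d_bottleneck   # down-projection and bias
    up = d_bottleneck * d_model + d_model          # up-projection and bias
    return n_layers * (down + up)

# Hypothetical configuration at roughly the 7B-parameter scale.
total_model_params = 7_100_000_000
trainable = adapter_param_count(d_model=4096, d_bottleneck=256, n_layers=30)
trainable_fraction = trainable / total_model_params

print(f"trainable adapter parameters: {trainable:,}")
print(f"fraction of the full model:   {trainable_fraction:.4%}")
```

Under these assumed sizes the adapters account for well under one percent of the model's weights, which is why the per-language training and storage cost stays small as the base model grows.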

This research advances the understanding of how large-scale LLMs can be adapted to more diverse languages without the impractical cost of complete retraining, supporting more inclusive AI language technologies.