CroissantLLM: A Truly Bilingual French-English Language Model (2402.00786v5)

Published 1 Feb 2024 in cs.CL and cs.LG

Abstract: We introduce CroissantLLM, a 1.3B LLM pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further LLM research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81 % of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in LLMs.


Summary

  • The paper presents CroissantLLM, a 1.3 billion parameter model pre-trained on an equal mix of French and English text to address language bias in NLP.
  • The paper reports an 81% score on the Foundation Model Transparency Index, highlighting a commitment to openness and clear data provenance.
  • The paper demonstrates that CroissantLLM outperforms monolingual French models and rivals specialized models in translation tasks, broadening its practical applicability.

Introduction

The landscape of NLP has been shaped predominantly by LLMs focused on English. This focus has left a scarcity of resources and tools for other languages, including French. Addressing this gap, we present CroissantLLM, a 1.3 billion parameter LLM pre-trained on an equal mix of English and French text totaling 3 trillion tokens. CroissantLLM represents a deliberate shift away from the English-centric approach, aiming to balance performance across both languages while keeping the model manageable even on consumer-grade hardware.
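
As a rough illustration of that last point, a minimal sketch of loading a 1.3B model on consumer hardware with Hugging Face transformers is shown below. The checkpoint identifier croissantllm/CroissantLLMBase is assumed from the public release and may differ from the exact published name; at float16 precision the weights occupy roughly 2.6 GB, small enough for a laptop GPU or CPU.

```python
# Sketch: loading a 1.3B bilingual model on consumer hardware.
# The Hub identifier below is an assumption, not confirmed by the paper text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "croissantllm/CroissantLLMBase"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # ~2.6 GB of weights for 1.3B parameters
    device_map="auto",          # uses a GPU if present, otherwise falls back to CPU
)

# The model is bilingual, so French and English prompts are both natural inputs.
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```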

Transparency and Bias

CroissantLLM embeds transparency into its development process, in contrast to the secrecy that often shrouds the training of state-of-the-art models. The model achieves an 81% score on the Foundation Model Transparency Index (FMTI), demonstrating a commitment to openness that exceeds most open initiatives. This emphasis on transparency aligns with current debates on how LLMs are built and used, as well as growing demands for clarity in AI, including clear data provenance and usage policies. The bias toward English in LLMs has skewed not only performance but also cultural representation. CroissantLLM mitigates this through its bilingual corpus, designed to foster diverse cultural knowledge, although, like any model of its size, it cannot capture the full scope of human language diversity.

Performance Benchmarking

CroissantLLM's performance demonstrates the successful integration of bilingual data into pre-training. English benchmarks place it in line with models such as TinyLlama (1.1B), and French evaluations show it outperforming monolingual French models. Translation is a particular strength: when fine-tuned, CroissantLLM rivals the specialized NLLB 1.3B model. Importantly, its efficiency and compact size make it accessible for broad use and for continued training beyond its initial release.
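
To make the translation claim concrete, the sketch below shows one way such an evaluation could be run: prompt the model for French-to-English translation and score the outputs with sacreBLEU. The prompt format, the chat checkpoint name, and the tiny example corpus are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch: greedy French-to-English translation with a fine-tuned checkpoint,
# scored with sacreBLEU. Checkpoint name and prompt format are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sacrebleu.metrics import BLEU

model_name = "croissantllm/CroissantLLMChat-v0.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

sources = ["Le modèle fonctionne sur du matériel grand public."]
references = [["The model runs on consumer-grade hardware."]]  # one reference stream

hypotheses = []
for src in sources:
    prompt = f"Traduis en anglais : {src}\nTraduction :"  # illustrative prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    gen = out[0][inputs["input_ids"].shape[1]:]
    hypotheses.append(tokenizer.decode(gen, skip_special_tokens=True).strip())

print(BLEU().corpus_score(hypotheses, references))
```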

Implications and Future Work

The efficiency of CroissantLLM opens avenues for widespread adoption in both research and industrial applications. This adoption, coupled with an open-source approach, is expected to spark innovation and advance French NLP. Future work might extend CroissantLLM's approach to other language pairs, ideally in a way that tackles the non-trivial challenge of balancing the quality and quantity of multilingual corpora. The hope is for CroissantLLM and its successors to increasingly reflect the linguistic and cultural diversity of global users, enhancing the accessibility and relevance of NLP technology.
