CroissantLLM: A Truly Bilingual French-English Language Model (2402.00786v5)

Published 1 Feb 2024 in cs.CL and cs.LG

Abstract: We introduce CroissantLLM, a 1.3B LLM pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further LLM research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81 % of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in LLMs.


Summary

  • The paper presents CroissantLLM, a 1.3 billion parameter model pre-trained on an equal mix of French and English text to address language bias in NLP.
  • The paper reports an 81% score on the Foundation Model Transparency Index, highlighting a commitment to openness and clear data provenance.
  • The paper demonstrates that CroissantLLM outperforms monolingual French models and rivals specialized models in translation tasks, broadening its practical applicability.

Introduction

The landscape of NLP has been shaped predominantly by LLMs focused on English. This focus has left a scarcity of resources and tools for other languages, including French. Addressing this gap, we present CroissantLLM, a 1.3 billion parameter LLM pre-trained on an equal mix of English and French text totaling 3 trillion tokens. CroissantLLM represents a deliberate shift away from the English-centric approach, aiming to balance performance across both languages while keeping the model manageable even on consumer-grade hardware.
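
As a rough illustration of that last point, a minimal sketch of loading a 1.3B model on consumer hardware with Hugging Face transformers is shown below. The checkpoint identifier croissantllm/CroissantLLMBase is assumed from the public release and may differ from the exact published name; at float16 precision the weights occupy roughly 2.6 GB, small enough for a laptop GPU or CPU.

```python
# Sketch: loading a 1.3B bilingual model on consumer hardware.
# The Hub identifier below is an assumption, not confirmed by the paper text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "croissantllm/CroissantLLMBase"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # ~2.6 GB of weights for 1.3B parameters
    device_map="auto",          # uses a GPU if present, otherwise falls back to CPU
)

# The model is bilingual, so French and English prompts are both natural inputs.
prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```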

Transparency and Bias

CroissantLLM embeds transparency into its development process, in contrast to the secrecy that often shrouds the training of state-of-the-art models. The model achieves an 81% score on the Foundation Model Transparency Index (FMTI), demonstrating a commitment to openness that exceeds most open initiatives. This emphasis on transparency aligns with current debates on how LLMs are built and used, as well as growing demands for clarity in AI, including clear data provenance and usage policies. The bias toward English in LLMs has skewed not only performance but also cultural representation. CroissantLLM mitigates this through its bilingual corpus, designed to foster diverse cultural knowledge, although, like any model of its size, it cannot capture the full scope of human language diversity.

Performance Benchmarking

CroissantLLM's performance demonstrates the successful integration of bilingual data into pre-training. English benchmarks place it in line with models such as TinyLlama (1.1B), and French evaluations show it outperforming monolingual French models. Translation is a particular strength: when fine-tuned, CroissantLLM rivals the specialized NLLB 1.3B model. Importantly, its efficiency and compact size make it accessible for broad use and for continued training beyond its initial release.
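
To make the translation claim concrete, the sketch below shows one way such an evaluation could be run: prompt the model for French-to-English translation and score the outputs with sacreBLEU. The prompt format, the chat checkpoint name, and the tiny example corpus are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch: greedy French-to-English translation with a fine-tuned checkpoint,
# scored with sacreBLEU. Checkpoint name and prompt format are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sacrebleu.metrics import BLEU

model_name = "croissantllm/CroissantLLMChat-v0.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

sources = ["Le modèle fonctionne sur du matériel grand public."]
references = [["The model runs on consumer-grade hardware."]]  # one reference stream

hypotheses = []
for src in sources:
    prompt = f"Traduis en anglais : {src}\nTraduction :"  # illustrative prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    gen = out[0][inputs["input_ids"].shape[1]:]
    hypotheses.append(tokenizer.decode(gen, skip_special_tokens=True).strip())

print(BLEU().corpus_score(hypotheses, references))
```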

Implications and Future Work

The efficiency of CroissantLLM opens avenues for widespread adoption in both research and industrial applications. This adoption, coupled with an open-source approach, is expected to spark innovation and advance French NLP. Future work might extend CroissantLLM's approach to other language pairs, ideally in a way that tackles the non-trivial challenge of balancing the quality and quantity of multilingual corpora. The hope is for CroissantLLM and its successors to increasingly reflect the linguistic and cultural diversity of global users, enhancing the accessibility and relevance of NLP technology.
