EuroLLM: Multilingual Language Models for Europe (2409.16235v1)

Published 24 Sep 2024 in cs.CL

Abstract: The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.

Authors (15)

Pedro Henrique Martins
Patrick Fernandes
João Alves
Nuno M. Guerreiro
Ricardo Rei
Duarte M. Alves
José Pombal
Amin Farajian
Manuel Faysse
Mateusz Klimaszewski
Pierre Colombo
Barry Haddow
José G. C. de Souza
Alexandra Birch
André F. T. Martins

Summary

Overview of the EuroLLM Project: Multilingual LLMs for Europe

The paper, "EuroLLM: Multilingual Language Models for Europe," authored by researchers from institutions including Unbabel, Instituto de Telecomunicações, and Carnegie Mellon University, describes the design and development of LLMs tailored to Europe's linguistic diversity. It addresses the lack of robust, open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages and several additional languages.

Contributions and Methodologies

The EuroLLM project is set against the backdrop of the notable advancements seen in LLMs, as demonstrated by models such as OpenAI’s GPT series and Anthropic's Claude. However, these models have predominantly been focused on English and a handful of high-resource languages, leaving many European languages underrepresented. The primary contributions of the EuroLLM project include:

Data Collection and Filtering: The team collected and filtered multilingual data from various sources, grouped into four categories: web data, parallel data, code/math data, and high-quality data. They combined heuristic and perplexity filtering, deduplication, and quality thresholds to keep the training data relevant and clean, drawing, for example, on the FineWeb-edu dataset for English.
Data Mixture Decisions: The paper details the considerations for balancing the composition of the training corpus, which involved scaling laws and optimal proportions of parallel and high-quality data. Through empirical analysis, a mixture including 20% parallel data and strategic repetition of high-quality data was chosen to maximize model performance across languages.
Multilingual Tokenizer: The researchers developed a multilingual tokenizer with a vocabulary of 128,000 pieces, achieving lower fertility (fewer tokens per word) on European languages than the tokenizers of existing models such as Mistral and LLaMa-3. A larger vocabulary tokenizes all target languages more efficiently but increases the embedding parameter count, so the vocabulary size was chosen to balance the two.
Modeling Choices: EuroLLM adopted a standard, dense Transformer architecture with specific enhancements such as Grouped Query Attention (GQA), rotary positional embeddings (RoPE), and RMSNorm, among other features. These choices aimed to strike an optimal balance between training stability, computational efficiency, and downstream task performance.
Training and Fine-Tuning: The initial model, EuroLLM-1.7B, was pre-trained on 4 trillion tokens using a trapezoid learning-rate scheduler, which the authors found more effective than a cosine scheduler across benchmarks. The model was then fine-tuned on EuroBlocks, a dataset of 1M samples covering various languages and tasks, to create EuroLLM-1.7B-Instruct, a model specialized in following natural language instructions.
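
To make the tokenizer comparison above concrete, here is a minimal sketch of how fertility can be measured, assuming the common definition of fertility as the average number of tokens per whitespace-separated word. The `toy_tokenize` function is a hypothetical stand-in used only for illustration; it is not the EuroLLM tokenizer, and the paper's exact measurement procedure may differ.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str, max_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: cuts each word into chunks of at most max_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return pieces


if __name__ == "__main__":
    sample = ["abcd efghij"]  # 2 words; the toy tokenizer yields 3 pieces
    print(fertility(sample, toy_tokenize))
```

A lower value means fewer tokens are spent per word, so more text fits in the same context window and sequence length budget. Any real tokenizer can be plugged in by passing a function that maps a string to a list of tokens.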
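
The trapezoid scheduler mentioned above can be sketched as a linear warmup, a constant plateau, and a linear decay. The function below is an illustrative sketch only: the name `trapezoid_lr`, the default peak learning rate, the warmup and decay fractions, and the final ratio are all assumptions chosen for the example, not values taken from this summary.

```python
def trapezoid_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-4,      # assumed peak value, for illustration only
    warmup_frac: float = 0.1,   # assumed fraction of steps spent warming up
    decay_frac: float = 0.1,    # assumed fraction of steps spent decaying
    final_ratio: float = 0.1,   # assumed final LR as a fraction of peak_lr
) -> float:
    """Learning rate at `step` for a warmup / plateau / linear-decay schedule."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Flat top of the trapezoid.
        return peak_lr
    # Linear decay from peak_lr down to final_ratio * peak_lr.
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


if __name__ == "__main__":
    for s in (0, 99, 500, 950, 1000):
        print(s, trapezoid_lr(s, total_steps=1000))
```

Unlike a cosine schedule, which starts lowering the learning rate right after warmup, this shape holds the peak rate for most of training and defers the decay to the final phase.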

Results and Benchmarks

The EuroLLM models were evaluated on multiple fronts:

General Benchmarks: The models were tested on the Hellaswag and Arc Challenge benchmarks. EuroLLM-1.7B and EuroLLM-1.7B-Instruct outperformed comparable models (e.g., Gemma-2b, TinyLlama) in multilinguality and overall task performance, showcasing their efficacy in commonsense reasoning and science exam question answering.
Machine Translation: EuroLLM-1.7B-Instruct demonstrated superior performance in machine translation on datasets like Flores-200, WMT-23, and WMT-24, evaluated using Comet-22 scores. It showed significant improvements over Gemma-2b-Instruct and was competitive with Gemma-7b-Instruct across language pairs, illustrating its capability in generating accurate translations for European languages.

Implications and Future Work

The development of EuroLLM models signifies a crucial step towards inclusive multilingual AI systems that cater to the linguistic diversity of Europe. These models not only advance the capabilities in understanding and generating text across numerous languages but also represent a commitment to open science and reproducible research, contrasting with the proprietary nature of many advanced models.

Future endeavors in the EuroLLM project will likely focus on scaling the model parameters and further refining the data quality to enhance the models' performance. Additionally, there is a potential for exploring other downstream applications and fine-tuning tasks, expanding the utility and impact of these multilingual models in various domains, including translation, education, and beyond.

In conclusion, the EuroLLM project advances the development of multilingual LLMs, reporting strong results on multilingual benchmarks and on machine translation across diverse European languages. This work lays a foundation for ongoing and future work on inclusive, high-performing LLMs grounded in open research and multilingual capability.
