EuroLLM: Multilingual Language Models for Europe

Published 24 Sep 2024 in cs.CL | (2409.16235v1)

Abstract: The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.

Summary

  • The paper introduces a multilingual language model tailored to Europe’s diverse linguistic needs by addressing underrepresented EU languages.
  • It employs rigorous data collection, a balanced mix with 20% parallel data, and a specialized multilingual tokenizer along with Transformer enhancements.
  • The released models outperform comparable small models such as Gemma-2b and TinyLlama on Hellaswag and Arc Challenge, and EuroLLM-1.7B-Instruct improves over Gemma-2b-Instruct on translation benchmarks such as Flores-200.

Overview of the EuroLLM Project: Multilingual LLMs for Europe

The paper, "EuroLLM: Multilingual Language Models for Europe," authored by researchers from institutions including Unbabel, Instituto de Telecomunicações, and Carnegie Mellon University, describes the development of LLMs tailored to Europe's linguistic diversity. It addresses the shortage of robust, open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages and several additional relevant languages.

Contributions and Methodologies

The EuroLLM project is set against the backdrop of the notable advancements seen in LLMs, as demonstrated by models such as OpenAI’s GPT series and Anthropic's Claude. However, these models have predominantly been focused on English and a handful of high-resource languages, leaving many European languages underrepresented. The primary contributions of the EuroLLM project include:

  1. Data Collection and Filtering: Multilingual data was collected from various sources and grouped into four categories: web data, parallel data, code/math data, and high-quality data. The team combined heuristic and perplexity filtering, deduplication, and specific quality thresholds to ensure the data's relevance and quality for training. For English web data, for example, the FineWeb-edu dataset was used.
  2. Data Mixture Decisions: The paper details the considerations for balancing the composition of the training corpus, which involved scaling laws and optimal proportions of parallel and high-quality data. Through empirical analysis, a mixture including 20% parallel data and strategic repetition of high-quality data was chosen to maximize model performance across languages.
  3. Multilingual Tokenizer: The researchers developed a multilingual tokenizer with a vocabulary of 128,000 pieces, ensuring lower fertilities for European languages compared to existing models like Mistral and LLaMa-3. This optimization was crucial for efficient tokenization across all target languages, balancing the trade-offs between vocabulary size and embedding parameter count.
  4. Modeling Choices: EuroLLM adopted a standard, dense Transformer architecture with specific enhancements such as Grouped Query Attention (GQA), rotary positional embeddings (RoPE), and RMSNorm, among other features. These choices aimed to strike an optimal balance between training stability, computational efficiency, and downstream task performance.
  5. Training and Fine-Tuning: The initial model, EuroLLM-1.7B, was pre-trained on 4 trillion tokens with a learning rate scheduled through a trapezoid scheduler, shown to be more effective than a cosine scheduler across benchmarks. Further, the model was fine-tuned using a dataset named EuroBlocks, comprising 1M samples for various languages and tasks, to create EuroLLM-1.7B-Instruct—a model specialized in following natural language instructions.
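The tokenizer comparison in item 3 relies on fertility, commonly defined as the average number of tokens produced per word. The sketch below shows one way to compute it, assuming whitespace-separated words. The `toy_tokenize` function is a hypothetical stand-in that splits words into fixed-length pieces; it is not the EuroLLM tokenizer, and a real subword tokenizer would be passed in its place.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: cut each word into pieces of at most piece_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


if __name__ == "__main__":
    # Short words fit in one piece each; long words are split into several.
    print(fertility(["the cat sat"], toy_tokenize))
    print(fertility(["multilingual tokenizer"], toy_tokenize))
```

A lower fertility means fewer tokens per word, so the same text occupies less of the context window and costs fewer training and inference steps. The trade-off noted above is that a larger vocabulary lowers fertility but increases the number of embedding parameters.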
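The trapezoid learning-rate schedule mentioned in item 5 can be sketched as a linear warm-up, a constant plateau, and a linear decay. The function below is a minimal illustration of that shape only. The parameter names and the default values for `warmup_frac`, `decay_frac`, and `final_ratio` are assumptions chosen for the example and are not taken from this summary.

```python
def trapezoid_lr(
    step: int,
    total_steps: int,
    peak_lr: float,
    warmup_frac: float = 0.1,   # assumed fraction of steps spent warming up
    decay_frac: float = 0.1,    # assumed fraction of steps spent decaying
    final_ratio: float = 0.1,   # assumed final LR as a fraction of peak_lr
) -> float:
    """Learning rate at `step` for a warm-up / plateau / linear-decay schedule."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Plateau: hold the peak learning rate.
        return peak_lr
    # Linear decay from peak_lr down to final_ratio * peak_lr.
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


if __name__ == "__main__":
    for s in (0, 99, 500, 950, 1000):
        print(s, trapezoid_lr(s, total_steps=1000, peak_lr=1.0))
```

Unlike a cosine schedule, which starts lowering the learning rate immediately after warm-up, this shape keeps the rate at its peak for most of training and concentrates the decay at the end.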

Results and Benchmarks

The EuroLLM models were evaluated on multiple fronts:

  1. General Benchmarks: The models were tested on the Hellaswag and Arc Challenge benchmarks. EuroLLM-1.7B and EuroLLM-1.7B-Instruct outperformed comparable models (e.g., Gemma-2b, TinyLlama) in multilinguality and overall task performance, showcasing their efficacy in commonsense reasoning and science exam question answering.
  2. Machine Translation: EuroLLM-1.7B-Instruct demonstrated superior performance in machine translation on datasets like Flores-200, WMT-23, and WMT-24, evaluated using Comet-22 scores. It showed significant improvements over Gemma-2b-Instruct and was competitive with Gemma-7b-Instruct across language pairs, illustrating its capability in generating accurate translations for European languages.

Implications and Future Work

The development of EuroLLM models signifies a crucial step towards inclusive multilingual AI systems that cater to the linguistic diversity of Europe. These models not only advance the capabilities in understanding and generating text across numerous languages but also represent a commitment to open science and reproducible research, contrasting with the proprietary nature of many advanced models.

Future endeavors in the EuroLLM project will likely focus on scaling the model parameters and further refining the data quality to enhance the models' performance. Additionally, there is a potential for exploring other downstream applications and fine-tuning tasks, expanding the utility and impact of these multilingual models in various domains, including translation, education, and beyond.

In conclusion, the EuroLLM project advances the development of multilingual LLMs, reporting strong results on multilingual general benchmarks and on machine translation across European languages. This work lays a foundation for future development of inclusive, high-performing LLMs grounded in open research and multilingual capability.
