Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models (2505.22232v2)

Published 28 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: High-quality multilingual training data is essential for effectively pretraining LLMs. Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.

Authors (18)
  1. Mehdi Ali
  2. Manuel Brack
  3. Max Lübbering
  4. Elias Wendt
  5. Abbas Goher Khan
  6. Richard Rutmann
  7. Alex Jude
  8. Maurice Kraus
  9. Alexander Arno Weber
  10. David Kaczér
  11. Florian Mai
  12. Lucie Flek
  13. Rafet Sifa
  14. Nicolas Flores-Herr
  15. Joachim Köhler
  16. Patrick Schramowski
  17. Michael Fromm
  18. Kristian Kersting
