Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
The paper "Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data" presents advancements in refining training datasets for LLMs and addresses inherent challenges seen in data-driven model training processes. As LLMs continue to grow in size and capability, the quality of the training dataset becomes a decisive factor for optimizing performance, pushing researchers to establish robust methods for data filtering and verification.
The work begins by acknowledging the vital role data quality plays in enhancing LLM capabilities across domains such as code generation and scientific research. However, the paper identifies two prominent hurdles in current approaches: inefficient data-verification strategies, which make it costly to measure a candidate dataset's actual impact on training, and subjective criteria for selecting the seed data used to train quality classifiers, which often rely heavily on human expertise.
The authors address both problems. First, they introduce an efficient verification strategy that enables rapid evaluation of a dataset's impact on LLM training at greatly reduced computational cost. Second, they propose a pragmatic data-filtering pipeline optimized for selecting positive and negative samples. The pipeline employs fastText, a lightweight and efficient text classifier, and couples it with the verification strategy to improve robustness and classifier quality.
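To make the pipeline concrete, the sketch below shows how a lightweight fastText quality classifier of this kind can be trained and applied using the open-source fasttext Python package. The file name, label scheme, hyperparameters, and acceptance threshold are illustrative assumptions for this sketch, not the paper's exact recipe.

```python
# Minimal sketch of a fastText-based quality filter. Assumes a training file
# "seed_train.txt" built from verified positive/negative seed samples, with
# one example per line in fastText's supervised format, e.g.:
#   __label__hq  <normalized document text>
#   __label__lq  <normalized document text>
# Labels, hyperparameters, and the 0.5 threshold are illustrative only.
import fasttext

# Train a lightweight supervised classifier (word bigrams help capture short phrases).
model = fasttext.train_supervised(
    input="seed_train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=100,
)

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the document as high quality."""
    # fastText expects single-line input; collapse newlines before predicting.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

# Example: filter a raw corpus, keeping only documents the classifier accepts.
raw_docs = [
    "An informative article explaining photosynthesis in detail ...",
    "click here buy now !!! free free free",
]
filtered = [doc for doc in raw_docs if keep_document(doc)]
print(f"kept {len(filtered)} of {len(raw_docs)} documents")
```

Because fastText inference is cheap, a classifier like this can be run over web-scale corpora at a fraction of the cost of LLM-based scoring, which is what makes the repeated filter-and-verify cycle practical.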
They apply this pipeline to the existing FineWeb and Chinese FineWeb datasets, producing the Ultra-FineWeb dataset of roughly 1 trillion English tokens and 120 billion Chinese tokens. Empirical evidence shows that models trained on Ultra-FineWeb significantly outperform those trained on the prior datasets across several benchmark tasks, while the computational resources required for filtering, formerly quite substantial, are drastically reduced, improving both the speed and the quality of data acquisition.
The methodology encompasses several key stages. First, it establishes an efficient verification strategy that reduces experimental cost while maintaining evaluation accuracy: a nearly-trained LLM is used for rapid data-quality assessment, so candidate data can be judged by a short continuation of training rather than a full pretraining run. High-quality seed data for the classifiers is then selected based on this verified feedback on LLM performance. Finally, the filtering process is iterative, allowing the seed pool and classifier parameters to be updated dynamically as new data is verified (a simplified sketch of this loop follows below).
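The sketch below illustrates this verification-driven seed selection in simplified form, under the assumption that verification means briefly continuing training of the nearly-trained LLM on candidate data and comparing benchmark scores against a baseline continuation. The function names (continue_training, evaluate) and the toy scores are hypothetical stand-ins, not the paper's actual training or evaluation code.

```python
# Simplified sketch of verification-driven seed selection. The expensive steps
# (continued training from a nearly-trained checkpoint, benchmark evaluation)
# are passed in as callables; everything here is a hypothetical placeholder.
from typing import Callable, Dict, List, Tuple

def select_seeds(
    candidates: List[Tuple[str, List[str]]],        # (candidate name, documents)
    baseline: Dict[str, float],                     # benchmark scores of the baseline run
    continue_training: Callable[[List[str]], str],  # short run from a nearly-trained checkpoint
    evaluate: Callable[[str], Dict[str, float]],    # benchmark evaluation of a checkpoint
) -> List[str]:
    """Keep documents from candidate sets whose continued-training run beats the baseline."""
    positive_seeds: List[str] = []
    for _name, docs in candidates:
        ckpt = continue_training(docs)              # cheap compared with pretraining from scratch
        scores = evaluate(ckpt)
        # Average benchmark gain over the baseline continuation.
        gain = sum(scores[k] - baseline[k] for k in baseline) / len(baseline)
        if gain > 0:                                # candidate verified as high quality
            positive_seeds.extend(docs)
    return positive_seeds                           # feeds back into classifier training

# Toy usage with stub functions standing in for real training and evaluation.
baseline = {"mmlu": 0.40, "cmmlu": 0.38}
stub_train = lambda docs: "ckpt-after-short-continuation"       # placeholder checkpoint id
stub_eval = lambda ckpt: {"mmlu": 0.42, "cmmlu": 0.39}          # placeholder scores
seeds = select_seeds([("candidate-A", ["doc1", "doc2"])], baseline, stub_train, stub_eval)
print(seeds)
```

In the full pipeline this step is repeated: newly verified samples refresh the seed pool, the fastText classifier is retrained on it, and the corpus is re-filtered before the next verification round.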
The paper reports strong numerical results: models trained on Ultra-FineWeb achieve significant performance improvements, notably on the MMLU and CMMLU evaluations, compared with models trained on data filtered by conventional strategies. The classifier pipeline, strengthened by more diverse training seeds and recipes, also delivers better inference efficiency, scalability, and data-quality assurance.
The implications of this research are multifaceted. Practically, it reduces the computational overhead of data filtering and verification, allowing researchers to obtain higher-quality datasets without exhausting their compute budgets. Theoretically, it reinforces the view that careful, high-quality data selection yields substantial improvements in model robustness and performance. Future developments may extend these principles into specialized domains, refining tailored data-filtering strategies for everything from technical fields to complex problem-solving environments.
In conclusion, Ultra-FineWeb represents a meaningful shift in how training data is optimized for LLMs, addressing the limitations of previous methodologies with its efficient verification and filtering pipeline. By raising the quality of training data, and with it model efficacy, this work sets the stage for continued progress in AI model development. Future work should extend these insights to domain-specific applications and explore more advanced evaluation metrics to better quantify the impact of data quality.