QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining (2504.16511v2)

Published 23 Apr 2025 in cs.CL

Abstract: Quality and diversity are two critical metrics for the training data of LLMs, positively impacting performance. Existing studies often optimize these metrics separately, typically by first applying quality filtering and then adjusting data proportions. However, these approaches overlook the inherent trade-off between quality and diversity, necessitating their joint consideration. Given a fixed training quota, it is essential to evaluate both the quality of each data point and its complementary effect on the overall dataset. In this paper, we introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for LLM pretraining while balancing both quality and diversity. Specifically, we first propose multiple criteria to measure data quality and employ domain classification to distinguish data points, thereby measuring overall diversity. QuaDMix then employs a unified parameterized data sampling function that determines the sampling probability of each data point based on these quality and diversity related labels. To accelerate the search for the optimal parameters involved in the QuaDMix framework, we conduct simulated experiments on smaller models and use LightGBM for parameters searching, inspired by the RegMix method. Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks. These results outperform the independent strategies for quality and diversity, highlighting the necessity and ability to balance data quality and diversity.

Authors (11)
  1. Fengze Liu (18 papers)
  2. Weidong Zhou (12 papers)
  3. Binbin Liu (13 papers)
  4. Zhimiao Yu (4 papers)
  5. Yifan Zhang (245 papers)
  6. Haobin Lin (3 papers)
  7. Yifeng Yu (34 papers)
  8. Xiaohuan Zhou (13 papers)
  9. Taifeng Wang (22 papers)
  10. Yong Cao (33 papers)
  11. Bingni Zhang (3 papers)

Summary

QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Fengze Liu et al. present QuaDMix, a unified data selection framework that addresses the inherent trade-off between quality and diversity in LLM pretraining data. The paper introduces a methodology for optimizing the training data distribution around the dual objectives of data quality and diversity.

Core Concepts and Methodology

The traditional approach of addressing quality and diversity separately in LLM pretraining often leads to suboptimal configurations because of the complex interplay between these factors. QuaDMix is designed to optimize both simultaneously, recognizing their joint influence on model performance. The framework employs multiple quality measurement criteria and uses domain classification to manage diversity within the dataset. A parameterized sampling function dictates the sampling probability of each data point, balancing quality and diversity. A significant innovation is the use of simulated experiments on smaller proxy models to optimize these parameters efficiently, leveraging LightGBM for the parameter search, inspired by RegMix.
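
To make the idea of a parameterized sampling function concrete, the sketch below merges per-criterion quality scores with per-domain weights, passes the result through a domain-specific response curve, and scales it by that domain's share of the token quota. The sigmoid form, the parameter names (`quality_weights`, `steepness`, `threshold`, `domain_budget`), and the merging rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sampling_probability(quality_scores, domain, params):
    """Toy parameterized sampling function in the spirit of QuaDMix.

    quality_scores : dict {criterion_name: score in [0, 1]} for one document
    domain         : domain label assigned by an upstream classifier
    params         : searchable parameters (per-domain weights, curve shape, budgets)

    The weighted-average merge and the per-domain sigmoid below are illustrative
    assumptions; the paper defines its own functional form.
    """
    w = params["quality_weights"][domain]                      # weight per quality criterion
    merged = sum(w[c] * quality_scores[c] for c in w) / sum(w.values())
    steepness = params["steepness"][domain]                    # how sharply quality is preferred
    threshold = params["threshold"][domain]                    # quality level mapped to probability 0.5
    keep_prob = 1.0 / (1.0 + np.exp(-steepness * (merged - threshold)))
    return params["domain_budget"][domain] * keep_prob         # scale by the domain's token share


# Example: one web document scored by two hypothetical quality criteria.
params = {
    "quality_weights": {"web": {"fluency": 0.7, "educational_value": 0.3}},
    "steepness": {"web": 8.0},
    "threshold": {"web": 0.5},
    "domain_budget": {"web": 0.4},
}
doc_scores = {"fluency": 0.8, "educational_value": 0.6}
print(sampling_probability(doc_scores, "web", params))
```

The key point is that every knob in `params` is something the outer search can tune per domain, which is what lets a single function trade off quality against domain coverage.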

Experimental Validation

The experimental setup involved training 3,000 small proxy models on datasets generated with different QuaDMix parameter settings; models trained on the resulting optimized mixture achieved an average performance improvement of 7.2% across multiple benchmarks. This outperformed strategies that optimize data quality or diversity independently, validating the necessity of a joint optimization approach. The paper further demonstrates that different quality criteria affect downstream task performance differently, and that combining them yields consistent improvements across various tasks.
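
A rough sketch of this proxy-based parameter search is given below: sample many candidate QuaDMix parameter vectors, train a small proxy model on data selected with each and record its benchmark score, then fit a LightGBM regressor that predicts score from parameters and use it to rank a large pool of unseen candidates (in the spirit of RegMix). The parameter dimensionality, the candidate counts, and the synthetic stand-in for proxy training are assumptions made so the sketch runs end to end.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_trials, n_params = 3000, 24                  # 3,000 proxy runs; parameter count is assumed
param_samples = rng.uniform(0.0, 1.0, size=(n_trials, n_params))

def train_proxy_and_evaluate(params: np.ndarray) -> float:
    """Stand-in for the expensive step: in the real pipeline this selects data
    with `params`, trains a small proxy model, and evaluates it on the target
    benchmarks. A synthetic score is returned here so the sketch is runnable."""
    return float(np.sin(params[:8].sum()) + 0.1 * params[8:].mean())

scores = np.array([train_proxy_and_evaluate(p) for p in param_samples])

# Fit a regressor that maps sampling parameters to predicted benchmark score.
regressor = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
regressor.fit(param_samples, scores)

# Rank a large pool of random candidate parameter vectors with the cheap regressor
# and keep the best one for full-scale training.
candidates = rng.uniform(0.0, 1.0, size=(100_000, n_params))
best_params = candidates[np.argmax(regressor.predict(candidates))]
print("best predicted score:", regressor.predict(best_params[None]).item())
```

The regressor replaces thousands of additional large-scale training runs with a cheap prediction step, which is what makes searching a high-dimensional sampling-parameter space tractable.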

Results and Insights

Results indicate that the optimal data mixture varies with the quality criteria employed, underscoring the importance of joint optimization. The framework also enables task-focused data selection: the regression model trained on the proxy models' results can be steered toward specific downstream tasks. The QuaDMix-BMK configuration, which targets the benchmark training sets during parameter optimization, exhibited superior results, confirming the framework's adaptability and efficacy.
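
To make the task-focused idea concrete, the sketch below fits one regressor per target benchmark on the proxy results and ranks candidate parameter vectors by a weighted combination of the predicted scores, roughly in the spirit of the QuaDMix-BMK configuration. The benchmark names, weights, and synthetic proxy scores are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
import lightgbm as lgb

def select_task_focused_params(param_samples, per_benchmark_scores, weights, candidates):
    """Rank candidate QuaDMix parameter vectors for a chosen set of target tasks.

    param_samples        : (n_trials, n_params) parameters used for the proxy runs
    per_benchmark_scores : {benchmark_name: (n_trials,) proxy scores on that benchmark}
    weights              : {benchmark_name: importance assigned to that benchmark}
    candidates           : (n_candidates, n_params) parameter vectors to rank
    """
    combined = np.zeros(len(candidates))
    for name, scores in per_benchmark_scores.items():
        reg = lgb.LGBMRegressor(n_estimators=300)
        reg.fit(param_samples, scores)                  # per-benchmark predictor
        combined += weights[name] * reg.predict(candidates)
    return candidates[np.argmax(combined)]              # best parameters for the weighted target


# Example with two hypothetical target benchmarks and synthetic proxy scores.
rng = np.random.default_rng(1)
param_samples = rng.uniform(size=(3000, 24))
candidates = rng.uniform(size=(50_000, 24))
per_benchmark_scores = {
    "reasoning": np.sin(param_samples[:, :8].sum(axis=1)),
    "reading":   param_samples[:, 8:16].mean(axis=1),
}
weights = {"reasoning": 0.7, "reading": 0.3}
best = select_task_focused_params(param_samples, per_benchmark_scores, weights, candidates)
```

Changing only the `weights` dictionary redirects the same proxy results toward a different downstream preference, without retraining any proxy models.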

Implications and Future Developments

The QuaDMix framework marks a significant step forward in optimizing data configurations for LLM pretraining, addressing both theoretical and practical aspects of data selection. The methodology offers a robust and efficient mechanism for enhancing the performance of LLMs by refining data quality and distribution strategies. Future developments could explore further tuning of the parameter space and enhancing the proxy modeling techniques to offer more accurate predictions without exhaustive large-scale training.

In conclusion, the QuaDMix framework provides compelling insights into the continuous advancement of AI training methodologies, particularly in achieving efficient and balanced pretraining for LLMs. As LLM architectures evolve, frameworks like QuaDMix will be integral in optimizing their training processes, ensuring both quality and diversity are maximized for improved model capabilities.
