Optimizing Pretraining Data Mixtures with LLM-Estimated Utility (2501.11747v1)

Published 20 Jan 2025 in cs.CL and cs.AI

Abstract: LLMs improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x. Together, these approaches establish a new framework for automated, compute-efficient data mixing that is robust across training regimes.

Authors (6)
  1. William Held
  2. Bhargavi Paranjape
  3. Punit Singh Koura
  4. Mike Lewis
  5. Frank Zhang
  6. Todor Mihaylov

Summary

An Overview of Data Mixing Optimization for LLMs

The paper, "Optimizing Pretraining Data Mixtures with LLM-Estimated Utility," addresses the critical yet underexplored area of data mixing optimization during pretraining of LLMs. As these models expand in size and capability, the importance of the quality, quantity, and diversity of training data becomes increasingly vital. This paper investigates strategies for optimizing data mixtures, presenting novel methodologies that leverage utility estimates derived from both traditional ablation studies and state-of-the-art LLMs.

Key Insights and Methods

The paper begins by identifying the limitations of existing heuristic and learned data mixing methods, which have rarely been compared under matched data and model configurations and whose performance varies with compute and data budgets. The fundamental objective is to determine how best to sample from multiple corpora so that the resulting model performs well across a range of downstream tasks.

Two strategies are central to this paper: UtiliMax and Model Estimated Data Utility (MEDU).

  1. UtiliMax: Expanding on token-count heuristics, UtiliMax incorporates utility estimates from reduced-scale ablations. Drawing on principles from portfolio optimization, particularly the Markowitz model, UtiliMax treats datasets like assets in a portfolio, trading off estimated utility against a risk term that interpolates between utility maximization and diversity, while respecting dataset size (epoching) limits. This yields up to a 10.6x computational speedup over manual baselines and is robust across training regimes; a minimal sketch of this style of optimization follows this list.
  2. Model Estimated Data Utility (MEDU): This approach employs existing LLMs to estimate data utility from small samples, reducing computational demands by roughly 200x. MEDU prompts an LLM with descriptions of the target benchmarks and small excerpts from each corpus, then uses the resulting ratings as utility estimates, avoiding extensive ablation runs; a second sketch below illustrates the recipe.
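
The following is a minimal sketch of a Markowitz-style mixing optimization in the spirit of UtiliMax. The objective form, constraints, variable names, and all numbers here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs: per-dataset utility estimates (from small-scale
# ablations or from MEDU) and token counts for each source corpus.
utilities = np.array([0.8, 0.5, 0.3, 0.6])    # estimated utility per dataset
tokens = np.array([50e9, 200e9, 10e9, 80e9])  # available tokens per dataset
budget = 100e9                                # total training-token budget
max_epochs = 4.0                              # cap on repeats of any source
risk_aversion = 1.0                           # utility vs. diversity trade-off

n = len(utilities)
# Diagonal "risk" matrix: concentrating weight on one source is penalized,
# which pushes the solution toward a more diverse mix (Markowitz-style).
Sigma = np.eye(n)

def objective(w):
    # Negative of (expected utility - risk penalty), since we minimize.
    return -(w @ utilities) + risk_aversion * (w @ Sigma @ w)

# Weights form a distribution; each weight is capped so no dataset is
# repeated more than max_epochs times under the token budget.
constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
bounds = [(0.0, min(1.0, max_epochs * t / budget)) for t in tokens]

result = minimize(objective, np.full(n, 1.0 / n),
                  bounds=bounds, constraints=constraints, method="SLSQP")
print({f"dataset_{i}": round(w, 3) for i, w in enumerate(result.x)})
```

Raising `risk_aversion` moves the solution toward a uniform, diversity-maximizing mix, while setting it to zero concentrates weight on the highest-utility sources, matching the interpolation described above.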

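And here is a sketch of the MEDU recipe: prompt an LLM with a description of the target benchmarks and a small document sample, and use its rating as a utility estimate. The prompt wording, rating scale, and model choice are assumptions for illustration; the paper's actual prompts and aggregation differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any capable LLM would do

# Hypothetical benchmark description; the paper derives these from the
# actual evaluation suite rather than a single hand-written sentence.
BENCHMARK_DESC = ("Grade-school math word problems requiring multi-step "
                  "arithmetic reasoning, plus general commonsense QA.")

def estimate_utility(document: str) -> int:
    """Ask an LLM to rate how useful a document is for the target benchmarks."""
    prompt = (
        f"Target capabilities: {BENCHMARK_DESC}\n\n"
        f"Document sample:\n{document[:2000]}\n\n"
        "On a scale of 1 (useless) to 5 (highly useful), how useful would "
        "pretraining on this document be for the target capabilities? "
        "Answer with a single integer."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())

# Averaging such scores over a small sample of documents per corpus yields
# per-corpus utilities that can replace ablation results in UtiliMax.
```
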
Experimental Results and Baseline Comparison

The researchers measured the efficacy of different data mixing strategies under both compute-constrained and data-constrained scenarios on the Dolma V1.7 dataset. The experiments show that:

  • Token-count heuristics, particularly UniMax, performed best among the baselines because they preserve diversity under epoching constraints; a sketch of a UniMax-style allocation follows this list.
  • UtiliMax, when paired with robust utility estimates, consistently outperformed the baselines, making effective use of ablation results.
  • Plugging MEDU estimates into UtiliMax outperformed ablation-derived utility estimates, highlighting the potential for LLMs to direct data mixing strategies efficiently through model-based estimation.
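
For concreteness, here is a small sketch of a UniMax-style allocation: split the token budget evenly across corpora, cap each corpus at a maximum number of epochs, and redistribute any leftover budget. This follows the common description of UniMax; the greedy procedure and values below are assumptions for illustration.

```python
def unimax(tokens, budget, max_epochs=1.0):
    """Greedy UniMax-style allocation: give every corpus an equal share of
    the remaining budget, capped at max_epochs passes over its tokens."""
    remaining = budget
    alloc = {}
    # Small corpora hit their epoch cap first, so process in size order.
    order = sorted(range(len(tokens)), key=lambda j: tokens[j])
    for rank, i in enumerate(order):
        fair_share = remaining / (len(tokens) - rank)
        alloc[i] = min(fair_share, max_epochs * tokens[i])
        remaining -= alloc[i]
    total = sum(alloc.values())
    return [alloc[i] / total for i in range(len(tokens))]

# Example: four corpora, a 100B-token budget, at most 2 epochs per corpus.
print(unimax([50e9, 200e9, 10e9, 80e9], budget=100e9, max_epochs=2.0))
```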

Implications and Future Directions

From a practical standpoint, this research outlines a scalable methodology for optimizing the pretraining data mixture, improving LLM efficiency and downstream performance. Using LLMs for utility estimation is a promising direction for future work, since it sharply reduces computational cost without sacrificing quality. Theoretically, adapting portfolio optimization frameworks to data selection provides a structured way to balance the many factors that influence model pretraining.

Given the paper's findings, prospective research could focus on improving the precision of model-based utility estimates or explore combining multiple LLMs for better utility prediction. Furthermore, extending this framework to multilingual datasets or domain-specific applications could broaden its applicability across diverse NLP challenges.

In conclusion, the paper advances the discourse on pretraining strategies for LLMs by proposing methodologies that not only optimize data usage but also leverage advanced model capabilities to refine these processes further. This approach could significantly influence future paradigms in AI training and application.