Performance Gains from Domain Upsampling at the End of Training: A Summary
LLMs rely on diverse pretraining datasets to achieve robust performance across benchmarks. These datasets typically combine vast amounts of CommonCrawl (CC) web data with smaller, domain-specific sources. Optimizing the proportions of these data sources, however, is expensive at large FLOP scales. The paper introduces "domain upsampling" at the end of training, a technique that increases the representation of domain-specific datasets during the final portion of pretraining to improve performance on targeted benchmarks while preserving general capabilities.
Key Contributions
The authors present several notable contributions:
- Baseline Data Mix Construction: A baseline data mix was constructed from publicly available datasets, organized into four broad categories: Large-Scale CC, Small-Scale CC, Domain-Specific data, and Code datasets. The proportions across categories were chosen heuristically to balance information density and diversity.
- Implementation of Domain Upsampling: Domain upsampling was introduced as a pretraining intervention in which domain-specific datasets are upsampled in the final stages of training (see the configuration sketch after this list). This method yielded improvements of up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval for a 7B model.
- Ablation Study on Duration: An ablation study varied the duration of domain upsampling from 5% to 30% of the training budget. The best results were achieved with 10%-20% upsampling, balancing general language modeling capabilities against performance on targeted benchmarks.
- Characterization of Dataset Utility: The paper also used domain upsampling as a cost-effective way to characterize the contribution of individual datasets to model performance. For instance, removing math-heavy data subsets during the upsampling phase revealed how much they contribute to benchmarks like GSM8K.
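To make the mechanics concrete, the sketch below shows one way a two-phase data mix with end-of-training domain upsampling could be expressed in code. The four category names mirror the paper's broad groupings, but the specific mixing weights, the 20% switch point, and the `mixing_weights` helper are illustrative assumptions rather than the authors' actual configuration.

```python
# Hypothetical sketch of a two-phase pretraining data mix with domain
# upsampling in the final portion of training. All weights and the switch
# point are illustrative, not the paper's actual values.

# Phase 1: baseline mix dominated by large-scale CommonCrawl-style web data.
BASELINE_MIX = {
    "large_scale_cc": 0.60,   # assumed proportion
    "small_scale_cc": 0.20,   # assumed proportion
    "domain_specific": 0.10,  # assumed proportion (math, encyclopedic, etc.)
    "code": 0.10,             # assumed proportion
}

# Phase 2: domain-upsampled mix that shrinks web data in favor of
# domain-specific and code sources.
UPSAMPLED_MIX = {
    "large_scale_cc": 0.25,
    "small_scale_cc": 0.15,
    "domain_specific": 0.35,
    "code": 0.25,
}

TOTAL_TOKENS = 1_000_000_000_000  # 1T-token budget, as in the paper
UPSAMPLE_FRACTION = 0.20          # upsample during the final 20% of training


def mixing_weights(tokens_seen: int) -> dict[str, float]:
    """Return the sampling weights to use at a given point in training."""
    switch_point = TOTAL_TOKENS * (1.0 - UPSAMPLE_FRACTION)
    return UPSAMPLED_MIX if tokens_seen >= switch_point else BASELINE_MIX


if __name__ == "__main__":
    for tokens in (0, 500_000_000_000, 850_000_000_000):
        print(tokens, mixing_weights(tokens))
```

In practice the switch would live inside the data loader rather than a standalone helper, but the structure (one mix for most of training, a more concentrated mix for the final stretch) is the core of the intervention.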
Training Details
The experiments were conducted on 7-billion-parameter models trained for 1 trillion tokens, using the MPT architecture. Key hyperparameters included the LionW optimizer, a learning rate of 0.00012, and an inverse square root learning rate schedule. Evaluations used the Gauntlet v0.3, which aggregates performance across 35 popular in-context learning tasks.
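For reference, an inverse square root schedule typically pairs a short warmup with a decay proportional to 1/sqrt(step). The sketch below implements that generic form; only the learning rate of 0.00012 comes from the summary above, while the warmup length and the treatment of 0.00012 as the peak value are assumptions.

```python
import math

PEAK_LR = 0.00012     # learning rate reported above, treated here as the peak
WARMUP_STEPS = 1_000  # assumed warmup length (not stated in the summary)


def inverse_sqrt_lr(step: int) -> float:
    """Generic inverse square root schedule: linear warmup, then 1/sqrt decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Scale the decay so the schedule is continuous at the end of warmup.
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)


if __name__ == "__main__":
    for s in (100, 1_000, 10_000, 100_000):
        print(s, f"{inverse_sqrt_lr(s):.2e}")
```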
Results
Baseline Model Performance
The baseline data mix demonstrated competitive performance relative to Llama-2 models, with error rates on MMLU, GSM8K, HumanEval, and the Gauntlet v0.3 Core Average all lying on or below the scaling line of the Llama-2 models. Notably, the baseline model outperformed the Llama-2 7B model on GSM8K and HumanEval despite being trained on half the number of tokens (1T vs. 2T).
Impact of Domain Upsampling
Domain upsampling applied during the final 20% of training notably improved model performance across challenging benchmarks, achieving scores competitive with Llama-2 (7B) but with approximately half the training FLOPs. This intervention particularly boosted scores on GSM8K and HumanEval by 8.26 pp and 6.17 pp, respectively.
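The claim of approximately half the training FLOPs follows directly from the token budgets. Using the standard approximation of about 6 FLOPs per parameter per training token, a 7B model trained on 1T tokens consumes roughly half the compute of the same-size model trained on Llama-2's 2T tokens; the snippet below is only a back-of-the-envelope check of that arithmetic.

```python
def train_flops(params: float, tokens: float) -> float:
    """Standard ~6 * N * D approximation for dense transformer training FLOPs."""
    return 6 * params * tokens


ours = train_flops(7e9, 1e12)    # 7B parameters, 1T tokens
llama2 = train_flops(7e9, 2e12)  # Llama-2 7B, 2T tokens
print(f"{ours:.2e} vs {llama2:.2e} -> ratio {ours / llama2:.2f}")
# ~4.2e+22 vs ~8.4e+22 FLOPs, i.e. roughly half the compute
```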
Ablation Study
Ablating the duration of domain upsampling showed that extending it beyond 20% of training continued to lift some targeted tasks but degraded general language modeling capabilities. Durations of 10%-20% provided the best trade-off, enhancing targeted benchmarks without compromising other domains.
Dataset Utility Characterization
Domain upsampling also proved useful in isolating the impact of individual datasets. For example, removing math-related datasets during the upsampling phase resulted in lower performance on math and reasoning benchmarks, validating their crucial role in these areas.
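Because only the final upsampling phase needs to be rerun from an earlier checkpoint, such ablations are relatively cheap. The sketch below shows the basic bookkeeping: drop one domain from the upsampled mix, renormalize the remaining weights, and resume training from the checkpoint taken just before the upsampling phase. The weights and domain names are hypothetical; the paper's actual ablations (e.g., removing math-heavy subsets) would follow the same pattern.

```python
def drop_domain(mix: dict[str, float], domain: str) -> dict[str, float]:
    """Remove one domain from a sampling mix and renormalize the rest."""
    remaining = {k: w for k, w in mix.items() if k != domain}
    total = sum(remaining.values())
    return {k: w / total for k, w in remaining.items()}


# Hypothetical upsampled mix; weights and names are illustrative only.
upsampled_mix = {
    "large_scale_cc": 0.25,
    "small_scale_cc": 0.15,
    "domain_specific_math": 0.20,
    "domain_specific_other": 0.15,
    "code": 0.25,
}

# Rerun only the final upsampling phase (e.g., the last 20% of training)
# from the pre-upsampling checkpoint with math data removed, then compare
# GSM8K and other benchmark scores against the full-mix run.
no_math_mix = drop_domain(upsampled_mix, "domain_specific_math")
print(no_math_mix)
```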
Implications and Future Directions
The implications of this research are multifaceted:
- Cost Efficiency: Domain upsampling offers a cost-effective approach to enhance model performance on targeted benchmarks, potentially steering future experimentation with pretraining datasets.
- Dataset Characterization: This technique allows researchers to isolate and understand the impact of specific datasets on model capabilities, guiding more informed dataset selection in pretraining.
- Optimizing Pretraining Mixtures: The findings suggest that domain upsampling can navigate the trade-off between general-purpose capabilities and domain-specific improvements, offering a scalable strategy for pretraining LLMs.
Future work could refine the domain upsampling proportions to extend these gains to a broader set of benchmarks. Additionally, combining domain upsampling with other dataset optimization algorithms could expand its utility for pretraining LLMs at even larger scales.
Overall, domain upsampling offers a cost-effective way to make better strategic use of diverse data sources during pretraining, yielding models with stronger performance across a wide range of tasks.