Optimizing LLM Training Through Adaptive Data Domain Weighting
Introduction to DoReMi
Training data composition significantly influences the performance of large language models (LMs). The paper introduces Domain Reweighting with Minimax Optimization (DoReMi), a method that optimizes the mixture proportions (domain weights) of pretraining data domains to improve LM performance across a broad range of tasks. DoReMi adjusts domain weights dynamically, without requiring prior knowledge of downstream tasks, thereby streamlining LM pretraining. Its validation is presented through experiments on The Pile and the GLaM dataset, showing that it reduces perplexity, accelerates convergence, and matches or outperforms models trained with domain weights tuned on downstream tasks, at a fraction of the compute typically required.
Methodology Breakdown
DoReMi begins by training a small reference model on initial reference domain weights, which may be chosen heuristically or set proportional to domain sizes. It then trains a small proxy model with Group Distributionally Robust Optimization (Group DRO) over domains. Unlike typical DRO setups, which use the robust model directly, DoReMi discards the proxy model and instead extracts the optimized domain weights, using them to re-weight the pretraining data for training the larger, full-sized model.
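Concretely, the Group DRO objective the proxy model is trained against can be written as a minimax over worst-case excess loss (notation adapted here: $\theta$ the proxy parameters, $\alpha$ the domain weights on the $k$-simplex, $\ell_i$ the loss on domain $i$, and $\theta_{\mathrm{ref}}$ the fixed reference model):

```latex
\min_{\theta} \; \max_{\alpha \in \Delta^{k}} \;
L(\theta, \alpha) :=
\sum_{i=1}^{k} \alpha_i \left[\, \ell_i(\theta) - \ell_i(\theta_{\mathrm{ref}}) \,\right]
```

The inner maximization pushes weight toward domains where the proxy lags the reference; the averaged $\alpha$ over training becomes the output domain weights.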
Central to DoReMi is its dynamic adjustment of domain weights based on per-domain excess loss, i.e., the proxy model's loss minus the reference model's loss on each domain. This focuses training on domains where learning is lagging, encouraging a balanced performance uplift across all domains rather than overfitting to a specific subset.
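The weight update can be sketched as a multiplicative-weights (exponentiated-gradient) step on the excess losses, smoothed toward uniform for stability. This is a minimal illustration, not the authors' implementation; the function name, learning rate, and smoothing constant are assumptions:

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses, lr=1.0, smoothing=1e-3):
    """One multiplicative-weights step: upweight domains where the proxy
    model's loss exceeds the reference model's loss (the excess loss)."""
    excess = np.maximum(np.asarray(proxy_losses) - np.asarray(ref_losses), 0.0)
    logits = np.log(weights) + lr * excess        # exponentiated-gradient step
    new_w = np.exp(logits - logits.max())         # subtract max for numerical stability
    new_w /= new_w.sum()                          # renormalize onto the simplex
    k = len(weights)
    return (1 - smoothing) * new_w + smoothing / k  # mix with uniform for stability

# Three hypothetical domains; domain 1 has the largest excess loss.
w = np.full(3, 1 / 3)
w = update_domain_weights(w, proxy_losses=[2.0, 3.5, 2.2], ref_losses=[2.1, 2.5, 2.2])
```

After the step, domain 1 receives the largest weight while the weights still sum to one, so subsequent sampling emphasizes the domain where the proxy is furthest behind the reference.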
Empirical Validation and Insights
Empirically, DoReMi improves average downstream accuracy and reaches baseline accuracy in considerably fewer training steps on both The Pile and the GLaM dataset.
- On The Pile, DoReMi presents a notable reduction in perplexity across all domains, enhances average downstream accuracy by 6.5 percentage points on generative few-shot tasks, and reaches baseline downstream accuracy 2.6 times faster than the baseline configuration.
- When applied to the GLaM dataset, DoReMi finds domain weights that perform comparably to weights explicitly tuned on downstream tasks, despite the process never being exposed to any downstream task.
Theoretical and Practical Implications
This research has multifaceted implications for the broader LM and AI community. Theoretically, it elucidates the significant impact of data domain proportions on model performance and presents a robust method to optimize these proportions in a principled manner. Practically, DoReMi offers a tangible strategy to enhance the training efficiency of large-scale LMs without the prohibitive computational cost often associated with domain weight optimization using downstream tasks.
Future Directions
The paper outlines several avenues for further research, including the effects of varying proxy model sizes and the feasibility of transferring domain weights across model sizes. Additionally, iterated DoReMi, in which domain weight optimization is refined over successive rounds, is positioned as a potentially fruitful direction.
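Iterated DoReMi amounts to a simple feedback loop: each round's optimized weights become the next round's reference weights. A minimal sketch, where `run_doremi` is a hypothetical callable wrapping one full reference-then-proxy cycle:

```python
def iterated_doremi(initial_weights, rounds, run_doremi):
    """Sketch of iterated DoReMi: feed each round's optimized domain
    weights back in as the next round's reference weights."""
    weights = initial_weights
    for _ in range(rounds):
        weights = run_doremi(reference_weights=weights)
    return weights

# Stub standing in for a full DoReMi round, just to illustrate the
# fixed-point behavior: weights drift toward some target mixture.
target = [0.6, 0.4]
stub = lambda reference_weights: [(w + t) / 2 for w, t in zip(reference_weights, target)]
final = iterated_doremi([0.5, 0.5], rounds=4, run_doremi=stub)
```

If the per-round optimization is a contraction toward a stable mixture, iterating converges the weights; whether that holds for real training runs is exactly the open question the paper raises.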
Conclusion
DoReMi marks a significant step forward in data-centric approaches to LM training, underscoring the critical role of training data composition. By efficiently optimizing domain weights in a task-agnostic manner, it promises enhanced LM performance and training efficiency—key metrics in the rapidly advancing field of generative AI and LLMs.