Optimizing LLM Training Through Adaptive Data Domain Weighting
Introduction to DoReMi
Training data composition significantly influences the performance of large language models (LMs). The paper introduces Domain Reweighting with Minimax Optimization (DoReMi), a method that optimizes the mixture proportions (domain weights) of pretraining data domains to improve LM performance across a broad range of tasks. DoReMi adjusts domain weights dynamically, without requiring prior knowledge of downstream tasks, thereby streamlining LM pretraining. Its validation is presented through experiments on The Pile and the GLaM dataset, showing that it reduces perplexity, accelerates convergence, and matches or outperforms models trained with domain weights tuned on downstream tasks, at a fraction of the compute typically required.
Methodology Breakdown
DoReMi begins by training a small reference model on initial reference domain weights, which may be chosen heuristically or set proportional to domain sizes. It then trains a small proxy model with Group Distributionally Robust Optimization (Group DRO) over domains. Unlike typical DRO setups, which use the robust model directly, DoReMi discards the proxy model and instead extracts the optimized domain weights, using them to re-weight the pretraining data for training the larger, full-sized model.
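Concretely, the Group DRO objective the proxy model is trained against can be written as a minimax over worst-case excess loss (notation adapted here: $\theta$ the proxy parameters, $\alpha$ the domain weights on the $k$-simplex, $\ell_i$ the loss on domain $i$, and $\theta_{\mathrm{ref}}$ the fixed reference model):

```latex
\min_{\theta} \; \max_{\alpha \in \Delta^{k}} \;
L(\theta, \alpha) :=
\sum_{i=1}^{k} \alpha_i \left[\, \ell_i(\theta) - \ell_i(\theta_{\mathrm{ref}}) \,\right]
```

The inner maximization pushes weight toward domains where the proxy lags the reference; the averaged $\alpha$ over training becomes the output domain weights.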
Central to DoReMi is its dynamic adjustment of domain weights based on per-domain excess loss, i.e., the proxy model's loss minus the reference model's loss on each domain. This focuses training on domains where learning is lagging, encouraging a balanced performance uplift across all domains rather than overfitting to a specific subset.
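The weight update can be sketched as a multiplicative-weights (exponentiated-gradient) step on the excess losses, smoothed toward uniform for stability. This is a minimal illustration, not the authors' implementation; the function name, learning rate, and smoothing constant are assumptions:

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, ref_losses, lr=1.0, smoothing=1e-3):
    """One multiplicative-weights step: upweight domains where the proxy
    model's loss exceeds the reference model's loss (the excess loss)."""
    excess = np.maximum(np.asarray(proxy_losses) - np.asarray(ref_losses), 0.0)
    logits = np.log(weights) + lr * excess        # exponentiated-gradient step
    new_w = np.exp(logits - logits.max())         # subtract max for numerical stability
    new_w /= new_w.sum()                          # renormalize onto the simplex
    k = len(weights)
    return (1 - smoothing) * new_w + smoothing / k  # mix with uniform for stability

# Three hypothetical domains; domain 1 has the largest excess loss.
w = np.full(3, 1 / 3)
w = update_domain_weights(w, proxy_losses=[2.0, 3.5, 2.2], ref_losses=[2.1, 2.5, 2.2])
```

After the step, domain 1 receives the largest weight while the weights still sum to one, so subsequent sampling emphasizes the domain where the proxy is furthest behind the reference.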
Empirical Validation and Insights
Empirically, DoReMi improves average downstream accuracy and reaches baseline accuracy in considerably fewer training steps on both The Pile and the GLaM dataset.
- On The Pile, DoReMi presents a notable reduction in perplexity across all domains, enhances average downstream accuracy by 6.5 percentage points on generative few-shot tasks, and reaches baseline downstream accuracy 2.6 times faster than the baseline configuration.
- When applied to the GLaM dataset, DoReMi finds domain weights that perform comparably to weights explicitly tuned on downstream tasks, despite the process never being exposed to any downstream task.
Theoretical and Practical Implications
This research has multifaceted implications for the broader LM and AI community. Theoretically, it elucidates the significant impact of data domain proportions on model performance and presents a robust method to optimize these proportions in a principled manner. Practically, DoReMi offers a tangible strategy to enhance the training efficiency of large-scale LMs without the prohibitive computational cost often associated with domain weight optimization using downstream tasks.
Future Directions
The paper outlines several avenues for further research, including the effects of varying proxy model sizes and the feasibility of transferring domain weights across model sizes. Additionally, iterated DoReMi, in which domain weight optimization is refined over successive rounds, is positioned as a potentially fruitful direction.
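Iterated DoReMi amounts to a simple feedback loop: each round's optimized weights become the next round's reference weights. A minimal sketch, where `run_doremi` is a hypothetical callable wrapping one full reference-then-proxy cycle:

```python
def iterated_doremi(initial_weights, rounds, run_doremi):
    """Sketch of iterated DoReMi: feed each round's optimized domain
    weights back in as the next round's reference weights."""
    weights = initial_weights
    for _ in range(rounds):
        weights = run_doremi(reference_weights=weights)
    return weights

# Stub standing in for a full DoReMi round, just to illustrate the
# fixed-point behavior: weights drift toward some target mixture.
target = [0.6, 0.4]
stub = lambda reference_weights: [(w + t) / 2 for w, t in zip(reference_weights, target)]
final = iterated_doremi([0.5, 0.5], rounds=4, run_doremi=stub)
```

If the per-round optimization is a contraction toward a stable mixture, iterating converges the weights; whether that holds for real training runs is exactly the open question the paper raises.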
Conclusion
DoReMi marks a significant step forward in data-centric approaches to LM training, underscoring the critical role of training data composition. By efficiently optimizing domain weights in a task-agnostic manner, it promises enhanced LM performance and training efficiency—key metrics in the rapidly advancing field of generative AI and LLMs.