- The paper introduces a novel two-stage method that uses proxy model training to optimize domain sampling weights for LLMs.
- It employs a first-order approximation to reduce computational overhead compared to traditional second-order approaches.
- Empirical results demonstrate improved perplexity and few-shot reasoning, highlighting enhanced generalization for in-domain and out-of-domain tasks.
Overview of "DoGE: Domain Reweighting with Generalization Estimation"
In the paper "DoGE: Domain Reweighting with Generalization Estimation," the authors address the significant impact that pretraining-data composition has on the generalization abilities of LLMs. Despite the recognized importance of managing domain influence within training data, many current LLM pipelines still rely on heuristics to set the contribution of data sourced from each domain. This work proposes Domain Reweighting with Generalization Estimation (DoGE), a method that optimizes domain weights (the probabilities of sampling from each domain) in a principled manner.
Methodology
The DoGE framework is a two-stage process designed to enhance LLM generalization:
- Proxy Model Training: A small proxy model is first trained with a bi-level optimization algorithm that learns domain weights alongside the model parameters. The domain sampling weights are adjusted to minimize the average validation loss on a chosen set of target domains, which up-weights the domains that contribute most to learning on those targets.
- Larger Base Model Training: A larger base model is then trained by sampling from the training domains according to the domain weights obtained from the proxy run.
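The second stage's weighted sampling can be illustrated with a minimal sketch. The domain names and weight values below are hypothetical placeholders, not the actual SlimPajama domains or the weights learned in the paper:

```python
import numpy as np

# Hypothetical domains and stage-1 weights (must sum to 1); illustrative only.
domains = ["web", "code", "books", "wiki"]
weights = np.array([0.45, 0.25, 0.20, 0.10])

rng = np.random.default_rng(0)

def sample_domain_batch(batch_size):
    """Choose a source domain for each example in a batch, proportional to the weights."""
    return rng.choice(domains, size=batch_size, p=weights)

batch = sample_domain_batch(8)
print(list(batch))
```

In a real training loop, each sampled domain name would index into that domain's data shard to draw the next example.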
The innovation of DoGE lies in avoiding complex multi-step gradient calculations by adopting a first-order approximation of the domain-weight adjustments. This scales effectively and reduces the computational burden typically associated with second-order derivative calculations, a common limitation in earlier methods such as DoReMi.
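As a rough illustration of the first-order idea (a simplified sketch, not the authors' exact update rule): score each domain by the inner product between its training gradient and the validation gradient on the target domains, then up-weight aligned domains with an exponentiated-gradient step that keeps the weights on the probability simplex. All names and values here are assumptions for illustration:

```python
import numpy as np

def update_domain_weights(domain_grads, target_grad, weights, lr=1.0):
    """One first-order reweighting step (simplified sketch of the DoGE idea).

    A domain whose training gradient points in the same direction as the
    target-domain validation gradient gets a positive score and is up-weighted;
    a conflicting domain is down-weighted. The multiplicative update followed by
    normalization keeps the weights non-negative and summing to one.
    """
    scores = np.array([g @ target_grad for g in domain_grads])
    new_w = weights * np.exp(lr * scores)
    return new_w / new_w.sum()

# Toy example: 3 domains, 4-dimensional "gradients".
rng = np.random.default_rng(0)
target_grad = rng.normal(size=4)
domain_grads = [
    target_grad + 0.1 * rng.normal(size=4),  # well aligned with the target
    rng.normal(size=4),                      # unrelated
    -target_grad,                            # directly conflicting
]
w = np.full(3, 1.0 / 3.0)
w = update_domain_weights(domain_grads, target_grad, w, lr=0.1)
print(w)  # the aligned domain ends up with more weight than the conflicting one
```

In the actual method the gradients come from the proxy model during training; the sketch only shows why gradient alignment with the target domains drives the reweighting.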
Numerical Results
Empirical evaluations show that DoGE significantly improves the generalization abilities of models across a range of tasks. For instance, when evaluated on the SlimPajama dataset, DoGE achieved lower perplexity and higher few-shot reasoning accuracy on six downstream tasks compared to existing baselines. Notably, DoGE also improves generalization to out-of-domain (OOD) tasks that were not part of the pretraining corpus, consistently attaining lower test perplexity.
Across the tested scenarios, DoGE showed less sensitivity to the proxy model's capacity, requiring only a single proxy model for effective optimization. This contrasts with DoReMi, which relies on both a well-trained reference model and a proxy model, increasing cost and the risk of sub-optimal performance due to dependence on model capacity.
Implications and Future Directions
The implications of DoGE are twofold. Practically, it offers a scalable and efficient method for adjusting domain sampling in LLM training, thereby optimizing model performance for both in-domain and OOD tasks. Theoretically, it provides an insightful connection between domain contributions and generalization abilities, suggesting potential pathways for further enhancing model robustness and adaptability.
Future research could build on this framework by exploring alternative methods for estimating generalization contributions, scaling strategies for very large datasets, and extending domain-level reweighting to fine-grained instances within domains. Applying DoGE to highly unbalanced datasets or to tasks with dynamic target domains could also prove beneficial.
Conclusion
The paper presents DoGE as a notable advance in optimizing the training-data composition of LLMs, addressing the challenges of domain weighting with a method that is both computationally efficient and effective at improving generalization. Whether applied to universal generalization settings or more nuanced OOD scenarios, DoGE sets a precedent for future work on dataset composition, promising a more principled approach to LLM pretraining.