Introduction
Scaling laws constitute a substantial area of research in the development of LLMs, offering crucial guidance on many aspects of model development, from architectural decisions to training data selection. Existing work has predominantly focused on identifying these laws for upstream metrics, such as perplexity or cross-entropy loss on the pretraining corpus. Increasingly, however, attention has shifted toward scaling laws in the transfer learning setting, particularly for specialized downstream tasks such as machine translation.
Scaling Laws and Transfer Learning
In transfer learning, where pretrained LLMs are finetuned on a target task, the relationship between the volume of pretraining data and downstream task performance has come under close scrutiny. This paper proposes two scaling laws: a log-law for translation quality as measured by the BLEU score, and a power-law for downstream cross-entropy, an indicator of model performance on unseen data. Both laws fit the empirical data well, demonstrating that scaling laws can move from theoretical constructs to practical, applicable tools in AI research.
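As a concrete illustration, the sketch below fits one plausible instantiation of the two laws to synthetic measurements using SciPy: a log-law of the form (log(A · D_p^α))^β for BLEU and a power-law E + A / D_p^α for downstream cross-entropy, where D_p denotes pretraining data size. The functional forms, parameter names, and data points are assumptions chosen for demonstration, not the paper's exact parameterization or results.

```python
# Illustrative sketch: fitting a log-law for BLEU and a power-law for downstream
# cross-entropy against pretraining data size D_p. The functional forms, initial
# guesses, and data points below are assumptions for demonstration only.
import numpy as np
from scipy.optimize import curve_fit

def bleu_log_law(d_p, A, alpha, beta):
    # Hypothetical log-law: BLEU grows roughly like (log(A * d_p**alpha)) ** beta.
    # Clip the log term so the optimizer never sees NaNs for non-integer beta.
    return np.clip(np.log(A * d_p**alpha), 1e-9, None) ** beta

def ce_power_law(d_p, E, A, alpha):
    # Hypothetical power-law: cross-entropy decays toward an irreducible floor E.
    return E + A / d_p**alpha

# Synthetic observations at a few pretraining data sizes (in tokens).
d_p  = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
bleu = np.array([18.0, 22.5, 26.0, 28.5, 30.2])   # hypothetical BLEU scores
ce   = np.array([2.10, 1.85, 1.70, 1.62, 1.58])   # hypothetical cross-entropy

bleu_params, _ = curve_fit(bleu_log_law, d_p, bleu, p0=[1.0, 0.3, 2.0],
                           bounds=([1e-12, 1e-6, 1e-6], [1e6, 2.0, 10.0]))
ce_params, _ = curve_fit(ce_power_law, d_p, ce, p0=[1.5, 10.0, 0.2])

# Extrapolate both laws to a larger pretraining budget.
print("Predicted BLEU at 3e10 tokens:", bleu_log_law(3e10, *bleu_params))
print("Predicted CE at 3e10 tokens:  ", ce_power_law(3e10, *ce_params))
```

Once fitted on a handful of checkpoints, such curves can be extrapolated to larger pretraining budgets, which is what makes the laws useful for planning rather than just description.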
Insights and Practical Implications
The research finds that the alignment between the pretraining and downstream data distributions significantly impacts model performance. With well-aligned distributions, both BLEU score and downstream cross-entropy improve monotonically as pretraining data increases. With moderately misaligned data, however, the BLEU score fluctuates non-monotonically even as cross-entropy continues to improve, a signal that cross-entropy may not always serve as a reliable proxy for downstream task performance. The paper also observes that once the downstream dataset exceeds a certain size, the benefits of pretraining diminish, to the point where pretraining offers no tangible improvement over models trained solely on the downstream task.
Evaluating Relevance and Selection of Pretraining Data
A practical byproduct of this inquiry is a set of guidelines for evaluating the relevance of pretraining data with respect to a specific downstream task. These guidelines leverage the proposed scaling laws to inform decisions about whether to continue pretraining and to predict downstream task performance. Such assessment tools play a pivotal role in efficient resource allocation and model development, especially given the overhead costs of pretraining large models.
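To make this concrete, here is a minimal sketch of how such a guideline could look in practice, assuming the hypothetical log-law form used above: fit the law to BLEU scores observed at a few pretraining checkpoints, treat the fit quality as a rough alignment signal (non-monotonic BLEU tends to fit poorly), and extrapolate to a larger pretraining budget before committing more compute. The thresholds, data, and decision rule are illustrative assumptions, not the paper's prescribed procedure.

```python
# Illustrative decision helper: fit the hypothetical log-law to BLEU observed at a
# few pretraining checkpoints, check fit quality, and extrapolate to a larger
# budget. Thresholds and data are assumptions for demonstration only.
import numpy as np
from scipy.optimize import curve_fit

def bleu_log_law(d_p, A, alpha, beta):
    # Same hypothetical log-law as in the earlier sketch.
    return np.clip(np.log(A * d_p**alpha), 1e-9, None) ** beta

def assess_pretraining_value(d_p, bleu, target_budget, min_r2=0.95, min_gain=0.5):
    """Fit the log-law, report fit quality, and project BLEU at target_budget."""
    params, _ = curve_fit(bleu_log_law, d_p, bleu, p0=[1.0, 0.3, 2.0],
                          bounds=([1e-12, 1e-6, 1e-6], [1e6, 2.0, 10.0]))
    pred = bleu_log_law(d_p, *params)
    ss_res = np.sum((bleu - pred) ** 2)
    ss_tot = np.sum((bleu - np.mean(bleu)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    projected = bleu_log_law(target_budget, *params)
    gain = projected - bleu[-1]
    # A poor fit (e.g., non-monotonic BLEU) suggests misaligned pretraining data;
    # a good fit with a worthwhile projected gain supports continued pretraining.
    if r2 < min_r2:
        verdict = "log-law fits poorly; pretraining data may be misaligned"
    elif gain >= min_gain:
        verdict = f"continue pretraining; projected BLEU gain ~{gain:.1f}"
    else:
        verdict = "diminishing returns; more pretraining looks unprofitable"
    return r2, projected, verdict

d_p  = np.array([1e8, 3e8, 1e9, 3e9])       # tokens seen so far (hypothetical)
bleu = np.array([18.0, 22.5, 26.0, 28.5])   # BLEU at those checkpoints (hypothetical)
print(assess_pretraining_value(d_p, bleu, target_budget=3e10))
```

The appeal of this kind of check is that it uses only a few early checkpoints, so the go/no-go decision on further pretraining can be made before most of the compute is spent.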
Final Thoughts
Scaling laws for the downstream performance of LLMs offer a quantifiable way to forecast model behavior, centered on how effectively pretraining data transfers to the target task. By extending scaling laws beyond upstream metrics, this paper lays the groundwork for their broader adoption and invites future research that could reshape practices in AI and machine learning. As the field continues to expand, these insights may prompt revised methodologies for training LLMs, emphasizing a calculated balance between data alignment, pretraining quantities, and downstream task requirements.