Establishing Task Scaling Laws via Compute-Efficient Model Ladders (2412.04403v1)

Published 5 Dec 2024 in cs.CL and cs.AI

Abstract: We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to deteriorate predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.

Summary

  • The paper introduces a two-step prediction method: small-scale "ladder" models are used to fit a mapping from model and data size to a task-specific loss, and from that task loss to task performance.
  • The approach predicts the accuracy of the 7B and 13B target models within 2 points of absolute error on four multiple-choice tasks, while the ladder models cost only about 1% of the target models' training compute.
  • Ablation studies validate the design choices and highlight the method's potential for resource-efficient scaling studies and performance prediction.

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

This paper addresses the challenge of predicting the performance of pretrained language models (LMs) on specific tasks, particularly in overtrained settings. Traditional power laws that predict language modeling loss fall short of accurately modeling task performance. The authors therefore propose a two-step prediction approach built on "model ladders" to predict task-specific performance metrics more efficiently than current methods.

The proposed methodology involves two steps. First, model size and data size are used to predict a task-specific intermediate quantity, the "task loss." Second, this task loss is used to predict task performance. By training a set of small-scale ladder models, which cost approximately 1% of the compute used for the target models, the authors collect the data points needed to fit the parameterized functions of these two steps.
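
To make the two-step structure concrete, the following is a minimal sketch of how the two parameterized functions could be fit with SciPy. It is not the authors' released code: the functional forms (a power law in model size N and token count D for the task loss, and a sigmoidal map from task loss to accuracy) are assumed, and the ladder measurements are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

# Step 1 (assumed form): task loss as a power law in model size N and tokens D,
# L(N, D) = A / N^alpha + B / D^beta + E
def task_loss(ND, A, alpha, B, beta, E):
    N, D = ND
    return A / N**alpha + B / D**beta + E

# Step 2 (assumed form): task accuracy as a sigmoidal function of task loss,
# Acc(L) = a / (1 + exp(-k * (L - L0))) + b
def task_accuracy(L, a, k, L0, b):
    return a / (1.0 + np.exp(-k * (L - L0))) + b

# Illustrative ladder-model measurements (parameters, tokens, task loss, accuracy).
N_ladder = np.array([190e6, 190e6, 370e6, 370e6, 760e6, 760e6, 1.3e9, 1.3e9])
D_ladder = np.array([1.9e9, 3.8e9, 3.7e9, 7.4e9, 7.6e9, 15.2e9, 13e9, 26e9])
loss_ladder = np.array([1.55, 1.45, 1.40, 1.30, 1.26, 1.18, 1.16, 1.08])
acc_ladder = np.array([0.30, 0.34, 0.37, 0.41, 0.44, 0.49, 0.51, 0.56])

# Fit step 1: (N, D) -> task loss.
p1, _ = curve_fit(task_loss, (N_ladder, D_ladder), loss_ladder,
                  p0=[1e3, 0.3, 1e3, 0.3, 0.5], maxfev=20000)

# Fit step 2: task loss -> task accuracy.
p2, _ = curve_fit(task_accuracy, loss_ladder, acc_ladder,
                  p0=[-0.75, 5.0, 1.3, 1.0], maxfev=20000)

# Chained prediction for an unseen (N, D) configuration.
def predict_accuracy(N_target, D_target):
    loss_hat = task_loss((N_target, D_target), *p1)
    return task_accuracy(loss_hat, *p2)
```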

The authors predict the performance of two target models: a 7-billion-parameter model trained on 4 trillion tokens and a 13-billion-parameter model trained on 5 trillion tokens. On four multiple-choice tasks formatted for ranked classification, the method predicts target-model accuracy within 2 points of absolute error. On four other tasks, however, the prediction error averages 6.9 points, which the authors largely attribute to higher variance in those tasks' metrics.
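
Continuing the sketch above (with illustrative fitted parameters, not the paper's actual fits), the chained prediction for the two target configurations would be obtained as follows:

```python
# Hypothetical usage of the chained predictor from the sketch above.
acc_7b = predict_accuracy(7e9, 4e12)     # 7B parameters trained to 4T tokens
acc_13b = predict_accuracy(13e9, 5e12)   # 13B parameters trained to 5T tokens
print(f"Predicted accuracy, 7B @ 4T tokens:  {acc_7b:.3f}")
print(f"Predicted accuracy, 13B @ 5T tokens: {acc_13b:.3f}")
```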

Several analyses were conducted to understand the dynamics of the proposed method. The predictability of different task types was assessed by measuring the standard deviation of task metrics over models' final checkpoints. A notable observation was that using less compute to train fewer ladder models leads to inferior prediction accuracy.
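
As an illustration of the variance analysis (the paper's exact procedure may differ), one can measure the spread of a task metric across a run's last few checkpoints; tasks whose spread approaches the prediction-error budget are inherently harder to predict:

```python
import numpy as np

# Illustrative accuracies of one model on one task at its final few checkpoints.
final_ckpt_acc = np.array([0.482, 0.471, 0.490, 0.476, 0.485])

# A standard deviation close to the ~2-point error budget flags the task as noisy.
print(f"mean={final_ckpt_acc.mean():.3f}, std={final_ckpt_acc.std(ddof=1):.3f}")
```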

The method's robustness is also evaluated through ablation studies on different design choices. These studies tested alternatives such as using total compute (FLOPs) in place of model and data sizes (N, D), as well as different intermediate loss metrics. The results consistently demonstrated the superiority of the two-step approach that leverages a task-specific loss to predict task performance.
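
For comparison, a compute-only variant of step 1 (an assumed form, using the common C ≈ 6ND approximation for training FLOPs and reusing the illustrative ladder data from the earlier sketch) collapses the two scaling variables into one:

```python
# Assumed single-variable alternative: fit task loss against compute C ~= 6 * N * D.
def task_loss_from_compute(C, A, alpha, E):
    return A / C**alpha + E

C_ladder = 6.0 * N_ladder * D_ladder   # approximate training FLOPs per ladder run
pc, _ = curve_fit(task_loss_from_compute, C_ladder, loss_ladder,
                  p0=[100.0, 0.1, 0.5], maxfev=20000)
```

Because this form cannot distinguish a small model trained on many tokens from a large model trained on few, it is a natural baseline to ablate against the (N, D) parameterization.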

Implications and Future Work

This work has both practical and theoretical implications. Practically, the approach allows for efficient resource allocation in LLM experimentation by predicting large models' task performance without bearing substantial computational costs. Theoretically, the paper provides a foundation for understanding how task performance emerges as a function of model scale and data size.

Future work can explore extending predictions to more varied task settings, including multiple-choice formats other than ranked classification and generative tasks. It can also scale the methodology to larger LMs, potentially sharpening the understanding of scaling laws and refining the principles that govern efficient pretraining and task adaptation. In conclusion, this two-step method offers a promising direction for predictively navigating the landscape of LM task performance, enabling more informed and efficient model-training decisions in natural language processing.