Compute Optimal Scaling of Skills: Knowledge vs Reasoning (2503.10061v3)

Published 13 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.

Introduction

Compute-optimal scaling is a critical aspect of training LLMs, focusing on the efficient allocation of computational resources. Scaling laws provide a framework for predicting training outcomes and optimizing the balance between model parameters and training dataset size, allowing for the best possible performance within a given compute budget. The paper "Compute Optimal Scaling of Skills: Knowledge vs Reasoning" (Roberts et al., 13 Mar 2025) explores whether compute-optimal scaling is skill-dependent, specifically examining knowledge-based question answering (QA) and code generation tasks to determine whether different skills have different scaling requirements.
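
As a reference point for what follows, below is a minimal sketch of the Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta that skill-level scaling analyses typically fit to a sweep of training runs. The fitting routine, the single initialisation, and the use of SciPy here are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is the parameter count and D the number of training tokens.
def parametric_loss(N, D, E, A, alpha, B, beta):
    return E + A / N**alpha + B / D**beta

# Fit (E, A, alpha, B, beta) to observed (N, D, loss) triples from a sweep.
def fit_scaling_law(Ns, Ds, losses):
    def objective(theta):
        E, logA, alpha, logB, beta = theta
        pred = parametric_loss(Ns, Ds, E, np.exp(logA), alpha, np.exp(logB), beta)
        return np.mean((np.log(pred) - np.log(losses)) ** 2)
    # Illustrative single initialisation; real fits typically sweep many starts.
    theta0 = np.array([1.7, np.log(400.0), 0.34, np.log(410.0), 0.28])
    return minimize(objective, theta0, method="L-BFGS-B").x
```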

Skill-Dependent Scaling Laws

The central finding of the paper (Roberts et al., 13 Mar 2025) is that scaling laws in LLMs are indeed skill-dependent: the optimal way to scale model size and training data depends on the specific skills the model should acquire. The paper differentiates between knowledge-based QA tasks, which rely on memorization and information retrieval, and code generation tasks. Knowledge QA is capacity-hungry, meaning its performance benefits more from increasing the model's parameter count, which allows it to store and access more facts. Conversely, code generation is data-hungry: for a fixed compute budget, increasing the training dataset size yields a larger performance gain than increasing the model size, because code generation benefits from exposure to a wider range of examples and patterns, enabling the model to generalize and produce new, correct code.
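
To make the capacity-hungry versus data-hungry distinction concrete, the sketch below grid-searches the compute-optimal split between parameters N and tokens D under the common approximation C ≈ 6ND, for two hypothetical per-skill loss fits. All coefficients are invented for illustration; they are not the paper's fitted values.

```python
import numpy as np

# Given a per-skill loss L(N, D) = E + A/N**a + B/D**b and a budget C ~ 6*N*D,
# search over N (with D implied by the budget) for the compute-optimal split.
def compute_optimal_split(C, E, A, a, B, b, n_grid=20_000):
    Ns = np.logspace(7, 12, n_grid)   # candidate parameter counts
    Ds = C / (6.0 * Ns)               # tokens implied by the budget
    losses = E + A / Ns**a + B / Ds**b
    i = int(np.argmin(losses))
    return Ns[i], Ds[i]

C = 6e18  # the smaller budget scale discussed in the paper

# Hypothetical "capacity-hungry" knowledge-like skill: the model-size term dominates.
N_k, D_k = compute_optimal_split(C, E=1.8, A=600.0, a=0.30, B=300.0, b=0.30)
# Hypothetical "data-hungry" code-like skill: the data term dominates.
N_c, D_c = compute_optimal_split(C, E=1.5, A=300.0, a=0.30, B=600.0, b=0.30)

print(f"knowledge-like optimum: N ~ {N_k:.2e} params, D ~ {D_k:.2e} tokens")
print(f"code-like optimum:      N ~ {N_c:.2e} params, D ~ {D_c:.2e} tokens")
```

Under these toy fits, the knowledge-like skill picks a model roughly an order of magnitude larger (trained on correspondingly less data) than the code-like skill at the same budget, mirroring the paper's qualitative finding.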

These differences have implications for compute-optimal training strategies. Traditional approaches often seek a one-size-fits-all trade-off between model and dataset size. However, the paper suggests this is suboptimal. To optimize a model for both knowledge QA and code generation, consider training separate models for each skill and ensembling them, or carefully balancing dataset composition and model size. Awareness of skill-dependent scaling laws can lead to more effective LLM training.

Impact of Datamix on Skill-Dependent Scaling

To investigate whether the differences in scaling laws between knowledge and reasoning are due to the pretraining datamix, the paper (Roberts et al., 13 Mar 2025) conducts ablation studies that systematically vary the proportions of skill-relevant data (knowledge vs. code) in the pretraining mix. This isolates the impact of datamix composition on the optimal model and dataset sizes for different skills.
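
As a hypothetical illustration of how such an ablation grid can be organised (the mix ratios, model sizes, and benchmark names below are invented, not the paper's exact setup):

```python
from itertools import product

code_fractions = [0.05, 0.15, 0.30, 0.50]   # illustrative code shares of the datamix
model_sizes = [1e8, 3e8, 1e9, 3e9]          # illustrative parameter counts
compute_budget = 6e18                       # FLOPs, with tokens set by C ~ 6*N*D

runs = []
for code_frac, n_params in product(code_fractions, model_sizes):
    runs.append({
        "mix": {"code": code_frac, "knowledge_text": 1.0 - code_frac},
        "params": n_params,
        "tokens": compute_budget / (6 * n_params),
        # Every run is evaluated on both skill families, so the scaling
        # behaviour of each skill can be compared at every datamix.
        "eval_suites": ["knowledge_qa", "code_generation"],
    })
```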

The authors manipulated the pretraining data to contain varying ratios of knowledge-centric data (e.g., text, factual information) and code-centric data. By training models on these different datamixes and evaluating their performance on knowledge QA and code generation tasks, they assessed whether the relative performance of each skill changed as the datamix shifted. Even when the amount of knowledge-specific data was reduced, knowledge QA tasks still benefited more from increased model capacity, and even with less code data, code generation still scaled better with increased dataset size. This indicates that the differences in scaling aren't solely driven by the prevalence of each skill's data in the pretraining mix, but reflect intrinsic differences in the nature of the skills themselves.

The paper suggests that knowledge is inherently harder to compress than code. Knowledge often consists of a vast amount of unstructured factual information that requires significant model capacity to memorize and recall. Code, being more structured and rule-based, can be learned more efficiently from a given amount of data. This explains why knowledge-based tasks are more capacity-hungry and benefit more from larger models, even when controlling for the datamix.

The Role of the Validation Set

Standard compute-optimal scaling relies on a single validation set to guide the selection of optimal model and dataset sizes, with performance on this validation set serving as a proxy for the model's generalization ability. However, the paper (Roberts et al., 13 Mar 2025) demonstrates that this can be problematic when training models expected to perform well on skills with different scaling properties.

A misspecified validation set, one that does not accurately reflect the desired skill composition, can significantly shift the compute-optimal parameter count. For example, if the validation set is heavily weighted towards knowledge-based tasks while the target application requires a balance of skills, the scaling analysis may favor larger, more capacity-heavy models than is truly optimal. The paper quantifies this impact: at smaller compute budgets (around 6×10^18 FLOPs), the compute-optimal parameter count can vary by nearly 50% across datamixes and by more than 30% across validation sets of differing skill composition, with differences exceeding 10% even at larger scales.
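
The mechanism behind this sensitivity can be illustrated by reusing the hypothetical per-skill fits from the earlier sketch: weighting the validation loss towards one skill shifts which parameter count looks compute-optimal. The numbers are illustrative only and will not reproduce the paper's exact percentages.

```python
import numpy as np

# Weighted validation loss over two hypothetical skill-specific scaling fits.
def mixed_val_loss(N, C, w_knowledge):
    D = C / (6.0 * N)
    loss_knowledge = 1.8 + 600.0 / N**0.30 + 300.0 / D**0.30
    loss_code = 1.5 + 300.0 / N**0.30 + 600.0 / D**0.30
    return w_knowledge * loss_knowledge + (1.0 - w_knowledge) * loss_code

C = 6e18
Ns = np.logspace(7, 12, 20_000)
for w in (0.2, 0.5, 0.8):  # fraction of the validation set that is knowledge QA
    N_opt = Ns[int(np.argmin(mixed_val_loss(Ns, C, w)))]
    print(f"knowledge share {w:.1f} -> compute-optimal N ~ {N_opt:.2e}")
```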

Therefore, choosing a validation set that adequately represents the target skills is critical. If the goal is to train a model that excels in both knowledge QA and code generation, the validation set should include a representative sample of both task types. An alternative approach is to use skill-specific validation sets, monitoring the model's performance on separate validation sets for each skill and adjusting the model and dataset sizes to optimize performance on both.

Implications for Training LLMs

The findings of the paper (Roberts et al., 13 Mar 2025) have significant implications for training LLMs. The traditional approach of relying on a single aggregate performance estimator (APE) overlooks skill-dependent scaling behaviors, which can lead to suboptimal training and resource allocation, especially for models intended to excel across diverse skill sets.

Skill-aware scaling strategies are recommended. Instead of treating LLM training as a monolithic optimization problem, consider the specific skills desired and tailor the training process accordingly. If a model is intended for both knowledge-intensive tasks and code generation, the allocation of compute and data should reflect the distinct scaling properties of each skill, allocating more parameters for knowledge-intensive tasks and more data for code generation.

Careful validation set selection is also crucial, as a misspecified validation set can lead to significant deviation in the compute-optimal parameter count. Construct validation sets that accurately represent the desired skill composition of the final model or use skill-specific validation sets to monitor performance in particular skills. Monitoring performance on these skill-specific validation sets during training can provide insights into the scaling behavior of different skills and guide resource allocation.
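
One way to operationalise this is a small helper that reports held-out loss separately per skill during training. The PyTorch sketch below assumes a causal language model that maps token ids to logits and hypothetical per-skill data loaders; it is not code from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def skill_validation_losses(model, skill_loaders, device="cuda"):
    """Average next-token loss on one held-out set per skill,
    e.g. skill_loaders = {"knowledge_qa": ..., "code": ...} (hypothetical names)."""
    model.eval()
    losses = {}
    for skill, loader in skill_loaders.items():
        total, count = 0.0, 0
        for batch in loader:                  # batch of token ids, shape (B, T)
            batch = batch.to(device)
            logits = model(batch[:, :-1])     # assumes the model returns raw logits
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                batch[:, 1:].reshape(-1),
            )
            total += loss.item() * batch.size(0)
            count += batch.size(0)
        losses[skill] = total / count
    return losses
```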

Future research could explore more sophisticated skill-aware scaling techniques, such as adaptive training algorithms that dynamically adjust the training data mix or model architecture based on the observed scaling behavior of different skills. Methods for automatically identifying and characterizing the skills embedded in a given dataset could also be explored, as could the interplay between different skills and how they influence each other during training.

Conclusion

"Compute Optimal Scaling of Skills: Knowledge vs Reasoning" (Roberts et al., 13 Mar 2025 ) demonstrates that scaling laws are skill-dependent, with knowledge QA being more capacity-hungry and code generation being more data-hungry. It's crucial to consider skill-specific scaling when training LLMs and selecting validation sets, as a misspecified validation set can significantly impact the compute-optimal parameter count. Ignoring these skill-specific scaling laws can lead to suboptimal training decisions and inefficient resource allocation, underscoring the importance of skill-aware scaling strategies.

Authors (5)
  1. Nicholas Roberts (24 papers)
  2. Niladri Chatterji (7 papers)
  3. Sharan Narang (31 papers)
  4. Mike Lewis (78 papers)
  5. Dieuwke Hupkes (49 papers)