Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions (2503.03862v2)

Published 5 Mar 2025 in cs.CL and cs.AI

Abstract: Improvements in LLM capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.

This paper, "Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions" (Liu et al., 5 Mar 2025), investigates the factors beyond model size and training data volume that influence the performance of LLMs on downstream tasks. The authors argue that traditional scaling laws, which primarily consider the number of parameters ($N$) and training tokens ($D$), are insufficient to fully explain performance variations, as smaller models with specific design choices can outperform larger ones.

To address this, the researchers curated a database of 92 open-source pretrained decoder-only LLMs (ranging from 11M to 110B parameters), meticulously documenting their architectural details, pretraining data composition, and even features derived from text generated by the models themselves ("free-generations").

Key Methodological Steps:

  1. Database Creation:
    • Included only distinct pretrained base models (decoder-only transformers).
    • Required publicly available metadata, including parameter count and training token count.
    • Models spanned from 2019-2024.
  2. Feature Engineering:
    • Architectural Features ($\mathcal{A}$): Total parameters, number of layers, embedding/feed-forward dimensions, layer normalization type (e.g., RMSNorm), attention variant, positional embedding type (e.g., RoPE, ALiBi), presence of biases, sequence length.
    • Data Features ($\mathcal{D}$): Total training tokens, percentage breakdown of tokens from domains such as web, code, books, reference, and academic (based on a defined taxonomy), and the proportion of English-language tokens.
    • Free-generation Features ($\mathcal{F}$): To compensate for missing data-composition details, each model was prompted with only a BOS token to generate text (see the generation sketch after this list). These generations (5-10k per model) were analyzed for:
      • Domain classification (e.g., % of web-like generations, % of code-like generations) using GPT-4o-mini.
      • Low-level statistics (e.g., words per sentence, constituency tree depth, dependency length, ratio of question words, % of English generations).
  3. Evaluation Suite:
    • Model performance was assessed on 12 diverse LLM benchmarks covering commonsense reasoning (ANLI, HellaSwag, Winogrande, XNLI), math/logic (GSM8K, LogiQA2, MathQA), general knowledge (ARC Challenge, Lambada, MMLU), and other tasks (TruthfulQA, HumanEval).
    • Metrics included accuracy (pass@1 for HumanEval) and Brier score (for tasks where smaller models struggle with accuracy).
  4. Predictive Modeling (see the modeling sketch after this list):
    • XGBoost regression models were trained for each benchmark to predict performance using the collected features.
    • 3-fold cross-validation with nested inner cross-validation for hyperparameter tuning was used.
    • Predictors were evaluated using Mean Absolute Error (MAE).
    • Iterative Feature Selection: Features were added greedily to a base model (containing only log parameters and log tokens) if they reduced MAE.
    • Two main models were compared:
      • Scaling-Laws Model: Used only total parameters and total training tokens.
      • All-Features Model: Used the greedily selected optimal set of features.
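
The free-generation features in step 2 can be collected with standard tooling. Below is a minimal sketch, assuming a Hugging Face causal LM; the model name, sample count, and specific low-level statistics are illustrative stand-ins, not the authors' exact pipeline.

```python
# Sketch: collecting "free-generation" features by sampling from a bare BOS prompt.
# Model name and sample count are illustrative; the paper uses 5-10k generations per model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # any open decoder-only LM
NUM_SAMPLES = 100

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Prompt with the BOS token only, so the model "free-generates" from its prior.
bos = torch.tensor([[tokenizer.bos_token_id]])
generations = []
with torch.no_grad():
    for _ in range(NUM_SAMPLES):
        out = model.generate(
            bos, do_sample=True, max_new_tokens=128,
            pad_token_id=tokenizer.eos_token_id,
        )
        generations.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Two of the low-level statistics described above (others, such as constituency
# tree depth, would need a parser; domain labels would come from a separate
# classifier such as an LLM judge).
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def words_per_sentence(text: str) -> float:
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences) if sentences else 0.0

def question_word_ratio(text: str) -> float:
    words = [w.lower().strip(",.?!") for w in text.split()]
    return sum(w in QUESTION_WORDS for w in words) / max(len(words), 1)

features = {
    "avg_words_per_sentence": sum(map(words_per_sentence, generations)) / len(generations),
    "avg_question_word_ratio": sum(map(question_word_ratio, generations)) / len(generations),
}
print(features)
```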

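The predictive-modeling setup in step 4 can be sketched as follows. This is a minimal illustration, assuming the per-model features and benchmark scores have been assembled into a table; the file name, column names, and hyperparameter grid are hypothetical, and the greedy forward-selection loop is simplified relative to the paper's procedure.

```python
# Sketch: predicting one benchmark score from model features with XGBoost,
# using outer 3-fold CV for evaluation and nested inner CV for tuning.
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from xgboost import XGBRegressor

df = pd.read_csv("model_database.csv")  # hypothetical table: one row per pretrained model
scaling_features = ["log_params", "log_tokens"]
extra_features = ["pct_code", "pct_web", "pct_english", "uses_rope", "uses_rmsnorm"]
target = "hellaswag_score"              # one such column per benchmark

def nested_cv_mae(feature_cols):
    """Outer 3-fold CV estimates MAE; inner 3-fold CV tunes hyperparameters."""
    X, y = df[feature_cols].to_numpy(), df[target].to_numpy()
    inner = GridSearchCV(
        XGBRegressor(objective="reg:squarederror"),
        param_grid={"max_depth": [2, 3, 4], "n_estimators": [50, 100, 200]},
        scoring="neg_mean_absolute_error",
        cv=KFold(n_splits=3, shuffle=True, random_state=0),
    )
    outer = KFold(n_splits=3, shuffle=True, random_state=0)
    scores = cross_val_score(inner, X, y, scoring="neg_mean_absolute_error", cv=outer)
    return -scores.mean()

# Baseline "scaling-laws" predictor: parameters and tokens only.
baseline_mae = nested_cv_mae(scaling_features)

# Greedy forward selection: keep a feature only if it lowers MAE.
best_features, best_mae = list(scaling_features), baseline_mae
for feat in extra_features:
    mae = nested_cv_mae(best_features + [feat])
    if mae < best_mae:
        best_features, best_mae = best_features + [feat], mae

print(f"selected features: {best_features}")
print(f"relative MAE reduction vs. scale-only: {(baseline_mae - best_mae) / baseline_mae:.1%}")
```
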
Key Findings and Results:

  1. Improved Prediction with More Features:
    • The "all-features" predictor significantly outperformed the "scaling-laws-only" predictor on all 12 benchmarks, achieving a relative MAE reduction of 3-28%.
    • The largest improvements were seen in language modeling (Lambada, 28% improvement) and code generation (HumanEval, 15% improvement).
  2. Feature Importance (SHAP Analysis; a short attribution sketch follows this list):
    • Scaling factors (parameters and tokens) remained highly influential.
    • Percentage of Code in Pretraining: This was a crucial non-scaling feature.
      • Higher code proportions (above roughly 20-25%) boosted performance on code tasks like HumanEval but hurt natural language tasks (ARC Challenge, HellaSwag, Winogrande, Lambada).
      • A moderate code proportion (15-25%) appeared optimal for balancing performance across natural language and code tasks, refining prior estimates.
    • Other Data Domains:
      • Higher percentages of reference-like or question-loaded generations correlated with better accuracy on tasks like ARC Challenge and Winogrande. This suggests generation patterns can act as a "fingerprint" of pretraining data biases.
      • Models generating more web-like data tended to perform worse on TruthfulQA.
    • Architectural Decisions (Non-Scale): These had minor but sometimes significant effects.
      • Type of layer normalization (e.g., RMSNorm often better) and positional embeddings (e.g., RoPE often better than learned embeddings) were influential in some cases.
      • Embedding dimension was also important, though it's related to scale.
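
As referenced in finding 2, per-feature attributions like these can be computed with the shap library on a fitted predictor. The sketch below continues the hypothetical data frame and selected feature list from the modeling sketch above; it is illustrative, not the paper's exact analysis script.

```python
# Sketch: SHAP attributions for a fitted XGBoost performance predictor.
import numpy as np
import shap
from xgboost import XGBRegressor

X, y = df[best_features], df[target]  # from the previous sketch

model = XGBRegressor(objective="reg:squarederror", max_depth=3, n_estimators=100)
model.fit(X, y)

# TreeExplainer gives exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance: mean absolute SHAP value per feature.
importance = dict(zip(best_features, np.abs(shap_values).mean(axis=0)))
print(sorted(importance.items(), key=lambda kv: -kv[1]))

# Beeswarm plot, e.g. showing how higher pct_code pushes predicted code-task
# scores up while depressing predictions for some natural language tasks.
shap.summary_plot(shap_values, X)
```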

Practical Implications and Contributions:

  • Beyond Simple Scaling: The work empirically demonstrates that design decisions beyond just scaling parameters and data significantly impact LLM capabilities.
  • Actionable Insights for Developers:
    • Data Curation is Key: The composition of pretraining data, especially the amount of code, has a tangible and predictable effect on specific downstream skills. The 15-25% code ratio offers a practical guideline (an illustrative mixture config follows this list).
    • Generation Fingerprinting: Analyzing a model's free-generations can offer clues about its training data and potential performance biases, especially when full data details are unavailable. This can be a cheaper way to probe models.
    • Architectural Choices Matter: While data is dominant, choices like layer normalization and positional embeddings can provide performance edges.
  • Framework for Systematic Investigation: The paper provides a methodology and a public database for the community to systematically analyze and understand how various pretraining choices shape LLM performance.
  • Hypothesis Generation: The observational findings can guide more controlled, causal experiments by identifying promising axes of variation (e.g., specific data mixtures or architectural tweaks).
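
As a concrete illustration of the data-curation point above, here is a hypothetical pretraining mixture that respects the 15-25% code band. The domain names mirror the paper's taxonomy, but the exact proportions are invented for illustration; only the code fraction reflects the trade-off reported in the paper.

```python
# Hypothetical pretraining data mixture; only the code fraction reflects
# the 15-25% trade-off reported in the paper.
DATA_MIXTURE = {
    "web":       0.50,
    "code":      0.20,   # within the 15-25% band balancing language and code tasks
    "books":     0.10,
    "reference": 0.10,
    "academic":  0.10,
}
assert abs(sum(DATA_MIXTURE.values()) - 1.0) < 1e-9
```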

Limitations Acknowledged:

  • Sample Size: The sample of 92 models, with few models above the 50B-parameter scale, limits conclusions about very large models.
  • Observational Nature: The paper observes correlations; causal claims require controlled experiments.
  • Extrapolation: Tree-based models (XGBoost) are suited to interpolation within the observed data range rather than extrapolation beyond it.
  • Scope: Focused on dense, decoder-only, primarily English-language transformers, excluding MoEs, non-transformers, and instruction-tuned models. The feature set, while extensive, might miss some relevant details (e.g., optimization specifics).

In conclusion, the paper provides strong evidence that a more holistic view of LLM design, incorporating data composition and architectural nuances, is necessary for accurately predicting and understanding downstream performance. It offers a valuable resource and framework for practitioners aiming to build more effective LLMs by learning from the collective experience of the open-source community.

Authors (12)
  1. Emmy Liu (17 papers)
  2. Amanda Bertsch (14 papers)
  3. Lintang Sutawika (14 papers)
  4. Lindia Tjuatja (9 papers)
  5. Patrick Fernandes (32 papers)
  6. Lara Marinov (1 paper)
  7. Michael Chen (24 papers)
  8. Shreya Singhal (1 paper)
  9. Carolin Lawrence (29 papers)
  10. Aditi Raghunathan (56 papers)
  11. Kiril Gashteovski (19 papers)
  12. Graham Neubig (342 papers)