- The paper introduces a novel latent-skills framework that separates shared skill scaling from family-specific variations in training and data processing.
- It models skills with a translog production function whose coefficients are shared across families, capturing interactions between model size and training tokens, and maps skills to benchmark scores via learned factor loadings.
- Experimental evaluations on 12 benchmarks demonstrate Sloth’s superior predictive accuracy and interpretability across diverse LLM families.
The paper introduces Sloth (Skills Scaling Laws), a novel approach for predicting the performance of LLMs across different benchmarks and model families. Existing scaling laws struggle to generalize across model families due to variations in training and data processing. Family-specific scaling laws, while more accurate, are computationally expensive, requiring training models of various sizes within each family. Sloth addresses these limitations by introducing a scaling law based on latent skills rather than directly on benchmark performance.
The core idea is that LLM performance is driven by a small set of underlying skills, such as reasoning, knowledge, and instruction following. These skills are influenced by computational resources like model size (number of parameters) and training data size (number of tokens). Sloth assumes that the relationship between computational resources and skill development is consistent across model families, with the only difference being the efficiency of each family in converting resources into skill levels. This efficiency is captured by a family-specific intercept term.
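In symbols (a schematic rendering; the notation is mine rather than the paper's exact parameterization), a translog skill function for family $f$, with one such equation per latent skill, would read:

$$
\theta_f(N, D) = \alpha_f + \beta_1 \log N + \beta_2 \log D + \beta_3 (\log N)^2 + \beta_4 (\log D)^2 + \beta_5 \log N \cdot \log D,
$$

where $N$ is the number of parameters, $D$ the number of training tokens, $\alpha_f$ the family-specific intercept, and the $\beta$ coefficients are shared across all families.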
The Sloth model predicts benchmark performance as a function of these latent skills. The model involves several key components (a code sketch of the full forward pass follows the list):
- Latent Skills (θ): These are the underlying skills assumed to drive performance. A smaller number of skills than the number of benchmarks (d ≪ J) leads to a more parsimonious and interpretable model. These skills are modeled using a translog production function, commonly used in economics, allowing for interactions between model size and training tokens in their impact on skill development. Crucially, the coefficients of the translog function (except for the intercept) are shared across families, reflecting the assumption that skills scale similarly with compute resources across different LLMs.
- Factor Loadings (Λ): A matrix that maps the latent skills to benchmark performance. It captures the correlation structure between benchmarks, indicating which benchmarks measure similar or distinct skills. This component is analogous to factor loadings in factor analysis.
- Bias Term (b): A benchmark-specific bias term.
- Activation Function (σ): A benchmark-specific, trainable, monotonically increasing neural network. This allows for flexibility in modeling the non-linear relationship between skills and performance. A simpler sigmoid function can also be used.
- Lower Asymptote (γ): A benchmark-specific parameter accounting for the probability of random correct answers.
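To make the pieces concrete, here is a minimal numpy sketch of the forward pass, using the simpler sigmoid activation mentioned above; the function names, shapes, and toy values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def translog_skills(log_n, log_d, alpha_f, beta):
    """Latent skills theta from log(params) and log(tokens).

    alpha_f: family-specific intercepts, shape (d,)
    beta:    translog coefficients shared across families, shape (d, 5)
    """
    feats = np.array([log_n, log_d, log_n**2, log_d**2, log_n * log_d])
    return alpha_f + beta @ feats  # shape (d,)

def predict_benchmarks(theta, Lam, b, gamma):
    """Benchmark scores: gamma + (1 - gamma) * sigmoid(Lam @ theta + b).

    Lam:   factor loadings, shape (J, d)
    b:     benchmark-specific biases, shape (J,)
    gamma: benchmark-specific lower asymptotes, shape (J,)
    """
    z = Lam @ theta + b
    sigma = 1.0 / (1.0 + np.exp(-z))  # sigmoid stand-in for the trainable activation
    return gamma + (1.0 - gamma) * sigma

# Toy dimensions: d = 3 latent skills, J = 12 benchmarks.
rng = np.random.default_rng(0)
theta = translog_skills(np.log(7e9), np.log(2e12),
                        alpha_f=rng.normal(size=3),
                        beta=rng.normal(size=(3, 5)))
scores = predict_benchmarks(theta,
                            Lam=rng.normal(size=(12, 3)),
                            b=rng.normal(size=12),
                            gamma=np.full(12, 0.25))
```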
Sloth's parameters are learned by minimizing the Huber loss between the predicted and actual benchmark scores across multiple model families. The paper provides a theoretical proof of the identifiability of the model parameters under certain conditions (fixed, invertible activation function and fixed lower asymptote).
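A minimal PyTorch sketch of that fitting procedure follows, assuming the sigmoid activation and a fixed lower asymptote (matching the identifiability conditions); the tensor shapes, optimizer choice, and synthetic data are assumptions for illustration.

```python
import torch

# Toy setup: F families, d latent skills, J benchmarks, synthetic observations.
F, d, J, n = 4, 3, 12, 40
log_nd = torch.randn(n, 2)              # [log params, log tokens] per model
family = torch.randint(0, F, (n,))      # family index of each model
scores = torch.rand(n, J)               # observed benchmark scores in [0, 1]

alpha = torch.randn(F, d, requires_grad=True)  # family-specific intercepts
beta = torch.randn(d, 5, requires_grad=True)   # shared translog coefficients
Lam = torch.randn(J, d, requires_grad=True)    # factor loadings
bias = torch.zeros(J, requires_grad=True)      # benchmark biases
gamma = torch.full((J,), 0.25)                 # lower asymptotes, held fixed

opt = torch.optim.Adam([alpha, beta, Lam, bias], lr=1e-2)
huber = torch.nn.HuberLoss()

for step in range(2000):
    ln, ld = log_nd[:, 0], log_nd[:, 1]
    feats = torch.stack([ln, ld, ln**2, ld**2, ln * ld], dim=1)   # (n, 5)
    theta = alpha[family] + feats @ beta.T                        # (n, d)
    pred = gamma + (1 - gamma) * torch.sigmoid(theta @ Lam.T + bias)
    loss = huber(pred, scores)
    opt.zero_grad()
    loss.backward()
    opt.step()
```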
The authors evaluate Sloth on a dataset of LLM performance across 12 benchmarks from the Open LLM Leaderboard v1/v2. They demonstrate that Sloth can accurately predict the performance of larger models, even when only the smallest model of a given family is available during training. They compare Sloth against several baseline models, including those based solely on FLOPs and a PCA-based approach, showing Sloth’s superior performance.
Furthermore, they showcase Sloth's interpretability by analyzing the learned factor loadings and the relationship between computational resources and skills. This analysis reveals which skills matter most for each benchmark and how different resources influence skill development. Finally, they demonstrate Sloth on downstream tasks such as coding and emotional intelligence: latent skills are first predicted with Sloth, and a second-stage model then maps those predicted skills to downstream performance (sketched below). This two-stage approach allows performance on complex tasks to be estimated from relatively few observations.
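As a sketch of that two-stage idea (with toy data, and a plain linear regressor standing in for whatever second-stage model the authors actually use):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stage 1 (assumed already done): Sloth yields latent skills for a few observed
# models and extrapolated skills for a larger, unobserved model.
skills_obs = rng.random((8, 3))     # skills of 8 observed models (toy values)
coding_obs = rng.random(8)          # their downstream scores, e.g. on coding
skills_new = rng.random((1, 3))     # Sloth-predicted skills of a larger model

# Stage 2: fit skills -> downstream score on the few observations, then apply
# the fitted map to the predicted skills of the larger model.
reg = LinearRegression().fit(skills_obs, coding_obs)
coding_new_hat = reg.predict(skills_new)
```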