- The paper introduces a novel latent-skills framework that separates shared skill scaling from family-specific variations in training and data processing.
- It models skills with a translog production function whose coefficients are shared across families, capturing interactions between model size and training tokens, and maps skills to benchmark scores via learned factor loadings.
- Experimental evaluations on 12 benchmarks demonstrate Sloth’s superior predictive accuracy and interpretability across diverse LLM families.
The paper introduces Sloth (Skills Scaling Laws), a novel approach for predicting the performance of LLMs across different benchmarks and model families. Existing scaling laws struggle to generalize across model families due to variations in training and data processing. Family-specific scaling laws, while more accurate, are computationally expensive, requiring training models of various sizes within each family. Sloth addresses these limitations by introducing a scaling law based on latent skills rather than directly on benchmark performance.
The core idea is that LLM performance is driven by a small set of underlying skills, such as reasoning, knowledge, and instruction following. These skills are influenced by computational resources like model size (number of parameters) and training data size (number of tokens). Sloth assumes that the relationship between computational resources and skill development is consistent across model families, with the only difference being the efficiency of each family in converting resources into skill levels. This efficiency is captured by a family-specific intercept term.
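In symbols (a schematic rendering; the notation is mine rather than the paper's exact parameterization), a translog skill function for family $f$, with one such equation per latent skill, would read:

$$
\theta_f(N, D) = \alpha_f + \beta_1 \log N + \beta_2 \log D + \beta_3 (\log N)^2 + \beta_4 (\log D)^2 + \beta_5 \log N \cdot \log D,
$$

where $N$ is the number of parameters, $D$ the number of training tokens, $\alpha_f$ the family-specific intercept, and the $\beta$ coefficients are shared across all families.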
The Sloth model predicts benchmark performance as a function of these latent skills. The model involves several key components (a code sketch of the full forward pass follows the list):
- Latent Skills (θ): These are the underlying skills assumed to drive performance. A smaller number of skills than the number of benchmarks (d ≪ J) leads to a more parsimonious and interpretable model. These skills are modeled using a translog production function, commonly used in economics, allowing for interactions between model size and training tokens in their impact on skill development. Crucially, the coefficients of the translog function (except for the intercept) are shared across families, reflecting the assumption that skills scale similarly with compute resources across different LLMs.
- Factor Loadings (Λ): A matrix that maps the latent skills to benchmark performance. It captures the correlation structure between benchmarks, indicating which benchmarks measure similar or distinct skills. This component is analogous to factor loadings in factor analysis.
- Bias Term (b): A benchmark-specific bias term.
- Activation Function (σ): A benchmark-specific, trainable, monotonically increasing neural network. This allows for flexibility in modeling the non-linear relationship between skills and performance. A simpler sigmoid function can also be used.
- Lower Asymptote (γ): A benchmark-specific parameter accounting for the probability of random correct answers.
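To make the pieces concrete, here is a minimal numpy sketch of the forward pass, using the simpler sigmoid activation mentioned above; the function names, shapes, and toy values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def translog_skills(log_n, log_d, alpha_f, beta):
    """Latent skills theta from log(params) and log(tokens).

    alpha_f: family-specific intercepts, shape (d,)
    beta:    translog coefficients shared across families, shape (d, 5)
    """
    feats = np.array([log_n, log_d, log_n**2, log_d**2, log_n * log_d])
    return alpha_f + beta @ feats  # shape (d,)

def predict_benchmarks(theta, Lam, b, gamma):
    """Benchmark scores: gamma + (1 - gamma) * sigmoid(Lam @ theta + b).

    Lam:   factor loadings, shape (J, d)
    b:     benchmark-specific biases, shape (J,)
    gamma: benchmark-specific lower asymptotes, shape (J,)
    """
    z = Lam @ theta + b
    sigma = 1.0 / (1.0 + np.exp(-z))  # sigmoid stand-in for the trainable activation
    return gamma + (1.0 - gamma) * sigma

# Toy dimensions: d = 3 latent skills, J = 12 benchmarks.
rng = np.random.default_rng(0)
theta = translog_skills(np.log(7e9), np.log(2e12),
                        alpha_f=rng.normal(size=3),
                        beta=rng.normal(size=(3, 5)))
scores = predict_benchmarks(theta,
                            Lam=rng.normal(size=(12, 3)),
                            b=rng.normal(size=12),
                            gamma=np.full(12, 0.25))
```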
Sloth's parameters are learned by minimizing the Huber loss between the predicted and actual benchmark scores across multiple model families. The paper provides a theoretical proof of the identifiability of the model parameters under certain conditions (fixed, invertible activation function and fixed lower asymptote).
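A minimal PyTorch sketch of that fitting procedure follows, assuming the sigmoid activation and a fixed lower asymptote (matching the identifiability conditions); the tensor shapes, optimizer choice, and synthetic data are assumptions for illustration.

```python
import torch

# Toy setup: F families, d latent skills, J benchmarks, synthetic observations.
F, d, J, n = 4, 3, 12, 40
log_nd = torch.randn(n, 2)              # [log params, log tokens] per model
family = torch.randint(0, F, (n,))      # family index of each model
scores = torch.rand(n, J)               # observed benchmark scores in [0, 1]

alpha = torch.randn(F, d, requires_grad=True)  # family-specific intercepts
beta = torch.randn(d, 5, requires_grad=True)   # shared translog coefficients
Lam = torch.randn(J, d, requires_grad=True)    # factor loadings
bias = torch.zeros(J, requires_grad=True)      # benchmark biases
gamma = torch.full((J,), 0.25)                 # lower asymptotes, held fixed

opt = torch.optim.Adam([alpha, beta, Lam, bias], lr=1e-2)
huber = torch.nn.HuberLoss()

for step in range(2000):
    ln, ld = log_nd[:, 0], log_nd[:, 1]
    feats = torch.stack([ln, ld, ln**2, ld**2, ln * ld], dim=1)   # (n, 5)
    theta = alpha[family] + feats @ beta.T                        # (n, d)
    pred = gamma + (1 - gamma) * torch.sigmoid(theta @ Lam.T + bias)
    loss = huber(pred, scores)
    opt.zero_grad()
    loss.backward()
    opt.step()
```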
The authors evaluate Sloth on a dataset of LLM performance across 12 benchmarks from the Open LLM Leaderboard v1/v2. They demonstrate that Sloth can accurately predict the performance of larger models, even when only the smallest model of a given family is available during training. They compare Sloth against several baseline models, including those based solely on FLOPs and a PCA-based approach, showing Sloth’s superior performance.
Furthermore, they showcase Sloth's interpretability by analyzing the learned factor loadings and the relationship between computational resources and skills. This analysis reveals which skills matter most for each benchmark and how different resources influence skill development. Finally, they demonstrate Sloth on downstream tasks such as coding and emotional intelligence: latent skills are first predicted with Sloth, and a second-stage model then maps those predicted skills to downstream performance (sketched below). This two-stage approach allows performance on complex tasks to be estimated from relatively few observations.
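As a sketch of that two-stage idea (with toy data, and a plain linear regressor standing in for whatever second-stage model the authors actually use):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stage 1 (assumed already done): Sloth yields latent skills for a few observed
# models and extrapolated skills for a larger, unobserved model.
skills_obs = rng.random((8, 3))     # skills of 8 observed models (toy values)
coding_obs = rng.random(8)          # their downstream scores, e.g. on coding
skills_new = rng.random((1, 3))     # Sloth-predicted skills of a larger model

# Stage 2: fit skills -> downstream score on the few observations, then apply
# the fitted map to the predicted skills of the larger model.
reg = LinearRegression().fit(skills_obs, coding_obs)
coding_new_hat = reg.predict(skills_new)
```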