- The paper demonstrates that reasoning performance does not scale monotonically with model size: test loss on held-out reasoning targets follows a U-shaped curve, with an optimal model size achieving the best balance between underfitting and overfitting.
- It uses controlled pretraining on knowledge graph triples to isolate reasoning abilities, revealing the impact of training steps, data volume, and graph complexity on optimal model size.
- The study introduces graph search entropy as a novel complexity metric and establishes a linear relationship between it and the optimal number of parameters required for effective multi-hop reasoning.
This paper investigates the relationship between the size of LMs and their reasoning capabilities, specifically focusing on the pretraining stage (arXiv:2504.03635). It challenges the common assumption that larger models inherently possess better reasoning abilities by demonstrating a "U-shaped" scaling phenomenon in a controlled environment.
The core idea is explored by pretraining LMs of varying sizes from scratch exclusively on triples from a knowledge graph (KG). Reasoning ability is then evaluated by the model's capacity to complete missing KG triples (i.e., infer unseen edges) that require multi-hop reasoning based on latent rules encoded within the graph structure. This setup aims to isolate reasoning from other language understanding complexities.
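As a rough illustration of this setup, here is a minimal Python sketch of serializing KG triples for from-scratch pretraining and holding some out for evaluation. The token format and the split logic are assumptions for illustration, not the paper's actual pipeline (which holds out specifically the rule-deducible triples):

```python
import random

def make_splits(triples, holdout_frac=0.1, seed=0):
    """Split (head, relation, tail) triples into train / held-out sets.
    In the paper the held-out triples are those deducible via latent
    rules; a random split is used here only as a stand-in."""
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

def serialize(triple):
    """Render one triple as a training sequence for a from-scratch LM,
    e.g. '<s> e12 r3 e47 </s>' (the token format is an assumption)."""
    h, r, t = triple
    return f"<s> e{h} r{r} e{t} </s>"

train, heldout = make_splits([(0, 0, 1), (1, 0, 2), (2, 1, 0)])
print([serialize(x) for x in train])
```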
Key Findings and Observations
- U-Shaped Reasoning Performance: When trained sufficiently on a fixed KG dataset, the test loss (negative log-likelihood on unseen, deducible triples) exhibits a U-shaped curve as model size increases: there exists an optimal model size that achieves the lowest loss and highest accuracy. Models smaller than this optimum lack capacity, while larger models overfit, excessively memorizing the training triples at the expense of multi-hop generalization. This contrasts with the training loss, which decreases monotonically with model size. The phenomenon was first observed on the real-world FB15K-237 KG (see the sweep sketch after this list).
- Factors Influencing the Optimal Size:
- Training Steps: The optimal model size tends to decrease initially with more training steps but stabilizes after a sufficient number of steps. The maximum achievable reasoning performance (minimum loss/maximum accuracy), however, seems capped by the dataset itself, regardless of model size or training duration beyond a certain point.
- Amount of Training Data (N): Training on more triples sampled from the same underlying KG structure increases the optimal model size and improves the best achievable reasoning performance.
- Graph Complexity (Ne, Nr): Increasing the number of entities (Ne) or relations (Nr) generally increases the graph's complexity and thus the optimal model size. More relations also tend to improve reasoning performance, potentially by reducing ambiguity.
- Number of Rules (Nh): The number of underlying logical rules used to generate the KG does not significantly affect the optimal model size but does affect peak reasoning performance, suggesting an optimal rule count that balances complexity against ambiguity.
- Deducible Triple Ratio (γ): A higher proportion of deducible triples (examples illustrating the reasoning rules) in the training data improves performance and increases the optimal model size, up to a threshold.
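To make the size sweep behind the first finding concrete, here is a minimal sketch of locating the empirical optimum. The loss values are illustrative placeholders shaped like the reported U-curve, not numbers from the paper:

```python
import numpy as np

def optimal_size(sweep):
    """Return the empirical P_opt from a model-size sweep: the parameter
    count whose final test loss (NLL on held-out deducible triples) is
    lowest. `sweep` maps parameter count -> test loss."""
    sizes = np.array(sorted(sweep))
    losses = np.array([sweep[s] for s in sizes])
    return sizes[int(np.argmin(losses))]

# Placeholder losses: loss falls with size, bottoms out, then rises as
# larger models memorize training triples instead of generalizing.
sweep = {1e6: 2.31, 4e6: 1.87, 16e6: 1.52, 64e6: 1.79, 256e6: 2.05}
print(optimal_size(sweep))  # -> 16000000.0
```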
Graph Search Entropy and Empirical Scaling Law
To quantify the complexity relevant to reasoning, the paper introduces Graph Search Entropy H(G). This metric combines the entropy rate of entities under a maximal-entropy random walk on the graph (equal to log λ, where λ is the graph's principal eigenvalue) with the entropy rate of relations conditioned on that walk.
$$H(G) = N_e \left( \log \lambda + H_r(G) \right)$$
where $N_e$ is the number of entities and $H_r(G)$ is the relation entropy rate derived from the stationary distribution and transition probabilities of the random walk.
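A minimal sketch of computing H(G) from these definitions. It assumes base-2 logarithms (so H(G) is in bits), a strongly connected graph (so the Perron eigenvectors are well defined), and equiprobable parallel relations on an edge; none of these choices are confirmed against the paper's code:

```python
import numpy as np

def graph_search_entropy(triples, num_entities):
    """Compute H(G) = N_e * (log2(lambda) + H_r(G)) for a KG given as
    (head, relation, tail) integer triples."""
    A = np.zeros((num_entities, num_entities))
    rels = {}  # (head, tail) -> set of distinct relations on that edge
    for h, r, t in triples:
        A[h, t] = 1.0
        rels.setdefault((h, t), set()).add(r)

    # Principal (Perron) eigenpairs: A v = lam * v and A^T u = lam * u.
    w, V = np.linalg.eig(A)
    k = int(np.argmax(w.real))
    lam = w[k].real
    v = np.abs(V[:, k].real)
    wl, U = np.linalg.eig(A.T)
    u = np.abs(U[:, int(np.argmax(wl.real))].real)

    # Maximal-entropy random walk: stationary distribution pi_i ~ u_i * v_i,
    # step probabilities P_ij = A_ij * v_j / (lam * v_i); the walk's entity
    # entropy rate is log2(lam).
    pi = u * v / np.dot(u, v)
    H_r = 0.0
    for (i, j), rset in rels.items():
        p_ij = v[j] / (lam * v[i])
        H_r += pi[i] * p_ij * np.log2(len(rset))  # uniform-relation assumption

    return num_entities * (np.log2(lam) + H_r)
```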
A key contribution is the discovery of a strong linear relationship between this graph search entropy and the empirically determined optimal model size $P_{\text{opt}}$ across various synthetic KGs:

$$P_{\text{opt}} \approx \alpha \cdot H(G) + \beta$$
The regression suggests that approximately 124 additional parameters are required in the optimal model for every 1-bit increase in graph search entropy, a reasoning "capacity" of roughly 0.008 bits per parameter. This is far below the ~2 bits/parameter that prior work estimated for factual memorization [allen-zhu2025physics], implying that reasoning over knowledge structures is substantially more parameter-intensive per bit of complexity.
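Spelled out, the arithmetic behind that comparison:

$$\frac{1\ \text{bit}}{124\ \text{parameters}} \approx 0.008\ \frac{\text{bits}}{\text{parameter}}, \qquad \frac{2\ \text{bits/parameter}}{0.008\ \text{bits/parameter}} \approx 250\times$$

That is, per bit of structure, reasoning demands on the order of 250× more parameters than rote storage.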
Practical Implementation Considerations
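The paper itself stops short of a deployment recipe, but the two quantities above suggest a simple model-sizing heuristic. The sketch below reuses graph_search_entropy from the earlier snippet; the slope comes from the paper's regression, while the intercept is a placeholder that would need refitting for any new setting:

```python
# Hypothetical sizing heuristic built on the scaling law above.
triples = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2)]  # toy strongly connected KG
H = graph_search_entropy(triples, num_entities=3)  # defined in the earlier sketch

ALPHA = 124.0  # parameters per bit, from the paper's regression
BETA = 0.0     # intercept placeholder -- refit against your own size sweeps

print(f"H(G) = {H:.2f} bits; estimated P_opt ~ {ALPHA * H + BETA:,.0f} parameters")
```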
Limitations
The paper's primary limitation is its simplified setting: models are pretrained directly on KG triples rather than natural language text. While this yields controlled insights, both the findings' direct applicability and the exact scaling constants may differ in real-world LLM pretraining on noisy, unstructured corpora. Verifying the results on large text corpora remains future work.
In conclusion, the paper provides compelling evidence that for reasoning tasks grounded in structured knowledge, model scaling is not monotonic. Overparameterization can hinder reasoning performance due to excessive memorization. The optimal model size appears linked to the intrinsic complexity of the underlying knowledge graph, quantifiable by graph search entropy, offering a potential avenue for optimizing model selection for reasoning-intensive applications.