Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning (2504.03635v1)

Published 4 Apr 2025 in cs.AI and cs.CL

Abstract: LLMs have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain LMs from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.

Summary

  • The paper demonstrates that reasoning performance follows a U-shaped curve, with an optimal model size achieving the best balance between underfitting and overfitting.
  • It uses controlled pretraining on knowledge graph triples to isolate reasoning abilities, revealing the impact of training steps, data volume, and graph complexity on optimal model size.
  • The study introduces graph search entropy as a novel metric, establishing a linear relationship to estimate the optimal number of parameters required for effective multi-hop reasoning.

This paper investigates the relationship between the size of LMs and their reasoning capabilities, specifically focusing on the pretraining stage (2504.03635). It challenges the common assumption that larger models inherently possess better reasoning abilities by demonstrating a "U-shaped" scaling phenomenon in a controlled environment.

The core idea is explored by pretraining LMs of varying sizes from scratch exclusively on triples from a knowledge graph (KG). Reasoning ability is then evaluated by the model's capacity to complete missing KG triples (i.e., infer unseen edges) that require multi-hop reasoning based on latent rules encoded within the graph structure. This setup aims to isolate reasoning from other language understanding complexities.
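
To make this setup concrete, the sketch below builds a toy version of such an environment: base edges for two relations, a latent two-hop composition rule, and a held-out slice of rule-implied edges standing in for the missing triples. The rule, graph sizes, and 20% hold-out are illustrative assumptions, not the paper's actual generator.

```python
# Toy version of the pretraining environment: triples from an incomplete KG,
# with rule-implied edges held out as the reasoning test set.
# Rule, sizes, and split ratio are illustrative assumptions.
import random

rng = random.Random(0)
NUM_ENTITIES = 50

# Base facts: random directed edges for relations r1 and r2.
r1 = {(rng.randrange(NUM_ENTITIES), rng.randrange(NUM_ENTITIES)) for _ in range(100)}
r2 = {(rng.randrange(NUM_ENTITIES), rng.randrange(NUM_ENTITIES)) for _ in range(100)}

# Latent rule: r3(x, z) holds whenever r1(x, y) and r2(y, z) hold for some y.
r3 = sorted({(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2})

# Hold out 20% of the rule-implied edges: the model pretrains on everything
# else and is evaluated on inferring these unseen, deducible triples.
rng.shuffle(r3)
cut = len(r3) // 5
held_out = [("r3", h, t) for (h, t) in r3[:cut]]
train = ([("r1", h, t) for (h, t) in r1]
         + [("r2", h, t) for (h, t) in r2]
         + [("r3", h, t) for (h, t) in r3[cut:]])
print(f"{len(train)} training triples, {len(held_out)} held-out deducible triples")
```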

Key Findings and Observations

  1. U-Shaped Reasoning Performance: When trained sufficiently on a fixed KG dataset, the test loss (negative log-likelihood on unseen, deducible triples) exhibits a U-shaped curve as model size increases: there is an optimal model size that achieves the best reasoning performance (lowest loss, highest accuracy). Smaller models lack capacity, while larger models tend to overfit, excessively memorizing the training triples at the expense of generalization and multi-hop reasoning. This contrasts with the training loss, which decreases monotonically with model size. The phenomenon was initially observed on the real-world FB15K-237 KG (a trivial sketch of reading the optimum off such a size sweep follows this list).
  2. Factors Influencing the Optimal Size:
    • Training Steps: The optimal model size tends to decrease initially with more training steps but stabilizes after a sufficient number of steps. The maximum achievable reasoning performance (minimum loss/maximum accuracy), however, seems capped by the dataset itself, regardless of model size or training duration beyond a certain point.
    • Amount of Training Data ($N$): Training on more triples sampled from the same underlying KG structure increases the optimal model size and improves the best achievable reasoning performance.
    • Graph Complexity ($N_e$, $N_r$): Increasing the number of entities ($N_e$) or relations ($N_r$) in the KG generally increases the graph's complexity, leading to a larger optimal model size. More relations also tended to improve reasoning performance, potentially by reducing ambiguity.
    • Number of Rules ($N_h$): The number of underlying logical rules used to generate the KG did not significantly impact the optimal model size but did affect the peak reasoning performance, suggesting an optimal rule count exists for balancing complexity and ambiguity.
    • Deducible Triple Ratio ($\gamma$): A higher proportion of deducible triples (examples illustrating the reasoning rules) in the training data improved performance and increased the optimal model size up to a certain threshold.
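
In all of these ablations, the optimal size is located empirically: sweep model sizes on a fixed KG and take the size with the lowest test loss. A trivial sketch with invented, U-shaped numbers:

```python
# Reading the optimal model size off a size sweep; the losses below are
# invented purely to show the U-shape described above.
model_sizes = [1e5, 1e6, 1e7, 1e8, 1e9]   # parameter counts swept
test_losses = [2.9, 2.1, 1.7, 2.0, 2.6]   # dips, then rises as memorization sets in

best = min(range(len(model_sizes)), key=test_losses.__getitem__)
print(f"optimal size ~ {model_sizes[best]:.0e} params (test loss {test_losses[best]})")
```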

Graph Search Entropy and Empirical Scaling Law

To quantify the complexity relevant to reasoning, the paper introduces Graph Search Entropy $H(G)$. This metric combines the entropy rate of entities during a maximal-entropy random walk on the graph (related to the graph's principal eigenvalue $\lambda$) and the entropy rate of relations conditioned on that walk.

$$H(G) = N_e \left( \log(\lambda) + H^r(G) \right)$$

where $N_e$ is the number of entities and $H^r(G)$ is the relation entropy rate derived from the stationary distribution and transition probabilities of the random walk.
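
The sketch below computes $H(G)$ from these definitions under simplifying assumptions that are ours rather than the paper's: the graph is symmetrized so the standard maximal-entropy random walk (MERW) construction applies, relation labels on a traversed edge are treated as uniformly likely, and entropies are measured in bits.

```python
# Computing H(G) = N_e * (log(lambda) + H^r(G)) for a labeled graph.
# Assumptions (ours): symmetrized graph, uniform choice among the relation
# labels of a traversed edge, entropies in bits.
import numpy as np

def graph_search_entropy(triples, num_entities):
    """triples: iterable of (head, relation, tail) with entity IDs 0..num_entities-1."""
    A = np.zeros((num_entities, num_entities))
    labels = {}
    for h, r, t in triples:
        for i, j in ((h, t), (t, h)):          # symmetrize edges and labels
            A[i, j] = 1.0
            labels.setdefault((i, j), set()).add(r)

    # Principal eigenpair of the symmetric adjacency matrix; the entity
    # entropy rate of the maximal-entropy random walk (MERW) is log2(lambda).
    eigvals, eigvecs = np.linalg.eigh(A)
    lam, psi = eigvals[-1], np.abs(eigvecs[:, -1])

    # MERW stationary distribution pi_i ~ psi_i^2 and transitions
    # P_ij = A_ij * psi_j / (lam * psi_i).
    pi = psi**2 / np.sum(psi**2)
    h_rel = 0.0                                # relation entropy rate H^r(G)
    for (i, j), rels in labels.items():
        if psi[i] > 1e-12:
            p_step = pi[i] * psi[j] / (lam * psi[i])   # P(at i, step to j)
            h_rel += p_step * np.log2(len(rels))

    return num_entities * (np.log2(lam) + h_rel)

# Tiny example: a 4-entity cycle with a single relation.
print(graph_search_entropy([(0, 0, 1), (1, 0, 2), (2, 0, 3), (3, 0, 0)], 4))
```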

A key contribution is the discovery of a strong linear relationship between this graph search entropy and the empirically determined optimal model size ($P_{opt}$) across various synthetic KGs:

$$P_{opt} \approx \alpha \cdot H(G) + \beta$$

The regression suggests that approximately 124 additional parameters are required in the optimal model for every 1-bit increase in graph search entropy. This implies that reasoning over knowledge structures is significantly more parameter-intensive per unit of complexity (~0.008 bits/parameter) compared to factual memorization, which prior work estimated at ~2 bits/parameter [allen-zhu2025physics].
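
A short sketch of fitting and applying the law; the $(H(G), P_{opt})$ pairs below are invented, chosen only to be roughly consistent with the reported slope of ~124 parameters per bit.

```python
# Fitting P_opt ≈ alpha * H(G) + beta on invented (entropy, optimal-size)
# pairs, then using the fit to budget a model for a new graph.
import numpy as np

H_vals = np.array([2.0e4, 5.0e4, 1.0e5, 2.0e5])   # graph search entropy (bits), invented
P_opt  = np.array([2.6e6, 6.3e6, 1.25e7, 2.49e7]) # optimal params from sweeps, invented

alpha, beta = np.polyfit(H_vals, P_opt, 1)
print(f"alpha ≈ {alpha:.0f} params/bit, beta ≈ {beta:.2e}")

H_new = 8.0e4                                      # entropy of a new target KG
print(f"predicted optimal size ≈ {alpha * H_new + beta:.2e} parameters")
```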

Practical Implementation Considerations

  • Model Selection for Reasoning: For applications requiring strong reasoning over a known knowledge domain (which could potentially be represented as a KG), simply using the largest available model might not be optimal. There could be a sweet spot determined by the complexity of the knowledge.
  • Estimating Optimal Size: The graph search entropy provides a potential heuristic. One could:

    1. Construct a KG representing the core knowledge of the pretraining corpus or target domain (using automated KG construction tools [zhong2023comprehensive]).
    2. Compute the graph search entropy $H(G)$ for this KG. The computation involves finding the principal eigenvalue and eigenvector of the adjacency matrix and calculating transition probabilities.
    3. Use the derived linear scaling law ($P_{opt} \approx 124 \cdot H(G) + \beta$) to estimate the potentially optimal model size for reasoning over this specific knowledge structure.
  • Data Preparation: The KG triples were tokenized by assigning random character IDs to entities and relations to remove lexical cues. LMs (based on LLaMA architecture) were trained using a standard cross-entropy loss on the triple prediction task.

  • Evaluation: Reasoning was tested using a 10-option multiple-choice format for predicting the tail entity given the head and relation, focusing purely on selecting the correct inferred entity ID (a schematic of this protocol follows this list).
  • Training: Models require sufficient training steps (e.g., 10k steps in the synthetic experiments) for the optimal model size to stabilize. Training involved repeating the KG triples multiple times (epochs), analogous to how factual knowledge might appear repeatedly in diverse forms within large text corpora.
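
A schematic of the multiple-choice protocol, with a random stub standing in for the trained LM's log-likelihood scorer (so expected accuracy here is chance, ~10%):

```python
# 10-way multiple-choice evaluation: score the true tail against 9 distractors
# and take the argmax. `score_tail` is a placeholder stub, not the paper's model.
import random

def score_tail(head, relation, tail):
    """Placeholder for the LM's log-likelihood of `tail` given (head, relation)."""
    return random.random()

def multiple_choice_accuracy(test_triples, all_entities, num_options=10, seed=0):
    rng = random.Random(seed)
    correct = 0
    for h, r, true_t in test_triples:
        distractors = rng.sample([e for e in all_entities if e != true_t], num_options - 1)
        options = distractors + [true_t]
        pred = max(options, key=lambda t: score_tail(h, r, t))
        correct += pred == true_t
    return correct / len(test_triples)

print(multiple_choice_accuracy([(0, 0, 1), (2, 0, 3)], all_entities=range(20)))
```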

Limitations

The paper's primary limitation is its simplified setting using KGs directly for pretraining, rather than natural language text. While providing controlled insights, the direct applicability and the exact scaling constants might differ in real-world LLM pretraining scenarios involving noisy, unstructured text. Verifying these findings on large text corpora remains future work.

In conclusion, the paper provides compelling evidence that for reasoning tasks grounded in structured knowledge, model scaling is not monotonic. Overparameterization can hinder reasoning performance due to excessive memorization. The optimal model size appears linked to the intrinsic complexity of the underlying knowledge graph, quantifiable by graph search entropy, offering a potential avenue for optimizing model selection for reasoning-intensive applications.
