
LoRe-Bench: Evaluating Reasoning Models

Updated 22 December 2025
  • LoRe-Bench is a systematic benchmark that evaluates large reasoning models by measuring their alignment with foundational reasoning laws such as monotonicity and compositionality.
  • It employs a dual sub-benchmark approach—LoRe-Mono and LoRe-Compo—to test compute scaling and accuracy decay using generated task variants and composite questions.
  • The framework shows that fine-tuning for compositionality significantly enhances both internal compute allocation and external problem-solving accuracy across diverse domains.

LoRe-Bench is a systematic benchmark for evaluating large reasoning models (LRMs) with respect to foundational laws of reasoning: monotonicity and compositionality. Developed in the context of the Laws of Reasoning (LoRe) framework, LoRe-Bench provides a rigorous empirical methodology for assessing whether models allocate reasoning compute and accuracy in patterns consistent with theoretical ideals. It offers a reproducible, interpretable, and quantifiable standard for evaluating and improving reasoning behaviors of LRMs, addressing both intra-task difficulty scaling and multi-concept compositionality (Zhang et al., 19 Dec 2025).

1. Theoretical Background: The Laws of Reasoning

LoRe formalizes desirable reasoning patterns for LRMs as two parameterized laws relating question complexity $\kappa(x)$, defined as the minimal number of primitive computation steps required for a solution, to observable model behaviors:

  • Compute Law (Hypothesis 2.1): The expected chain-of-thought length, or reasoning compute $C_\theta(x) \equiv \mathbb{E}_{r \sim p_\theta(\cdot \mid x)}[\ell(r)]$, grows linearly with complexity:

$$C_\theta(x) = \alpha_\theta\, \kappa(x) + o(\kappa(x)),$$

with scaling coefficient $\alpha_\theta > 0$.

  • Accuracy Law (Hypothesis 2.4): The probability of a correct answer $A_\theta(x) = \mathbb{P}[\mathrm{ans}(y) = a^\star(x)]$ decays exponentially with complexity:

$$A_\theta(x) = \exp(-\lambda_\theta\, \kappa(x)),$$

or equivalently $\log A_\theta(x) \propto -\kappa(x)$, where $\lambda_\theta \ge 0$ is the decay rate.

Direct computation or approximation of $\kappa(x)$, grounded in the verifier-certified solution length on a universal Turing machine, is intractable. To operationalize evaluation, LoRe-Bench relies on two tractable proxies: monotonicity and compositionality.

2. Monotonicity and Compositionality: Operationalizing the Laws

Monotonicity captures the requirement that reasoning compute and accuracy respond to ordered complexity:

  • If $\kappa(x_1) \le \kappa(x_2)$, then $C_\theta(x_1) \le C_\theta(x_2)$ and $A_\theta(x_1) \ge A_\theta(x_2)$.

Compositionality (Independence) demands additive and multiplicative reasoning for independent tasks:

  • For independent $x_1, x_2$ (disjoint concept sets) and composite $x_1 \oplus x_2$:

$$\begin{aligned} \kappa(x_1 \oplus x_2) &= \kappa(x_1) + \kappa(x_2) \\ C_\theta(x_1 \oplus x_2) &= C_\theta(x_1) + C_\theta(x_2) + o(\kappa_1 + \kappa_2) \\ A_\theta(x_1 \oplus x_2) &= A_\theta(x_1) \cdot A_\theta(x_2) \end{aligned}$$
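These conditions can be checked directly from per-question measurements: the composite's compute should deviate from the sub-question sum by roughly zero, and its log-accuracy from the sub-question log-sum likewise. A minimal sketch with hypothetical measurements (the numbers are not from the paper):

```python
from math import log

def compositionality_residuals(c1, c2, c12, a1, a2, a12):
    """Deviation of a composite question from the ideal additive compute
    and multiplicative accuracy predicted by the compositionality law.
    Inputs are per-question compute (c*) and accuracy (a*) measurements."""
    compute_residual = c12 - (c1 + c2)                  # ideal: ~0
    log_acc_residual = log(a12) - (log(a1) + log(a2))   # ideal: ~0
    return compute_residual, log_acc_residual

# Hypothetical example: the composite uses more compute than the sum of
# its parts and is less accurate than the product of their accuracies.
dc, dla = compositionality_residuals(300, 400, 820, 0.9, 0.8, 0.6)
print(dc, round(dla, 3))  # 120 -0.182
```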

Under mild regularity, strict monotonicity and compositionality jointly imply the compute and accuracy laws (see Propositions 3.1 & 3.2 in (Zhang et al., 19 Dec 2025)).

3. LoRe-Bench Structure and Methodology

LoRe-Bench consists of two complementary sub-benchmarks to assess monotonicity and compositionality:

LoRe-Mono (Monotonicity):

  • Domains: Mathematics, science, language, and code.
  • Seed Templates: 10 per domain.
  • Variants: 30 per seed, generated by stacking $N$ primitive operations (e.g., $N$ matrix multiplications or string mutations), ensuring $\kappa$ scales with $N$.
  • Verification: Programmatic generation plus manual spot checks to prevent trivial algorithmic shortcuts.
  • Evaluation: For each model and $N$, sample 8 outputs to estimate $\mathbb{E}[\ell(r)]$, recording both compute and accuracy. Compute Spearman rank correlations $\rho$ between the variant index (a proxy for $\kappa$) and $C_\theta$ (expected near $+1$), and between $\kappa$ and $\log A_\theta$ (expected near $-1$).
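Spearman $\rho$ is just the Pearson correlation of the ranks, which keeps it sensitive to ordering rather than exact linearity. A self-contained sketch (illustrative data; in practice `scipy.stats.spearmanr` would be used):

```python
def _ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Perfectly monotone compute vs. variant index gives rho = +1.
print(round(spearman_rho([1, 2, 3, 4], [110, 230, 350, 480]), 6))  # 1.0
```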

LoRe-Compo (Compositionality):

  • Data Source: 250 question pairs from MATH500, with pairs drawn from distinct subjects to ensure conceptual independence.
  • Composite Construction: Sub-questions concatenated with a fixed phrase (“Answer these in order: ...”).
  • Metrics: With $f_\theta$ denoting either $C_\theta$ or $\log A_\theta$, compute the mean absolute deviation

$$\mathrm{MAD}_f = \frac{1}{|D|}\sum_{(x_1, x_2) \in D} \left| f_\theta(x_1 \oplus x_2) - [f_\theta(x_1) + f_\theta(x_2)] \right|,$$

and normalize to yield $n\mathrm{MAD}_f = \mathrm{MAD}_f / S_f$ with $S_f = \frac{1}{|D|}\sum |f_\theta(x_1) + f_\theta(x_2)|$. Smaller $n\mathrm{MAD}$ indicates stronger compositionality.
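The metric is a short computation over the paired measurements; a minimal sketch (hypothetical numbers, and assuming the normalizer $S_f$ is the mean of $|f_\theta(x_1)+f_\theta(x_2)|$ over pairs):

```python
def nmad(f1, f2, f12):
    """Normalized mean absolute deviation of composite values f12 from
    the additive prediction f1 + f2 (the LoRe-Compo metric).  f1, f2,
    f12 are parallel sequences over question pairs; values illustrative."""
    n = len(f12)
    mad = sum(abs(c - (a + b)) for a, b, c in zip(f1, f2, f12)) / n
    scale = sum(abs(a + b) for a, b in zip(f1, f2)) / n  # normalizer S_f
    return mad / scale

# Two composite questions whose compute overshoots the sub-question sum
# by 20% on average.
print(nmad([300, 250], [400, 350], [840, 720]))  # 0.2
```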

Table 1: LoRe-Bench Sub-Benchmark Features

| Sub-benchmark | Domain Coverage | Evaluation Metric |
|---|---|---|
| LoRe-Mono | Math, science, language, code | Spearman ρ (compute / accuracy) |
| LoRe-Compo | Math (independent pairs) | nMAD (compute, log accuracy) |

4. Empirical Assessment of State-of-the-Art Models

Ten open-source LRMs (e.g., Qwen, LLaMA, Phi-4-mini, OpenReasoning, Sky-T1) and two reasoning-length-controlled variants (Thinkless, AdaptThink) were evaluated:

  • Monotonicity: Most models reach $\rho(C_\theta, \kappa) \approx 0.95$–$0.99$ in mathematics and science. Weak monotonicity (including non-monotonicity) is observed in some domains for smaller models (e.g., Qwen-1.5B in language and code).
  • Compositionality: All evaluated models exhibit high $n\mathrm{MAD}$ (0.32–0.53 for compute, 0.7–2.4 for log accuracy), with many $C_\theta(x_1 \oplus x_2)$ values deviating considerably from the expected sums, indicating both under- and over-allocation of compute on composite tasks. Adaptation mechanisms based solely on inference-time dynamic reasoning length do not yield compositional chains.
  • Scatter Characteristics: Scatter plots of $C_\theta(x_1 \oplus x_2)$ against $C_\theta(x_1) + C_\theta(x_2)$ for composite pairs reveal widespread departure from the ideal diagonal, diagnostic of compositionality failures.

5. Enforcing Reasoning Laws via Fine-Tuning

SFT-Compo is a supervised fine-tuning method designed to align composite compute to the sum of compute on sub-questions:

  • Triplet Sampling: Generate independent triplets $(x_1, x_2, x_1 \oplus x_2)$ using DeepScaler; sample $K = 8$ chain-of-thought (CoT) + answer outputs from a strong teacher (DeepSeek-14B).
  • Selection: Among correct answers, select $(r_1^*, r_2^*, r_{12}^*) = \arg\min |\ell(r_1) + \ell(r_2) - \ell(r_{12})|$ to minimize the length discrepancy between the composite chain and the sum of the sub-chains.
  • Supervision: Collect $Q = \{(x_1, o_1^*), (x_2, o_2^*), (x_{12}, o_{12}^*)\}$ (approx. 3.9k examples).
  • Fine-tuning: Train the student LRM on $Q$ for 5 epochs with standard cross-entropy, batch size 16, and a grid search over learning rates in $\{1 \times 10^{-6},\ 5 \times 10^{-6},\ 5 \times 10^{-5}\}$. No additional regularization is applied.
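The selection step above reduces to an argmin over candidate rationale triplets. A minimal sketch, with toy rationales represented directly by their lengths and a hypothetical `length` function standing in for a tokenizer-based counter:

```python
from itertools import product

def select_compo_triplet(r1_cands, r2_cands, r12_cands, length):
    """SFT-Compo selection step: among correct teacher rationales for
    x1, x2, and their composite, pick the triplet whose composite chain
    length is closest to the sum of the sub-chain lengths.  `length`
    maps a rationale to its length (assumed tokenizer-based in practice)."""
    return min(
        product(r1_cands, r2_cands, r12_cands),
        key=lambda t: abs(length(t[0]) + length(t[1]) - length(t[2])),
    )

# Toy example: rationales are just their token counts, so length is identity.
best = select_compo_triplet([100, 140], [200, 260], [310, 420],
                            length=lambda r: r)
print(best)  # (100, 200, 310): |100 + 200 - 310| = 10 is the minimum
```

With $K = 8$ samples per question this is at most $8^3$ comparisons per triplet, so brute force is fine.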

6. Impact of Compositionality Fine-Tuning on Reasoning Performance

  • Compositionality Gains: After SFT-Compo, $n\mathrm{MAD}_C$ decreases by 40.5% (1.5B) and 22.5% (8B), with composite vs. sub-sum compute converging toward the ideal diagonal.
  • Generalization: Across GSM8K, MATH500, AIME’24/’25, AMC’23, and OlympiadBench, pass@1 accuracy uniformly improves: +4.8 percentage points (1.5B) and +5.0 (8B), consistently outperforming vanilla SFT on randomly sampled correct rationales.
  • Law Synergies: Enforcing compositionality also enhances monotonicity (e.g., Qwen-1.5B’s Spearman $\rho$ in code rises from 0.151 to 0.914) and reduces $n\mathrm{MAD}_{\log A}$ by 71.1% (1.5B) and 35.4% (7B). This suggests that enforcing one law can reinforce the others, moving models toward theoretically ideal reasoning patterns.

7. Significance and Conclusion

LoRe-Bench demonstrates that abstract, theoretically principled laws of reasoning can be instantiated in practical, reproducible benchmarks for large reasoning models. By identifying monotonicity and compositionality as tractable, diagnostic properties, and showing that targeted fine-tuning along these axes yields measurable improvements in both internal compute allocation and external problem-solving accuracy, LoRe-Bench bridges foundational theory and applied model assessment (Zhang et al., 19 Dec 2025). A plausible implication is that further advances in LRM architectures and training regimes may benefit from explicit alignment with LoRe-style behavioral laws, offering a path forward for the systematic development of more predictable and robust reasoning models.
