LoRe-Bench: Evaluating Reasoning Models
- LoRe-Bench is a systematic benchmark that evaluates large reasoning models by measuring their alignment with foundational reasoning laws such as monotonicity and compositionality.
- It employs a dual sub-benchmark approach—LoRe-Mono and LoRe-Compo—to test compute scaling and accuracy decay using generated task variants and composite questions.
- The framework shows that fine-tuning for compositionality significantly enhances both internal compute allocation and external problem-solving accuracy across diverse domains.
LoRe-Bench is a systematic benchmark for evaluating large reasoning models (LRMs) with respect to foundational laws of reasoning: monotonicity and compositionality. Developed in the context of the Laws of Reasoning (LoRe) framework, LoRe-Bench provides a rigorous empirical methodology for assessing whether models allocate reasoning compute and accuracy in patterns consistent with theoretical ideals. It offers a reproducible, interpretable, and quantifiable standard for evaluating and improving reasoning behaviors of LRMs, addressing both intra-task difficulty scaling and multi-concept compositionality (Zhang et al., 19 Dec 2025).
1. Theoretical Background: The Laws of Reasoning
LoRe formalizes desirable reasoning patterns for LRMs as two parameterized laws relating question complexity K(q) (defined as the minimal number of primitive computation steps required for a solution) to observable model behaviors:
- Compute Law (Hypothesis 2.1): The expected chain-of-thought length, or reasoning compute C(q), grows linearly with complexity, i.e., E[C(q)] = α · K(q) + c, with scaling coefficient α > 0.
- Accuracy Law (Hypothesis 2.4): The probability of a correct answer decays exponentially with complexity: Acc(q) = exp(−λ · K(q)), or equivalently log Acc(q) = −λ · K(q), where λ > 0 is the decay rate.
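As a numeric illustration, writing Acc(q) for accuracy and K(q) for complexity, the Accuracy Law predicts exponential decay in accuracy and, equivalently, a linear fall in log-accuracy. The decay rate below is a hypothetical value for illustration only, not one reported by the paper:

```python
import math

LAM = 0.15  # hypothetical decay rate lambda, for illustration only

def predicted_accuracy(k):
    """Accuracy Law: Acc(q) = exp(-lambda * K(q)) for complexity K(q) = k."""
    return math.exp(-LAM * k)

# Equivalently, log Acc(q) = -lambda * K(q): log-accuracy is linear in complexity
for k in (1, 5, 10):
    assert abs(math.log(predicted_accuracy(k)) + LAM * k) < 1e-12
```

Under this form, each additional primitive step multiplies accuracy by the same constant factor exp(−λ), which is what makes log-accuracy the natural scale for the compositionality analysis later on.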
Direct computation or approximation of K(q), grounded in the verifier-certified solution length on a universal Turing machine, is intractable. To operationalize evaluation, LoRe-Bench relies on two tractable proxies: monotonicity and compositionality.
2. Monotonicity and Compositionality: Operationalizing the Laws
Monotonicity captures the requirement that reasoning compute and accuracy respond consistently to ordered complexity:
- If K(q1) ≤ K(q2), then C(q1) ≤ C(q2) and Acc(q1) ≥ Acc(q2).
Compositionality (Independence) demands additive compute and multiplicative accuracy for independent tasks:
- For independent q1, q2 (disjoint concept sets) and composite q1 ∘ q2: C(q1 ∘ q2) = C(q1) + C(q2) and Acc(q1 ∘ q2) = Acc(q1) · Acc(q2), i.e., log-accuracy is additive.
Under mild regularity, strict monotonicity and compositionality jointly imply the compute and accuracy laws (see Propositions 3.1 & 3.2 in (Zhang et al., 19 Dec 2025)).
3. LoRe-Bench Structure and Methodology
LoRe-Bench consists of two complementary sub-benchmarks to assess monotonicity and compositionality:
LoRe-Mono (Monotonicity):
- Domains: Mathematics, science, language, and code.
- Seed Templates: 10 per domain.
- Variants: 30 per seed, generated by stacking primitive operations (e.g., matrix multiplications, string mutations), ensuring that K(q) scales with the variant index.
- Verification: Programmatic generation plus manual spot checks to prevent trivial algorithmic shortcuts.
- Evaluation: For each model and variant, sample 8 outputs to estimate mean compute and accuracy. Compute Spearman rank correlations between the variant index (a proxy for K) and compute C (expected near +1), and between the index and accuracy Acc (expected near −1).
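The monotonicity evaluation reduces to rank correlations; a minimal pure-Python sketch follows, with hypothetical per-variant statistics and no tie handling:

```python
def spearman(xs, ys):
    """Spearman rank correlation via Pearson correlation of ranks (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

variant_idx = [1, 2, 3, 4, 5]
mean_cot_len = [120, 180, 260, 310, 400]   # hypothetical mean CoT lengths
accuracy = [0.95, 0.88, 0.71, 0.60, 0.42]  # hypothetical pass rates (8 samples each)

rho_compute = spearman(variant_idx, mean_cot_len)  # 1.0: compute rises with complexity
rho_acc = spearman(variant_idx, accuracy)          # -1.0: accuracy falls with complexity
```

In practice a library routine with proper tie handling (e.g., `scipy.stats.spearmanr`) would be used; the sketch only shows the quantity being computed.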
LoRe-Compo (Compositionality):
- Data Source: 250 question pairs from MATH500, with pairs drawn from distinct subjects to ensure conceptual independence.
- Composite Construction: Sub-questions concatenated by a fixed phrase (“Answer these in order: ...”).
- Metrics: For f being either compute C or log accuracy log Acc, compute the mean absolute deviation MAD_f = E[ |f(q1 ∘ q2) − (f(q1) + f(q2))| ], and normalize to yield nMAD_f ≥ 0. Smaller nMAD_f indicates stronger compositionality.
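The deviation metric can be sketched in a few lines; the normalization used below (dividing by the mean magnitude of the ideal sums) is an assumed choice, since the paper's exact normalizer is not spelled out here:

```python
def mad(f_comp, f_sub1, f_sub2):
    """Mean absolute deviation of composite values from the sum of sub-values."""
    devs = [abs(c - (a + b)) for c, a, b in zip(f_comp, f_sub1, f_sub2)]
    return sum(devs) / len(devs)

def nmad(f_comp, f_sub1, f_sub2):
    """Normalized MAD; the normalizer (mean |f(q1) + f(q2)|) is an assumption."""
    scale = sum(abs(a + b) for a, b in zip(f_sub1, f_sub2)) / len(f_sub1)
    return mad(f_comp, f_sub1, f_sub2) / scale

# Perfectly compositional compute: composite equals the sub-sum, so nMAD = 0
assert nmad([300.0, 450.0], [120.0, 200.0], [180.0, 250.0]) == 0.0
```

The same function applies to both metrics: pass mean CoT lengths for the compute test, and per-question log-accuracies for the accuracy test.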
Table 1: LoRe-Bench Sub-Benchmark Features
| Sub-benchmark | Domain Coverage | Evaluation Metric |
|---|---|---|
| LoRe-Mono | Math, science, language, code | Spearman ρ (compute/accuracy) |
| LoRe-Compo | Math (independent pairs) | nMAD (compute, log accuracy) |
4. Empirical Assessment of State-of-the-Art Models
Ten open-source LRMs (e.g., Qwen, LLaMA, Phi-4-mini, OpenReasoning, Sky-T1) and two reasoning-length-controlled variants (Thinkless, AdaptThink) were evaluated:
- Monotonicity: Most models reach Spearman correlations approaching 0.99 in mathematics and science. Weak monotonicity (including non-monotonicity) is observed in some domains for smaller models (e.g., Qwen-1.5B in language and code).
- Compositionality: All evaluated models exhibit high nMAD (0.32–0.53 for compute, 0.7–2.4 for log accuracy), with many composite values deviating considerably from the expected sums, indicating both under- and over-allocation of compute on composite tasks. Adaptation mechanisms based solely on inference-time dynamic reasoning length do not yield compositional chains.
- Scatter Characteristics: Scatter plots of composite compute C(q1 ∘ q2) against the sub-sum C(q1) + C(q2) reveal widespread departure from the ideal diagonal, diagnostic of compositionality failures.
5. Enforcing Reasoning Laws via Fine-Tuning
SFT-Compo is a supervised fine-tuning method designed to align composite compute to the sum of compute on sub-questions:
- Triplet Sampling: Generate independent triplets (q1, q2, q1 ∘ q2) using DeepScaler; sample chain-of-thought (CoT) + answer outputs from a strong teacher (DeepSeek-14B).
- Selection: For each triplet whose answers are all correct, select the CoTs that minimize the length discrepancy between the composite chain and the summed sub-chains.
- Supervision: Collect the resulting supervision set (approx. 3.9k examples).
- Fine-tuning: Train the student LRM on this set for 5 epochs with standard cross-entropy, batch size 16, and a grid search over learning rates. No additional regularization is applied.
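The length-matching selection step can be sketched as follows; the candidate pools and the use of character count as the length measure are illustrative assumptions, standing in for whatever compute measure the paper uses:

```python
from itertools import product

def select_triplet(comp_cands, sub1_cands, sub2_cands):
    """Among correct-answer CoT candidates, pick the (composite, sub1, sub2)
    triple minimizing |len(composite) - (len(sub1) + len(sub2))|.
    Character count stands in for the compute measure (an assumption)."""
    return min(
        product(comp_cands, sub1_cands, sub2_cands),
        key=lambda t: abs(len(t[0]) - (len(t[1]) + len(t[2]))),
    )
```

Selecting for minimal discrepancy biases the supervision set toward composite chains whose compute already equals the sub-sum, which is exactly the compositional behavior SFT-Compo is meant to instill.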
6. Impact of Compositionality Fine-Tuning on Reasoning Performance
- Compositionality Gains: Post SFT-Compo, the compute nMAD decreases by 40.5% (1.5B) and 22.5% (8B), with composite vs. sub-sum compute converging to the ideal diagonal.
- Generalization: Across GSM8K, MATH500, AIME’24/’25, AMC’23, OlympiadBench, pass@1 accuracy uniformly improves: +4.8 percentage points (1.5B) and +5.0 (8B), consistently outperforming vanilla SFT based on randomly sampled correct rationales.
- Law Synergies: Enforcing compositionality not only improves compositionality metrics but also enhances monotonicity (e.g., Qwen-1.5B Spearman ρ in code rises from 0.151 to 0.914) and reduces the log-accuracy nMAD by 71.1% (1.5B) and 35.4% (7B). This suggests that enforcing one law can reinforce others, advancing models toward theoretically ideal reasoning patterns.
7. Significance and Conclusion
LoRe-Bench demonstrates that abstract, theoretically principled laws of reasoning can be instantiated in practical, reproducible benchmarks for large reasoning models. By identifying monotonicity and compositionality as tractable, diagnostic properties, and showing that targeted fine-tuning along these axes yields measurable improvements in both internal compute allocation and external problem-solving accuracy, LoRe-Bench bridges foundational theory and applied model assessment (Zhang et al., 19 Dec 2025). A plausible implication is that further advances in LRM architectures and training regimes may benefit from explicit alignment with LoRe-style behavioral laws, offering a path forward for the systematic development of more predictable and robust reasoning models.