LoRe-Bench: Evaluating Reasoning Models
- LoRe-Bench is a systematic benchmark that evaluates large reasoning models by measuring their alignment with foundational reasoning laws such as monotonicity and compositionality.
- It employs a dual sub-benchmark approach—LoRe-Mono and LoRe-Compo—to test compute scaling and accuracy decay using generated task variants and composite questions.
- The framework shows that fine-tuning for compositionality significantly enhances both internal compute allocation and external problem-solving accuracy across diverse domains.
LoRe-Bench is a systematic benchmark for evaluating large reasoning models (LRMs) with respect to foundational laws of reasoning: monotonicity and compositionality. Developed in the context of the Laws of Reasoning (LoRe) framework, LoRe-Bench provides a rigorous empirical methodology for assessing whether models allocate reasoning compute and accuracy in patterns consistent with theoretical ideals. It offers a reproducible, interpretable, and quantifiable standard for evaluating and improving reasoning behaviors of LRMs, addressing both intra-task difficulty scaling and multi-concept compositionality (Zhang et al., 19 Dec 2025).
1. Theoretical Background: The Laws of Reasoning
LoRe formalizes desirable reasoning patterns for LRMs as two parameterized laws relating question complexity K(q) (defined as the minimal number of primitive computation steps required for a solution) to observable model behaviors:
- Compute Law (Hypothesis 2.1): The expected chain-of-thought length, or reasoning compute C(q), grows linearly with complexity, i.e., E[C(q)] = α · K(q) + c, with scaling coefficient α > 0.
- Accuracy Law (Hypothesis 2.4): The probability of a correct answer decays exponentially with complexity: Acc(q) = exp(−λ · K(q)), or equivalently log Acc(q) = −λ · K(q), where λ > 0 is the decay rate.
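As a numeric illustration, writing Acc(q) for accuracy and K(q) for complexity, the Accuracy Law predicts exponential decay in accuracy and, equivalently, a linear fall in log-accuracy. The decay rate below is a hypothetical value for illustration only, not one reported by the paper:

```python
import math

LAM = 0.15  # hypothetical decay rate lambda, for illustration only

def predicted_accuracy(k):
    """Accuracy Law: Acc(q) = exp(-lambda * K(q)) for complexity K(q) = k."""
    return math.exp(-LAM * k)

# Equivalently, log Acc(q) = -lambda * K(q): log-accuracy is linear in complexity
for k in (1, 5, 10):
    assert abs(math.log(predicted_accuracy(k)) + LAM * k) < 1e-12
```

Under this form, each additional primitive step multiplies accuracy by the same constant factor exp(−λ), which is what makes log-accuracy the natural scale for the compositionality analysis later on.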
Direct computation or approximation of K(q), grounded in the verifier-certified solution length on a universal Turing machine, is intractable. To operationalize evaluation, LoRe-Bench relies on two tractable proxies: monotonicity and compositionality.
2. Monotonicity and Compositionality: Operationalizing the Laws
Monotonicity captures the requirement that reasoning compute and accuracy respond consistently to ordered complexity:
- If K(q1) ≤ K(q2), then C(q1) ≤ C(q2) and Acc(q1) ≥ Acc(q2).
Compositionality (Independence) demands additive compute and multiplicative accuracy for independent tasks:
- For independent q1, q2 (disjoint concept sets) and composite q1 ∘ q2: C(q1 ∘ q2) = C(q1) + C(q2) and Acc(q1 ∘ q2) = Acc(q1) · Acc(q2), i.e., log-accuracy is additive.
Under mild regularity, strict monotonicity and compositionality jointly imply the compute and accuracy laws (see Propositions 3.1 & 3.2 in (Zhang et al., 19 Dec 2025)).
3. LoRe-Bench Structure and Methodology
LoRe-Bench consists of two complementary sub-benchmarks to assess monotonicity and compositionality:
LoRe-Mono (Monotonicity):
- Domains: Mathematics, science, language, and code.
- Seed Templates: 10 per domain.
- Variants: 30 per seed, generated by stacking primitive operations (e.g., matrix multiplications, string mutations), ensuring that K(q) scales with the variant index.
- Verification: Programmatic generation plus manual spot checks to prevent trivial algorithmic shortcuts.
- Evaluation: For each model and variant, sample 8 outputs to estimate mean compute and accuracy. Compute Spearman rank correlations between the variant index (a proxy for K) and compute C (expected near +1), and between the index and accuracy Acc (expected near −1).
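The monotonicity evaluation reduces to rank correlations; a minimal pure-Python sketch follows, with hypothetical per-variant statistics and no tie handling:

```python
def spearman(xs, ys):
    """Spearman rank correlation via Pearson correlation of ranks (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

variant_idx = [1, 2, 3, 4, 5]
mean_cot_len = [120, 180, 260, 310, 400]   # hypothetical mean CoT lengths
accuracy = [0.95, 0.88, 0.71, 0.60, 0.42]  # hypothetical pass rates (8 samples each)

rho_compute = spearman(variant_idx, mean_cot_len)  # 1.0: compute rises with complexity
rho_acc = spearman(variant_idx, accuracy)          # -1.0: accuracy falls with complexity
```

In practice a library routine with proper tie handling (e.g., `scipy.stats.spearmanr`) would be used; the sketch only shows the quantity being computed.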
LoRe-Compo (Compositionality):
- Data Source: 250 question pairs from MATH500, with pairs drawn from distinct subjects to ensure conceptual independence.
- Composite Construction: Sub-questions concatenated by a fixed phrase (“Answer these in order: ...”).
- Metrics: For f being either compute C or log accuracy log Acc, compute the mean absolute deviation MAD_f = E[ |f(q1 ∘ q2) − (f(q1) + f(q2))| ], and normalize to yield nMAD_f ≥ 0. Smaller nMAD_f indicates stronger compositionality.
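The deviation metric can be sketched in a few lines; the normalization used below (dividing by the mean magnitude of the ideal sums) is an assumed choice, since the paper's exact normalizer is not spelled out here:

```python
def mad(f_comp, f_sub1, f_sub2):
    """Mean absolute deviation of composite values from the sum of sub-values."""
    devs = [abs(c - (a + b)) for c, a, b in zip(f_comp, f_sub1, f_sub2)]
    return sum(devs) / len(devs)

def nmad(f_comp, f_sub1, f_sub2):
    """Normalized MAD; the normalizer (mean |f(q1) + f(q2)|) is an assumption."""
    scale = sum(abs(a + b) for a, b in zip(f_sub1, f_sub2)) / len(f_sub1)
    return mad(f_comp, f_sub1, f_sub2) / scale

# Perfectly compositional compute: composite equals the sub-sum, so nMAD = 0
assert nmad([300.0, 450.0], [120.0, 200.0], [180.0, 250.0]) == 0.0
```

The same function applies to both metrics: pass mean CoT lengths for the compute test, and per-question log-accuracies for the accuracy test.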
Table 1: LoRe-Bench Sub-Benchmark Features
| Sub-benchmark | Domain Coverage | Evaluation Metric |
|---|---|---|
| LoRe-Mono | Math, science, language, code | Spearman ρ (compute/accuracy) |
| LoRe-Compo | Math (independent pairs) | nMAD (compute, log accuracy) |
4. Empirical Assessment of State-of-the-Art Models
Ten open-source LRMs (e.g., Qwen, LLaMA, Phi-4-mini, OpenReasoning, Sky-T1) and two reasoning-length-controlled variants (Thinkless, AdaptThink) were evaluated:
- Monotonicity: Most models reach Spearman correlations approaching 0.99 in mathematics and science. Weak monotonicity (including non-monotonicity) is observed in some domains for smaller models (e.g., Qwen-1.5B in language and code).
- Compositionality: All evaluated models exhibit high nMAD (0.32–0.53 for compute, 0.7–2.4 for log accuracy), with many composite values deviating considerably from the expected sums, indicating both under- and over-allocation of compute on composite tasks. Adaptation mechanisms based solely on inference-time dynamic reasoning length do not yield compositional chains.
- Scatter Characteristics: Scatter plots of composite compute C(q1 ∘ q2) against the sub-sum C(q1) + C(q2) reveal widespread departure from the ideal diagonal, diagnostic of compositionality failures.
5. Enforcing Reasoning Laws via Fine-Tuning
SFT-Compo is a supervised fine-tuning method designed to align composite compute to the sum of compute on sub-questions:
- Triplet Sampling: Generate independent triplets (q1, q2, q1 ∘ q2) using DeepScaler; sample chain-of-thought (CoT) + answer outputs from a strong teacher (DeepSeek-14B).
- Selection: For each triplet whose answers are all correct, select the CoTs that minimize the length discrepancy between the composite chain and the summed sub-chains.
- Supervision: Collect the resulting supervision set (approx. 3.9k examples).
- Fine-tuning: Train the student LRM on this set for 5 epochs with standard cross-entropy, batch size 16, and a grid search over learning rates. No additional regularization is applied.
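The length-matching selection step can be sketched as follows; the candidate pools and the use of character count as the length measure are illustrative assumptions, standing in for whatever compute measure the paper uses:

```python
from itertools import product

def select_triplet(comp_cands, sub1_cands, sub2_cands):
    """Among correct-answer CoT candidates, pick the (composite, sub1, sub2)
    triple minimizing |len(composite) - (len(sub1) + len(sub2))|.
    Character count stands in for the compute measure (an assumption)."""
    return min(
        product(comp_cands, sub1_cands, sub2_cands),
        key=lambda t: abs(len(t[0]) - (len(t[1]) + len(t[2]))),
    )
```

Selecting for minimal discrepancy biases the supervision set toward composite chains whose compute already equals the sub-sum, which is exactly the compositional behavior SFT-Compo is meant to instill.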
6. Impact of Compositionality Fine-Tuning on Reasoning Performance
- Compositionality Gains: Post SFT-Compo, the compute nMAD decreases by 40.5% (1.5B) and 22.5% (8B), with composite vs. sub-sum compute converging to the ideal diagonal.
- Generalization: Across GSM8K, MATH500, AIME’24/’25, AMC’23, OlympiadBench, pass@1 accuracy uniformly improves: +4.8 percentage points (1.5B) and +5.0 (8B), consistently outperforming vanilla SFT based on randomly sampled correct rationales.
- Law Synergies: Enforcing compositionality not only improves compositionality metrics but also enhances monotonicity (e.g., Qwen-1.5B Spearman ρ in code rises from 0.151 to 0.914) and reduces the log-accuracy nMAD by 71.1% (1.5B) and 35.4% (7B). This suggests that enforcing one law can reinforce others, advancing models toward theoretically ideal reasoning patterns.
7. Significance and Conclusion
LoRe-Bench demonstrates that abstract, theoretically principled laws of reasoning can be instantiated in practical, reproducible benchmarks for large reasoning models. By identifying monotonicity and compositionality as tractable, diagnostic properties, and showing that targeted fine-tuning along these axes yields measurable improvements in both internal compute allocation and external problem-solving accuracy, LoRe-Bench bridges foundational theory and applied model assessment (Zhang et al., 19 Dec 2025). A plausible implication is that further advances in LRM architectures and training regimes may benefit from explicit alignment with LoRe-style behavioral laws, offering a path forward for the systematic development of more predictable and robust reasoning models.