LIMO Hypothesis: Less-Is-More Reasoning

Updated 2 August 2025
  • LIMO Hypothesis is a framework asserting that sophisticated reasoning in LLMs emerges from a minimal set of high-quality cognitive templates, not large-scale fine-tuning.
  • Empirical studies demonstrate that models fine-tuned with curated reasoning chains achieve superior accuracy and generalization across benchmarks like AIME24 and MATH500.
  • The approach emphasizes data efficiency and precise exemplar selection, challenging traditional paradigms by leveraging robust pre-training combined with targeted post-training demonstrations.

The LIMO Hypothesis, or Less-Is-More Reasoning Hypothesis, postulates that in LLMs with a comprehensive pre-trained knowledge base, the emergence of sophisticated reasoning capabilities does not depend on mass-scale supervised fine-tuning. Rather, it can be effectively elicited through a minimal set of carefully designed, high-quality instructional examples that serve as "cognitive templates" for reasoning processes. Two central factors determine this threshold: the completeness of pre-training and the effectiveness of these post-training templates, rather than the intrinsic complexity of the task or the scale of fine-tuning data (Ye et al., 5 Feb 2025).

1. Foundational Principles of the LIMO Hypothesis

The LIMO Hypothesis asserts that, provided a foundation model (e.g., an LLM pre-trained on a wide corpus) already encodes the majority of domain-relevant concepts, its ability to perform complex, stepwise reasoning is primarily a function of post-training demonstration quality and strategic exemplar selection. In particular, cognitive processes—such as mathematical deduction chains—are "activated" not by dataset scale, but by exposure to a small number of high-quality demonstrations that exemplify multistep logical inference.

This principle directly challenges the conventional paradigm that associates reasoning performance with the size of the fine-tuning corpus. The LIMO Hypothesis reframes reasoning induction as a process of effective transfer from latent pre-trained knowledge to explicit execution, triggered by a small set of demonstrations.

2. Methodological Approach and Curation of Cognitive Templates

To empirically test the LIMO Hypothesis, an explicit supervised fine-tuning methodology is adopted. A large candidate corpus of mathematical problems is filtered by multi-stage difficulty assessments using prior LLMs to remove trivial or redundant instances. Surviving questions, denoted $q \in \mathcal{Q}$, are mapped to detailed reasoning chains $r \in \mathcal{R}$ with intermediate steps and an answer $a \in \mathcal{A}$, captured as $f: \mathcal{Q} \to (\mathcal{R} \times \mathcal{A})$.
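A minimal Python sketch of this curation pipeline is given below. The `solve_rate` probe, the rejection threshold, and the `generate_chain` helper are hypothetical stand-ins for the paper's multi-stage difficulty assessment, not its actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    question: str   # q in Q
    reasoning: str  # r: detailed chain with intermediate steps
    answer: str     # a in A

def difficulty_filter(questions, baseline_models, solve_rate,
                      max_solve_rate=0.2, trials=8):
    """Keep only questions that every baseline LLM rarely solves.

    `solve_rate(model, question, trials)` is a caller-supplied probe
    returning the fraction of sampled attempts that reach the answer.
    """
    return [
        q for q in questions
        if max(solve_rate(m, q, trials) for m in baseline_models) <= max_solve_rate
    ]

def build_limo_dataset(questions, generate_chain):
    """Map each surviving q to (r, a), i.e. f: Q -> (R x A).

    `generate_chain(q)` returns a (reasoning, answer) pair, e.g. a
    human-vetted solution or a verified strong-model transcript.
    """
    return [Exemplar(q, *generate_chain(q)) for q in questions]
```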

A rule-based scoring framework ranks sample reasoning chains by criteria such as elaboration, self-verification, exploratory phraseology, and adaptive step granularity. Only those chains exhibiting the most faithful analogs of robust human reasoning are retained, forming a focused LIMO dataset with as few as 800 exemplars, representing merely 1% or less of the training data used by alternative fine-tuned models.
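The shape of such a scorer can be illustrated with a simple heuristic; the keyword patterns, weights, and caps below are expository assumptions rather than the paper's exact rubric:

```python
import re

# Markers of self-verification and exploration (illustrative, not exhaustive).
VERIFY = re.compile(r"\b(verify|check|confirm|double-check)\b", re.I)
EXPLORE = re.compile(r"\b(alternatively|suppose|another approach)\b", re.I)

def score_chain(chain: str) -> float:
    """Score a reasoning chain on elaboration, self-verification,
    exploratory phrasing, and step granularity (each capped at 1.0)."""
    steps = [s for s in chain.split("\n") if s.strip()]
    elaboration = min(len(chain) / 2000, 1.0)                 # detail
    verification = min(len(VERIFY.findall(chain)) / 3, 1.0)   # self-checks
    exploration = min(len(EXPLORE.findall(chain)) / 2, 1.0)   # branching
    granularity = min(len(steps) / 15, 1.0)                   # fine steps
    return elaboration + verification + exploration + granularity

def select_top(chains, k=800):
    """Retain the k highest-scoring chains (~800 exemplars in LIMO)."""
    return sorted(chains, key=score_chain, reverse=True)[:k]
```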

Fine-tuning is executed via full-parameter supervised training (on a 32B-parameter LLM) for 15 epochs, leveraging techniques such as DeepSpeed ZeRO-3 and FlashAttention-2, together with a cosine-decay learning-rate schedule (with the warm-up phase skipped), allowing the model to adapt quickly to the high signal-to-noise ratio of the selected cognitive templates.
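In Hugging Face Transformers terms, that setup might look like the sketch below. Only the epoch count, ZeRO-3, FlashAttention-2, and the cosine schedule come from the text; the Qwen2.5-32B-Instruct checkpoint, learning rate, and batch sizes are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Assumed 32B base model, consistent with the text's mention of Qwen2.5.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention-2 kernels
)

args = TrainingArguments(
    output_dir="limo-sft",
    num_train_epochs=15,              # stated in the text
    per_device_train_batch_size=1,    # assumption
    gradient_accumulation_steps=8,    # assumption
    learning_rate=5e-6,               # assumption
    lr_scheduler_type="cosine",       # cosine decay
    warmup_ratio=0.0,                 # warm-up phase skipped
    bf16=True,
    deepspeed="ds_zero3.json",        # path to a DeepSpeed ZeRO-3 config
)
```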

3. Empirical Performance and Generalization

The model fine-tuned with LIMO cognitive templates exhibits superior in-domain and out-of-domain reasoning accuracy compared with models trained on orders of magnitude more examples. On the AIME24 benchmark, LIMO achieves 63.3% accuracy, and on MATH500 achieves 95.6%, in contrast to 6.5% and 59.2% attained by earlier fine-tuned models (e.g., NuminaMath-100k) (Ye et al., 5 Feb 2025).

Out-of-distribution generalization is robust: LIMO outperforms baselines on OlympiadBench, CHMath, Gaokao, Kaoyan, and GradeSchool, exceeding them by an average of 45.8 absolute percentage points. These results establish that strategic curation, rather than brute-force data expansion, produces models with wide generalization and better transfer to novel problems.

4. Comparative Analysis with Data-Intensive Approaches

Contrasted with large-scale fine-tuning approaches, the LIMO paradigm demonstrates that data efficiency is attainable by optimizing exemplar selection and demonstration quality. Models fine-tuned on tens or hundreds of thousands of examples do not achieve comparable performance on in-domain or cross-domain benchmarks. Ablation studies indicate that only the highest-scoring (Level 5) reasoning chains yield top performance, with accuracy diminishing when lower-quality demonstrations are used.

A plausible implication is that diminishing returns—and even negative transfer—may occur when supplementary data lacks the requisite reasoning fidelity, supporting LIMO's central tenet that the granularity and quality of post-training templates are decisive.

5. Factors Governing Reasoning Elicitation

The LIMO Hypothesis recognizes two essential determinants for sophisticated reasoning emergence:

  • Pre-trained Knowledge Base Completeness: Foundation models with richer pre-training (e.g., Qwen2.5 vs. Qwen1.5) yield higher accuracy when subjected to minimal template fine-tuning (e.g., a 54.1% gain on AIME24 with a more comprehensive pre-training regimen).
  • Post-training Template Effectiveness: Cognitive templates that demonstrate elaborate reasoning, self-checking, and adaptive granularity function as highly efficient catalysts for model capability unlocking. Rigorous multi-stage filtering ensures only exemplars that maximize reasoning transparency are propagated during fine-tuning.

6. Implications, Extensions, and Research Trajectory

The LIMO Hypothesis compels a paradigm shift in LLM reasoning research. Rather than focusing on extensive supervised fine-tuning, future directions should concentrate on cognitive template identification, active learning for potent exemplar selection, and data-efficient curriculum design. The approach also motivates investigation into interactions between pre-training data richness and post-hoc template efficacy, potentially extending these lessons to non-mathematical domains.

The LIMO finding further suggests potential benefits for RL-based approaches (Li et al., 17 Feb 2025), where precise sample selection via automated impact metrics (e.g., Learning Impact Measurement) enables even smaller and more impactful RL training sets.
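A hedged sketch of such an impact metric, in the spirit of Learning Impact Measurement, is shown below: each RL training sample is scored by how closely its per-epoch reward trajectory tracks the model's average learning curve, and only well-aligned samples are kept. The correlation form and threshold are assumptions, not the exact formulation of Li et al.:

```python
import numpy as np

def lim_scores(sample_rewards: np.ndarray) -> np.ndarray:
    """Alignment of each sample's reward trajectory with the mean curve.

    sample_rewards: array of shape (n_samples, n_epochs), reward per epoch.
    Returns a per-sample Pearson correlation with the average trajectory.
    """
    reference = sample_rewards.mean(axis=0)                    # learning curve
    centered = sample_rewards - sample_rewards.mean(axis=1, keepdims=True)
    ref_centered = reference - reference.mean()
    num = centered @ ref_centered
    den = np.linalg.norm(centered, axis=1) * np.linalg.norm(ref_centered) + 1e-9
    return num / den

def select_impactful(samples, rewards, threshold=0.6):
    """Keep samples whose learning signal aligns with overall progress."""
    keep = lim_scores(rewards) >= threshold
    return [s for s, kept in zip(samples, keep) if kept]
```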

7. Test-Time Scaling and Inference Optimization

Subsequent analyses (Zeng et al., 17 Feb 2025) employing the LIMO framework address inference-phase reasoning efficiency. Studies reveal that solution accuracy is typically higher for shorter, more succinct chains of thought (CoT), and that methods such as Shortest Majority Vote—which combine candidate answer frequency with brevity—yield higher performance and computational efficiency than majority voting over self-revised, verbose outputs. This reinforces the LIMO Hypothesis's assertion of "less is more" both in training and inference: resource-efficient strategies deliver scalable, accurate reasoning in practice.
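An illustrative implementation of a Shortest-Majority-Vote-style rule follows: answers vote by frequency, and among equally frequent answers the group with the more succinct chains of thought wins. The exact length weighting in Zeng et al. may differ:

```python
from collections import defaultdict

def shortest_majority_vote(candidates):
    """Pick an answer from repeatedly sampled (answer, cot_text) pairs,
    preferring frequent answers and, on ties, shorter reasoning."""
    groups = defaultdict(list)
    for answer, cot in candidates:
        groups[answer].append(len(cot))
    # Rank by (vote count, -mean CoT length): frequent and succinct wins.
    best_answer, _ = max(
        groups.items(),
        key=lambda kv: (len(kv[1]), -sum(kv[1]) / len(kv[1])),
    )
    return best_answer

# Example: two votes for "42" with short chains beat two verbose "41" votes.
print(shortest_majority_vote([
    ("42", "short proof"), ("42", "brief check"),
    ("41", "very long meandering revision " * 20), ("41", "long " * 50),
]))
```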


In summary, the LIMO Hypothesis redefines the data requirements for eliciting complex reasoning in LLMs. It demonstrates—across empirical, comparative, and practical dimensions—that high-quality cognitive template demonstrations, coupled with comprehensive pre-training, are sufficient to activate advanced reasoning, yielding models that are both data- and computation-efficient and excel in accuracy and generalization.
