
LIMO: Curated Math Reasoning Dataset

Updated 2 August 2025
  • The LIMO dataset is a curated collection of 800 advanced math problems, each paired with a detailed reasoning chain, drawn from competitive and academic sources.
  • The dataset is constructed through rigorous automated filtering and manual evaluation, enabling efficient fine-tuning of LLMs on a small set of high-impact examples over 15 epochs.
  • LIMO-trained models achieve notable performance (e.g., 63.3% on AIME24, 95.6% on MATH500) and demonstrate strong out-of-distribution generalization across multiple benchmarks.

The LIMO dataset is a highly curated collection of mathematical problems and detailed reasoning chains designed to probe and elicit advanced mathematical reasoning in LLMs, with a guiding philosophy that emphasizes qualitative impact over sheer scale. Embodying the “Less-Is-More Reasoning Hypothesis,” the LIMO dataset serves as a testbed and training resource for achieving competition-level problem solving and robust generalization from a minimal amount of high-value data.

1. Dataset Construction and Selection Pipeline

The LIMO dataset comprises 800 (question, reasoning chain, answer) triplets, meticulously selected from an initial pool of tens of millions of math problems spanning diverse sources: high school examinations, international competitions (AIME, AMC), established benchmarks (MATH), and major Chinese examinations and benchmarks (e.g., CHMath, Gaokao, Kaoyan) (Ye et al., 5 Feb 2025). Trivial or easily solvable questions were filtered out using the Qwen2.5-Math-7B-Instruct model, after which a stronger model (DeepSeek-R1-Distill-Qwen-32B) made 32 solution attempts per problem. Only questions with consistently low empirical success rates proceeded to manual evaluation.
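
A minimal sketch of this two-stage difficulty filter is shown below; the `solve` callables, the `check_answer` verifier, and the retention threshold are illustrative assumptions, not artifacts from the paper.

```python
def check_answer(model_output: str, reference: str) -> bool:
    # Placeholder verifier; LIMO uses rule-based answer matching.
    return model_output.strip() == reference.strip()

def empirical_success_rate(problem: dict, solve, n_attempts: int) -> float:
    """Sample n_attempts solutions and return the fraction judged correct."""
    correct = sum(
        check_answer(solve(problem["question"]), problem["answer"])
        for _ in range(n_attempts)
    )
    return correct / n_attempts

def filter_pool(problems, weak_solve, strong_solve, max_success=0.25):
    """Two-stage difficulty filter over the raw problem pool.

    Stage 1 drops problems the weaker model (Qwen2.5-Math-7B-Instruct)
    already solves; stage 2 keeps only problems the stronger model
    (DeepSeek-R1-Distill-Qwen-32B) solves rarely across 32 attempts.
    The 0.25 threshold is an assumed stand-in for "consistently low".
    """
    candidates = []
    for p in problems:
        if empirical_success_rate(p, weak_solve, n_attempts=4) > 0:
            continue  # trivial: the weak model can already solve it
        if empirical_success_rate(p, strong_solve, n_attempts=32) <= max_success:
            candidates.append(p)  # hard enough to proceed to manual evaluation
    return candidates
```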

For each candidate, the best reasoning chain was selected based on a rule-based scoring rubric prioritizing:

  • Length of solution (indicating elaboration)
  • Inclusion of validation phrases (self-verification)
  • Presence of exploratory language and connective phrases (adaptive granularity)

This process ensures the LIMO dataset is not just a set of correct answers, but consists of diverse, high-quality exemplars of extended mathematical reasoning—serving as “cognitive templates” for downstream models.
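
One plausible realization of the rubric above is sketched below; the phrase lists, weights, and length normalization are illustrative assumptions rather than the authors' exact rules.

```python
VALIDATION_PHRASES = ("verify", "double-check", "confirm that", "sanity check")
EXPLORATORY_PHRASES = ("alternatively", "suppose", "what if", "notice that")

def rubric_score(chain: str) -> float:
    """Rule-based score favoring elaborated, self-verifying, exploratory chains."""
    text = chain.lower()
    length_score = min(len(text.split()) / 2000, 1.0)            # elaboration
    verify_score = sum(p in text for p in VALIDATION_PHRASES)    # self-verification
    explore_score = sum(p in text for p in EXPLORATORY_PHRASES)  # adaptive granularity
    return 2.0 * length_score + verify_score + explore_score

def best_chain(chains: list[str]) -> str:
    """Select the highest-scoring candidate reasoning chain for a problem."""
    return max(chains, key=rubric_score)
```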

2. Training Methodology

LIMO’s central claim is that, for foundation models with a comprehensive pre-trained knowledge base, sophisticated mathematical reasoning can be efficiently induced with a minimal set of highly informative training demonstrations. The fine-tuning methodology is full-parameter supervised fine-tuning (SFT) of Qwen2.5-32B-Instruct directly on the 800 triplets, using DeepSpeed ZeRO-3, FlashAttention-2, cosine-decay scheduling, a learning rate of 5×10⁻⁶, no warmup, 15 epochs, and a batch size of 64 (Ye et al., 5 Feb 2025).
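
The reported hyperparameters map onto a standard Hugging Face `Trainer` configuration roughly as follows; the config file name, dtype, and batch-size decomposition are assumptions, since the paper reports only the aggregate settings.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    attn_implementation="flash_attention_2",  # FlashAttention-2
    torch_dtype=torch.bfloat16,               # dtype assumed
)

args = TrainingArguments(
    output_dir="limo-sft",
    learning_rate=5e-6,               # as reported
    num_train_epochs=15,              # as reported
    lr_scheduler_type="cosine",       # cosine decay
    warmup_ratio=0.0,                 # no warmup
    per_device_train_batch_size=1,    # effective batch size 64, assumed to be
    gradient_accumulation_steps=8,    # reached via accumulation x data parallelism
    deepspeed="ds_zero3.json",        # DeepSpeed ZeRO-3 (config file assumed)
    bf16=True,
)

# Paired with a standard Trainer over the 800 LIMO triplets (dataset name assumed):
# trainer = Trainer(model=model, args=args, train_dataset=limo_dataset)
# trainer.train()
```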

Unlike approaches reliant on massive SFT datasets (≥ 100,000 samples), LIMO demonstrates that carefully curated, strategically selected (“showcase”) examples suffice to elicit extended chain-of-thought (CoT) reasoning in LLMs. The training objective is not rote memorization, but the absorption of modular, meta-computational “cognitive templates.”

3. Benchmark Performance and Evaluation Protocol

LIMO-trained models were evaluated on challenging mathematical reasoning benchmarks. Notable performance metrics include:

  • 63.3% accuracy on AIME24
  • 95.6% accuracy on MATH500

These results use the pass@1 metric, defined as:

\[
\text{pass@1} = \frac{\text{Number of Correct Responses}}{\text{Total Number of Questions}} \times 100\%
\]

For small benchmarks (fewer than 50 items), 4 samples per question were generated at temperature 0.6 and unbiased pass@1 was computed; for larger benchmarks, greedy decoding (one sample per item) was used. Rule-based and LLM-based evaluators handled answer validation, accommodating both numeric and complex answer formats.
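
A small helper reflecting this protocol; for k = 1 the unbiased estimator 1 − C(n−c, 1)/C(n, 1) reduces to c/n, and greedy decoding is simply the n = 1 case.

```python
def pass_at_1(n_correct: int, n_samples: int) -> float:
    """Unbiased pass@1 for one question: expected accuracy of a single
    uniformly drawn sample, i.e., 1 - C(n-c, 1)/C(n, 1) = c/n."""
    return n_correct / n_samples

def benchmark_pass_at_1(results: dict[str, tuple[int, int]]) -> float:
    """Average pass@1 (in %) over a benchmark.

    `results` maps question id -> (correct count, samples drawn):
    4 samples at temperature 0.6 for small benchmarks, a single
    greedy sample for larger ones.
    """
    return 100.0 * sum(c / n for c, n in results.values()) / len(results)
```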

4. Out-of-Distribution Generalization

LIMO-trained models were further tested on a suite of out-of-distribution (OOD) datasets, including OlympiadBench, CHMath, Gaokao, Kaoyan, GradeSchool, and multi-disciplinary challenge sets such as Minerva and GPQA. In these settings, LIMO models exhibited strong generalization, outperforming both baseline models and models trained with more than 100× the data. In some cases, LIMO-based fine-tuning yielded improvements of as much as 45.8 absolute percentage points over competitors, reinforcing the claim that the quality and representativeness of “cognitive templates” outweigh sheer data scale in eliciting systematic reasoning (Ye et al., 5 Feb 2025).

5. Theoretical Framework: Less-Is-More Reasoning Hypothesis

The central hypothesis underpinning the LIMO dataset asserts:

  • For LLMs with a sufficiently comprehensive latent knowledge base, the threshold for sophisticated reasoning is governed by (I) the extent of domain knowledge acquired during pre-training and (II) the informativeness of post-training demonstrations.
  • “Cognitive templates”—a small set of quality-assured, representative stepwise solutions—are sufficient to induce robust, transferable reasoning capabilities.

This hypothesis shifts the paradigm of model alignment and task adaptation toward demonstration quality, challenging the assumption that complex reasoning skills require massive data and promoting a cost-effective route to advanced performance in data-constrained settings.

6. Implications, Applications, and Future Directions

The LIMO dataset embodies a shift away from scale-driven training toward strategic sample selection and data efficiency:

  • It is particularly well-suited for developing mathematical and scientific reasoning systems in domains where annotated data is limited.
  • The approach allows highly efficient adaptation and rapid deployment for specialized applications (e.g., educational technologies, scientific research assistants), with substantial reductions in environmental and computational costs.
  • The methodology encourages future work in intelligent data selection (e.g., active learning, impact-based filtering (Li et al., 17 Feb 2025)), suggesting that “impact ranking” of training problems may further enhance outcomes.

Open questions remain on generalizing these principles to other domains (e.g., programming, legal reasoning), the interaction between template diversity and transfer, and the limits imposed by the coverage of the base model’s pre-training.

7. Practical Considerations for Dataset Use

When applying the LIMO dataset, several evaluation and usage considerations arise:

  • Solution length should be interpreted carefully: empirical evidence indicates that correct solutions are, on average, shorter than incorrect ones (Zeng et al., 17 Feb 2025). This motivates post-processing protocols that balance reasoning thoroughness with conciseness, for instance aggregation schemes such as Shortest Majority Vote (see the sketch after this list).
  • Incorporating impact-based data selection techniques (e.g., LIMR (Li et al., 17 Feb 2025)) could further distill the dataset by identifying and prioritizing only the most educationally beneficial exemplars, increasing training efficiency and downstream transfer even further.
  • The relevance and applicability of the dataset are amplified for model configurations with extensive mathematical and logical prior training; models with less comprehensive pre-training may benefit less from such data-efficient schemes.
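
As a concrete illustration of the aggregation idea referenced above, the sketch below implements one plausible reading of Shortest Majority Vote: cluster sampled solutions by final answer, prefer the answer with the most votes (tie-breaking toward shorter clusters), and return its shortest supporting solution. The exact weighting in Zeng et al. may differ, and final-answer extraction is assumed to happen upstream.

```python
from collections import defaultdict

def shortest_majority_vote(solutions: list[tuple[str, str]]) -> str:
    """Aggregate sampled (final_answer, full_solution) pairs into one solution."""
    groups: dict[str, list[str]] = defaultdict(list)
    for answer, solution in solutions:
        groups[answer].append(solution)

    def group_key(answer: str) -> tuple[int, float]:
        sols = groups[answer]
        mean_len = sum(len(s) for s in sols) / len(sols)
        return (len(sols), -mean_len)  # more votes first, then shorter solutions

    winner = max(groups, key=group_key)
    return min(groups[winner], key=len)  # shortest chain supporting the winner
```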

In summary, the LIMO dataset demonstrates that, for models with sufficient domain pre-training, a small, expertly curated set of reasoning exemplars suffices to invoke advanced, transferable mathematical reasoning. This resource establishes new best practices for dataset design and model alignment in reasoning-intensive domains.
