MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining (2505.07608v2)

Published 12 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We present MiMo-7B, a LLM born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.

Summary

  • The paper introduces MiMo-7B, a 7B parameter model optimized for reasoning through enhanced pretraining and reinforcement learning posttraining.
  • It details a three-stage data mixture strategy and architectural innovations such as Multi-Token Prediction to improve reasoning performance.
  • Post-training with refined RL techniques, including a test difficulty-driven reward system, yields superior results on math and code benchmarks.

This paper introduces MiMo-7B, a 7B parameter LLM specifically designed and optimized for reasoning tasks throughout its pre-training and post-training phases. The authors aim to unlock the reasoning potential of smaller models, traditionally considered less capable in this area than larger counterparts.

The core approach involves enhancing reasoning capabilities at two stages:

  1. Pre-training: Building a strong base model with inherent reasoning potential.
  2. Post-training: Further boosting reasoning performance through reinforcement learning (RL).

Pre-training Strategies (MiMo-7B-Base)

The pre-training phase focuses on curating a high-quality, reasoning-rich dataset and incorporating architectural enhancements.

  • Data Construction:
    • Optimized HTML extraction tools are developed to better preserve mathematical equations and code snippets from web pages, increasing the density of reasoning patterns.
    • Global deduplication using URLs and MinHash is performed efficiently at corpus scale (a MinHash sketch follows this list).
    • Multi-dimensional filtering, using small LLMs as quality taggers instead of traditional rule-based heuristics, helps retain high-quality data often misclassified by standard filters.
    • Extensive synthetic reasoning data is generated by prompting advanced models on STEM content, math/code problems, and creative writing tasks. The synthetic data is found to remain suitable for training over multiple epochs.
    • A three-stage data mixture strategy is employed over 25 trillion tokens:
      • Stage 1: General mixture with downsampling of low-quality content and upsampling of professional domains.
      • Stage 2: Significant increase (to ~70%) in mathematics and code-related data, trained with an 8,192-token context.
      • Stage 3: Incorporation of ~10% synthetic reasoning data and extension of context length to 32,768 tokens.
  • Model Architecture:
    • Based on the standard decoder-only Transformer with features like GQA, pre-RMSNorm, SwiGLU, and RoPE.
    • Multi-Token Prediction (MTP) is added as an auxiliary training objective, with a single MTP layer used during pre-training (a minimal sketch follows this list). For inference speedup via speculative decoding, this layer is replicated after pre-training and fine-tuned while the main model is frozen. High acceptance rates (90% for the first MTP layer, >75% for the third) make MTP effective for faster generation, especially for long reasoning outputs.
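
The paper does not spell out its deduplication implementation; the following is a minimal, standard-library sketch of the MinHash idea referenced in the data-construction bullet above (word shingling, per-seed minimum hashes, Jaccard estimation). All function names and parameters here are illustrative, not from the paper.

```python
import hashlib

def shingles(text: str, k: int = 5):
    """Yield overlapping k-word shingles of a document."""
    words = text.split()
    for i in range(max(len(words) - k + 1, 1)):
        yield " ".join(words[i:i + k])

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """Approximate the document's shingle set by the minimum hash value
    under num_hashes differently seeded hash functions."""
    sig = []
    for seed in range(num_hashes):
        best = None
        for sh in shingles(text):
            h = int.from_bytes(
                hashlib.blake2b(f"{seed}:{sh}".encode(), digest_size=8).digest(),
                "big",
            )
            best = h if best is None else min(best, h)
        sig.append(best if best is not None else 0)
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Documents whose estimated similarity exceeds a threshold (e.g. 0.8)
# would be treated as near-duplicates and collapsed to a single copy.
```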
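
The exact MTP layer design is not reproduced in this summary; below is a minimal PyTorch sketch of one plausible form, in which a single extra transformer block predicts the token two positions ahead and contributes a small weighted cross-entropy term on top of the main next-token loss. `MTPHead`, `mtp_aux_loss`, and the loss weight are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Hypothetical single MTP layer: one extra causal transformer block and
    an output projection that predicts the token two steps ahead, while the
    main model keeps predicting the next token as usual."""

    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.proj = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        t = hidden.size(1)
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=hidden.device), diagonal=1
        )
        return self.proj(self.block(hidden, src_mask=causal))  # [B, T, vocab]

def mtp_aux_loss(hidden: torch.Tensor, tokens: torch.Tensor,
                 mtp_head: MTPHead, weight: float = 0.1) -> torch.Tensor:
    """Auxiliary cross-entropy: position t predicts token t+2; added to the
    main next-token loss with a small (assumed) weight."""
    logits = mtp_head(hidden)[:, :-2, :]
    targets = tokens[:, 2:]
    return weight * F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```

At inference time, the same head serves as the draft model for speculative decoding, which is where the reported acceptance rates come from.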

Evaluation of MiMo-7B-Base shows its superior reasoning potential compared to other open-source models of comparable size (7B-9B) and even some larger models (32B baseline) on benchmarks like BBH, SuperGPQA, LiveCodeBench, and AIME. The Pass@k metric highlights its higher capability boundary. It also demonstrates strong long-context comprehension on the RULER benchmark within its 32K context window.
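
Since Pass@k is the metric used to probe the capability boundary, it may help to recall the standard unbiased estimator (Chen et al., 2021); the snippet below assumes that convention rather than restating the paper's exact evaluation setup.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 32 generations per problem, 8 of them correct, estimate pass@8
print(round(pass_at_k(n=32, c=8, k=8), 3))
```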

Post-training Strategies (MiMo-7B-RL)

The post-training phase leverages Reinforcement Learning to further refine the model's reasoning abilities.

  • Supervised Fine-Tuning (SFT): An SFT version of MiMo-7B-Base is created from a curated dataset of ~500K samples, filtered for quality and diversity and decontaminated against evaluation benchmarks. This SFT model serves as the starting point for the final MiMo-7B-RL.
  • RL Data Curation:
    • A dataset of 130K verifiable mathematics (100K) and code (30K) problems is collected.
    • Rigorous cleaning involves filtering out problems that cannot be verified, removing those easily solved by advanced models (pass rate > 90% in 16 rollouts of an SFT model), and deduplicating against evaluation benchmarks.
    • An online judge environment is developed for efficient, parallel execution of potentially hundreds of test cases per code problem during reward computation.
  • Reward Function:
    • Only rule-based accuracy rewards are used (Math-Verify for math, test cases for code).
    • For code problems, a Test Difficulty Driven Reward is introduced to alleviate sparse rewards on difficult tasks. Inspired by IOI scoring, test cases are grouped by difficulty based on their pass rates across multiple model rollouts.
      • Strict Reward: Score granted only if all tests in a difficulty group and in all easier groups pass.
      • Soft Reward: Each group's total score is distributed among its tests, and the final reward is the sum of scores for all passed tests. The soft scheme proves more effective in experiments (a sketch of both schemes follows this list).
  • RL Training Recipe:
    • A modified version of Group Relative Policy Optimization (GRPO) is used.
    • Enhancements from recent research are incorporated: removal of the KL loss, Dynamic Sampling (over-sampling and filtering out prompts with pass rate 0 or 1), and Clip-Higher (increasing $\varepsilon_{\mathrm{high}}$ in the clipping objective; the resulting objective is written out after this list).
    • Challenges addressed during training:
      • Sparse Rewards for Code: Handled by the Test Difficulty Driven Reward (detailed above).
      • Diminishing Sampling Efficiency: Handled by an Easy Data Filter and Re-Sampling strategy. Problems solved perfectly (pass rate 1) are stored in an easy-data pool, and during rollouts there is a probability ($\alpha = 10\%$) of sampling from this pool, which stabilizes policy updates and maintains efficiency in later training stages (a sampling sketch follows this list).
  • RL Infrastructures:
    • Seamless Rollout Engine: Developed to accelerate RL training and validation by minimizing GPU idle time. Components include:
      • Continuous Rollout: Proactively handles completed tasks and initiates new ones without synchronization barriers.
      • Asynchronous Reward Computation: Uses Ray to run reward computation (which is especially slow for code) concurrently with rollout, with dedicated servers for code judging (a minimal sketch follows this list).
      • Early Termination: When enough valid samples have been collected, ongoing tasks are wound down using a FIFO strategy that preserves the data distribution and avoids abruptly cutting off long sequences. Experimental results show significant speedups (2.29× training, 1.96× validation) and reduced GPU idle time.
    • vLLM-based Inference Engine: vLLM is used for inference within the RL system. MTP support is implemented, and robustness for the external launch mode is enhanced (e.g., better KVCache consistency).
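
The test-difficulty-driven reward is described above only at a high level; the sketch below illustrates one way the strict and soft schemes could be computed once test cases have been grouped by difficulty. The grouping rule, the number of groups, and the per-group score are illustrative assumptions, not values from the paper.

```python
def group_tests_by_difficulty(pass_rates: dict, n_groups: int = 5) -> list:
    """Hypothetical grouping: bucket test cases by how often prior model
    rollouts passed them, from easiest (highest pass rate) to hardest."""
    ranked = sorted(pass_rates, key=pass_rates.get, reverse=True)
    size = max(len(ranked) // n_groups, 1)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

def strict_reward(groups: list, passed: set, group_score: float = 1.0) -> float:
    """A group's score is granted only if every test in that group AND in all
    easier groups passes (IOI-style subtasks)."""
    reward, prefix_ok = 0.0, True
    for group in groups:                      # groups ordered easy -> hard
        prefix_ok = prefix_ok and all(t in passed for t in group)
        if prefix_ok:
            reward += group_score
    return reward

def soft_reward(groups: list, passed: set, group_score: float = 1.0) -> float:
    """Each group's score is split evenly over its tests and every passed test
    contributes, giving a denser signal on hard problems."""
    return sum(
        (group_score / len(group)) * sum(t in passed for t in group)
        for group in groups
    )
```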
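
For reference, with the KL term removed and asymmetric clipping, the modified GRPO objective takes roughly the following form; the notation follows the common GRPO formulation, and the paper's exact loss may differ in details such as normalization.

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{q,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\Big)\right]
$$

$$
r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})},\qquad \hat{A}_{i,t}=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}
$$

Setting $\varepsilon_{\mathrm{high}} > \varepsilon_{\mathrm{low}}$ gives low-probability tokens more room to increase their probability, which is the usual motivation for Clip-Higher.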
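
The easy-data re-sampling strategy reduces to a small amount of bookkeeping; here is a minimal sketch under the $\alpha = 10\%$ setting mentioned above (pool names and function signatures are illustrative).

```python
import random

def update_pools(problem, pass_rate: float, active_pool: list, easy_pool: list):
    """Problems the current policy solves perfectly are moved to the easy pool."""
    if pass_rate == 1.0 and problem in active_pool:
        active_pool.remove(problem)
        easy_pool.append(problem)

def sample_rollout_batch(active_pool: list, easy_pool: list,
                         batch_size: int, alpha: float = 0.10) -> list:
    """With probability alpha, draw a previously solved problem to stabilize
    policy updates and maintain sampling efficiency in later training stages;
    otherwise draw from the remaining (harder) training pool."""
    batch = []
    for _ in range(batch_size):
        if easy_pool and random.random() < alpha:
            batch.append(random.choice(easy_pool))
        else:
            batch.append(random.choice(active_pool))
    return batch
```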
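
The asynchronous reward computation pattern can be illustrated with a few lines of Ray; this is a hedged sketch of the futures-based pattern, not the paper's actual infrastructure, and `run_tests` is a hypothetical placeholder for the sandboxed code judge.

```python
import ray

ray.init(ignore_reinit_error=True)

def run_tests(completion: str, test_cases: list) -> float:
    """Hypothetical judge: fraction of test cases the completion passes.
    A real system would execute code in a sandbox; this is a placeholder."""
    return 0.0

@ray.remote
def compute_code_reward(completion: str, test_cases: list) -> float:
    # The slow part (executing test cases) runs as a remote task so rollout
    # can continue while rewards are being computed.
    return run_tests(completion, test_cases)

def rewards_for_batch(completions: list, test_suites: list) -> list:
    # Launch all reward computations concurrently and collect the futures.
    futures = [
        compute_code_reward.remote(c, t) for c, t in zip(completions, test_suites)
    ]
    return ray.get(futures)
```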

Post-training Evaluation Results

MiMo-7B-RL is evaluated against strong models, including OpenAI o1-mini, on various reasoning and general benchmarks.

  • MiMo-7B-RL achieves state-of-the-art performance among comparable size models on mathematics (MATH500, AIME) and coding (LiveCodeBench v5/v6) tasks.
  • It outperforms OpenAI o1-mini on AIME 2025 (55.4 vs 50.7), LiveCodeBench v5 (57.8 vs 53.8), and LiveCodeBench v6 (49.3 vs 46.8), demonstrating robust algorithmic code generation.
  • It maintains competitive performance on general benchmarks like GPQA, SuperGPQA, and DROP.
  • Comparison of MiMo variants shows that RL from the base model (MiMo-7B-RL-Zero) has strong growth potential, but RL from the SFT model (MiMo-7B-RL) achieves a higher overall performance ceiling.

Discussion and Practical Insights

The authors share lessons learned from their post-training experiments:

  • A "light-weight" SFT focused only on format alignment is insufficient; a "heavier" SFT is necessary for better initial alignment and higher final performance ceiling with RL.
  • Balancing performance across domains (math and code) during RL can be challenging, especially with a base model that might "hack" rewards on certain tasks (e.g., math). High-quality math problems are crucial.
  • Designing effective language mixing penalties is difficult, as legitimate content (math/code) can contain words from other languages, risking unintended penalties or reward hacking.

Conclusion

MiMo-7B demonstrates that it is possible to train a 7B parameter model with exceptional reasoning capabilities by carefully optimizing both pre-training data/objectives and post-training RL strategies. The MiMo-7B-Base model establishes strong reasoning potential, and the MiMo-7B-RL model, trained with novel techniques like test difficulty driven rewards and a seamless rollout engine, achieves leading performance on math and code reasoning benchmarks, surpassing models like OpenAI o1-mini. The authors open-source the MiMo-7B series checkpoints.
