MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization (2507.14683v1)

Published 19 Jul 2025 in cs.CL

Abstract: LLMs have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning LLMs. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.

Summary

  • The paper introduces MiroMind-M1, a family of open-source reasoning models trained in two stages (supervised fine-tuning followed by reinforcement learning with verifiable rewards) and proposes the novel CAMPO algorithm for the RL stage to enhance mathematical reasoning.
  • It demonstrates competitive performance on benchmarks like AIME24, AIME25, and MATH500 while optimizing token efficiency through dynamic repetition penalties.
  • The research emphasizes transparent data curation and reproducibility, establishing MiroMind-M1 as a robust open-source resource for advanced reasoning language models.

MiroMind-M1: Advancements in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Introduction

This paper introduces MiroMind-M1, a suite of open-source reasoning LLMs (RLMs) designed to enhance mathematical reasoning capabilities through a transparent and reproducible research paradigm. Built on the Qwen-2.5 backbone, MiroMind-M1 employs a two-stage training protocol: Supervised Fine-Tuning (SFT) using a carefully curated dataset and Reinforcement Learning with Verifiable Rewards (RLVR). Notably, the paper presents a novel algorithm, Context-Aware Multi-Stage Policy Optimization (CAMPO), aimed at optimizing training efficiency and model performance, particularly in generating token-efficient reasoning paths.
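
The RLVR stage scores each response with a verifiable reward: the final answer is checked programmatically against a known ground truth rather than being judged by a learned reward model. Below is a minimal illustrative sketch of such a check; the function name, the \boxed{} extraction, and the string-normalization step are assumptions for illustration, not the paper's actual verifier.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the response's final boxed answer matches the ground truth, else 0.0.

    Minimal sketch: production verifiers typically use symbolic equivalence
    checks rather than plain string comparison.
    """
    # Take the last \boxed{...} expression as the model's final answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    predicted = matches[-1].strip().replace(" ", "")
    target = ground_truth.strip().replace(" ", "")
    return 1.0 if predicted == target else 0.0
```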

Training Methodology

Supervised Fine-Tuning

The SFT phase leverages a high-quality dataset of 719K mathematical reasoning problems with verified chain-of-thought (CoT) trajectories. The dataset is constructed through rigorous curation involving de-duplication, decontamination, and difficulty-based pruning; the composition of the source corpora is shown in Figure 1.

Figure 1: Initial composition distribution. Big-Math comprises HARP and reformulated machine outputs.
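
As a rough illustration of the curation steps named above, the sketch below applies exact de-duplication, decontamination against benchmark problems, and difficulty-based pruning to a list of problem records. The field names, the hash-based matching, and the pass-rate thresholds are assumptions, not the released pipeline.

```python
from hashlib import sha256

def curate(problems, benchmark_problems, min_difficulty=0.2, max_difficulty=0.9):
    """Toy curation pass: exact de-duplication, decontamination, difficulty pruning.

    `problems` is a list of dicts with 'question' and 'pass_rate' keys
    (field names are illustrative, not the released dataset schema).
    """
    benchmark_hashes = {sha256(q.strip().lower().encode()).hexdigest()
                        for q in benchmark_problems}
    seen, kept = set(), []
    for p in problems:
        h = sha256(p["question"].strip().lower().encode()).hexdigest()
        if h in seen or h in benchmark_hashes:  # duplicate or benchmark leak
            continue
        seen.add(h)
        # Difficulty proxy: keep problems a reference model solves sometimes but not always.
        if min_difficulty <= p["pass_rate"] <= max_difficulty:
            kept.append(p)
    return kept
```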

Significant emphasis is placed on the depth and semantic richness of the CoT traces, which are critical to model performance. Training configurations include a peak learning rate of $5.0 \times 10^{-5}$, a batch size of 128, and a max_position_embeddings limit raised to 32,768 to accommodate long reasoning sequences.
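
For reference, the reported hyperparameters can be collected into a small configuration sketch; fields marked as assumed are not stated in this summary, and the released training configurations remain authoritative.

```python
# Hyperparameters reported above; fields marked "assumed" are illustrative only.
sft_config = {
    "base_model": "Qwen/Qwen2.5-7B",      # assumed 7B backbone variant
    "learning_rate": 5.0e-5,              # peak learning rate
    "global_batch_size": 128,             # batch size
    "max_position_embeddings": 32768,     # extended context for long CoT traces
    "lr_scheduler": "cosine",             # assumed
}
```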

Reinforcement Learning with CAMPO

The CAMPO algorithm introduces a multi-stage training strategy. Initially, the model is constrained to a shorter maximum response length, which is progressively increased during training to accommodate more complex reasoning as it develops.

Figure 2: Inclusion-Exclusion Criteria. Overview of the filtering strategy used to construct the final training dataset.
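
The multi-stage aspect can be viewed as a schedule over the maximum allowed response length: early RL stages cap generations at a short budget, and later stages relax the cap as the policy matures. The stage boundaries and length values in this sketch are illustrative placeholders, not the paper's actual schedule.

```python
def max_response_length(global_step: int,
                        stages=((0, 8192), (2000, 16384), (4000, 32768))) -> int:
    """Return the response-length cap in effect at a given RL training step.

    `stages` maps a starting step to a length cap; both are illustrative values.
    """
    cap = stages[0][1]
    for start_step, stage_cap in stages:
        if global_step >= start_step:
            cap = stage_cap
    return cap

# Example: max_response_length(2500) -> 16384 under the placeholder schedule above.
```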

Additionally, a dynamic repetition penalty is implemented to limit redundancy and ensure training stability. This penalty is calibrated by a repetition critic that assesses the occurrence of repeated motifs in the model's output sequence, optimizing for token efficiency without compromising reasoning quality.
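
One plausible way to realize such a repetition critic is to measure the fraction of repeated n-grams in the generated token sequence and subtract a scaled penalty from the verifiable reward; the formulation below is an assumption for illustration, not the paper's exact penalty.

```python
def repetition_penalty(token_ids, n: int = 20, weight: float = 0.5) -> float:
    """Penalty in [0, weight] proportional to the fraction of repeated n-grams.

    Illustrative repetition critic: a higher value means more of the output
    consists of previously seen n-grams.
    """
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    repeated = len(ngrams) - len(set(ngrams))
    return weight * repeated / len(ngrams)

# Illustrative combination with the verifiable reward:
# reward = verifiable_reward(response, answer) - repetition_penalty(response_token_ids)
```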

Experimental Results

Performance Metrics

MiroMind-M1 demonstrates state-of-the-art or competitive performance among open-source peers on the AIME24, AIME25, and MATH500 benchmarks. On the AIME24 test set, MiroMind-M1-RL-32B achieves an accuracy of 77.5, compared to 77.1 for Skywork-OR1-32B-Preview.

Figure 3: Average token count of model responses conditioned on correct answers.

Moreover, the MiroMind-M1 models maintain superior token efficiency, producing shorter responses for correctly solved problems and retaining strong performance when output lengths are constrained (Figure 3). This efficiency underscores the utility of CAMPO in promoting coherent, concise CoT paths.
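
Token efficiency of this kind is typically quantified by averaging response length only over correctly answered problems, the quantity plotted in Figure 3. The helper below is a small illustrative computation; the record layout is an assumption, not the paper's evaluation code.

```python
def avg_tokens_when_correct(records):
    """Average token count over responses whose final answers were judged correct.

    `records` is an iterable of (num_tokens, is_correct) pairs (illustrative layout).
    """
    correct_lengths = [n for n, ok in records if ok]
    return sum(correct_lengths) / len(correct_lengths) if correct_lengths else 0.0

# Example: avg_tokens_when_correct([(5120, True), (8192, False), (4096, True)]) -> 4608.0
```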

Comparison with Prior Work

The table below summarizes benchmark performance for 7B and 32B models:

| Model | AIME24 | AIME25 | MATH500 |
| --- | --- | --- | --- |
| DeepSeek-R1 | 79.8 | 70.0 | -- |
| MiroMind-M1-RL-32B | 77.5 | 65.6 | 96.4 |
| MiroMind-M1-RL-7B | 73.4 | 57.8 | 96.7 |

MiroMind-M1's performance is competitive with its contemporary, Skywork-OR1-32B-Preview, despite relying primarily on math-specific training data and without additional augmentations such as code-based reasoning examples.

Conclusion

The MiroMind-M1 project provides a robust framework for developing advanced RLMs with a focus on transparency and reproducibility. Through meticulous data curation and the CAMPO training algorithm, MiroMind-M1 stands as a resource-efficient and effective mathematical reasoning model. Future work could broaden the training data beyond mathematics and further optimize the computational strategies used in RL training. The released models, datasets, and configurations are intended to serve as a foundation for continued research on reasoning LLMs in the open-source community.
