MiroMind-M1 Series: Open-Source RLMs
- MiroMind-M1 Series are open-source reasoning language models engineered for complex, multi-step mathematical problem solving.
- They employ a two-stage training pipeline: supervised fine-tuning on 719K problems with verified chain-of-thought solutions, followed by reinforcement learning with verifiable rewards scored by an accurate math verifier.
- Techniques such as an extended 32K-token context, no-packing training, and the CAMPO algorithm improve accuracy and token efficiency, while the full release of data, code, and checkpoints sets a new standard for transparency.
The MiroMind-M1 Series is a collection of open-source reasoning language models (RLMs) developed for advanced mathematical reasoning tasks. Built upon the Qwen-2.5 architecture, these models aim to match or surpass previous open-source RLMs in both accuracy and efficiency while setting a new benchmark for transparency and reproducibility in the field of mathematical reasoning (Li et al., 19 Jul 2025).
1. Model Architecture and Design
The MiroMind-M1 series utilizes the Qwen-2.5 backbone and is available in multiple parameter sizes, notably 7 billion (7B) and 32 billion (32B). The architecture is specifically tailored for handling complex, multi-step reasoning chains required for mathematical problem-solving. Key architectural features include:
- Token Context and Position Handling: Maximum positional coverage is extended up to 32,768 tokens via linear RoPE (Rotary Position Embedding) scaling, allowing the model to process exceptionally long mathematical reasoning chains without context truncation (a configuration sketch follows this list).
- No-Packing Training Regime: Instead of concatenating independently sampled sequences into a single training batch (as in packing or neat-packing), MiroMind-M1 trains on single independent sequences per batch. This prevents cross-sample attention, thereby improving the fidelity of chain-of-thought (CoT) learning (a collator sketch appears at the end of this subsection).
- Support for Supervised and Reinforcement Learning Fine-Tuning: The architecture is optimized for both supervised imitation of reasoning traces and reinforcement learning (RL) with verifiable reward signals, enabling the model to learn both coherent and token-efficient reasoning patterns.
- Logically Compact Chain-of-Thought Output: Design decisions throughout the training and architecture pipeline incentivize the generation of concise, logically valid multi-step reasoning, a property further refined through reinforcement learning.
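As a rough illustration of the context-extension point above, the snippet below shows how linear RoPE scaling might be configured on a Qwen-2.5 checkpoint through the Hugging Face `transformers` interface. The base checkpoint name and the derived scaling factor are assumptions of this sketch, not values reported for MiroMind-M1.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical configuration: extend positional coverage toward 32,768 tokens
# via linear RoPE scaling. The base checkpoint and derived factor are illustrative only.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")

target_positions = 32768
factor = max(1.0, target_positions / config.max_position_embeddings)

config.rope_scaling = {"type": "linear", "factor": factor}
config.max_position_embeddings = target_positions

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", config=config)
```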
This design positions MiroMind-M1 for robust long-form reasoning and grants it the capacity to efficiently parse and generate complex mathematical argumentation on benchmarks requiring extensive logic and precision.
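The no-packing regime can likewise be sketched with a simple collator: each independently sampled sequence is padded on its own rather than concatenated with others, so attention never crosses sample boundaries. The field names and padding scheme below are assumptions of this sketch, not the project's released training code.

```python
import torch

def no_packing_collate(batch, pad_token_id):
    """Pad each independently sampled sequence to the batch's max length.

    Unlike packing, no two training samples share a sequence; padded label
    positions are masked with -100 so they do not contribute to the loss.
    """
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask, labels = [], [], []
    for example in batch:
        pad = max_len - len(example["input_ids"])
        input_ids.append(example["input_ids"] + [pad_token_id] * pad)
        attention_mask.append([1] * len(example["input_ids"]) + [0] * pad)
        labels.append(example["labels"] + [-100] * pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```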
2. Training Methodology
The MiroMind-M1 models are subjected to a two-stage training pipeline:
- Supervised Fine-Tuning (SFT):
- The SFT stage employs a curated dataset of 719,000 math reasoning problems paired with verified chain-of-thought trajectories (MiroMind-M1-SFT-719K).
- Every training instance provides not just an answer, but a meticulously validated, stepwise thought process, teaching the model both the structure and logic of advanced mathematical reasoning via imitation learning.
- Reinforcement Learning with Verifiable Reward (RLVR):
- RLVR further refines reasoning on a subset of 62,000 challenging and precisely verifiable problems (MiroMind-M1-RL-62K).
- Solutions generated during RLVR are passed through a highly accurate mathematical verifier, and only correct chain-of-thought solutions receive a reward signal (a minimal reward sketch follows below).
- This process encourages the exploration and reinforcement of novel, verifiably correct reasoning chains.
A plausible implication is that this two-phase approach, combining curated CoT supervision with rigorous automated solution validation, both imparts strong mathematical priors and incentivizes high-precision logical output.
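A minimal sketch of the verifiable-reward step referenced above: each sampled chain receives a binary reward from the verifier, and an advantage estimate is derived from those rewards. The group-relative normalization shown here is an assumption of this sketch (in the spirit of common group-based policy optimization practice), not necessarily the paper's exact estimator; the `verify` callable is a stand-in for the project's math verifier (a simplified version is sketched in Section 6).

```python
import statistics
from typing import Callable, List

def rlvr_rewards(final_answers: List[str], reference: str,
                 verify: Callable[[str, str], bool]) -> List[float]:
    """Binary verifiable reward: 1.0 only when the verifier accepts the final answer."""
    return [1.0 if verify(answer, reference) else 0.0 for answer in final_answers]

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within the sampled group (illustrative advantage estimate)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled chains for one problem, 2 verified correct.
# rewards    -> [1.0, 0.0, 1.0, 0.0]
# advantages -> [+1.0, -1.0, +1.0, -1.0]
```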
3. Algorithmic Innovations: Context-Aware Multi-Stage Policy Optimization (CAMPO)
A central contribution of the MiroMind-M1 series is the Context-Aware Multi-Stage Policy Optimization (CAMPO) algorithm, integrated during RLVR:
- Length-Progressive Training: CAMPO organizes the RLVR process in discrete stages, where the maximum response length permitted by the model is increased gradually. Early phases restrict output length to enforce brevity and focus, while later stages permit longer, more elaborate reasoning chains.
- Adaptive Repetition Penalty: Each generated chain receives a repetition score (0–1), reflecting the point at which redundant content first appears. Higher repetition triggers an explicit penalty, discouraging unnecessarily verbose or repetitive output (a scoring sketch appears at the end of this section).
- Decoupled Clipping Ratios: Gradient clipping is adapted per stage depending on current response length and solution quality. This yields adaptive, context-dependent learning signals.
- Policy Optimization Objective: CAMPO modifies the standard clipped surrogate objective, written with the policy ratio $r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})$, the advantage estimator $\hat{A}_t$, and the decoupled clipping bounds $\varepsilon_{\mathrm{low}}$, $\varepsilon_{\mathrm{high}}$ noted above:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}\right)\hat{A}_t\right)\right]$$
This ensures that both length constraints and reward assignment are dynamically adapted to the evolving training context.
CAMPO directly addresses the trade-off between reasoning completeness and token efficiency, guiding the model to produce outputs that are both correct and succinct.
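The two CAMPO ingredients described above that lend themselves to a sketch are the length-progressive schedule and the repetition penalty. The stage limits, n-gram size, scoring rule, and penalty weight below are assumptions made for illustration, not the authors' published settings.

```python
from typing import List

# Hypothetical length-progressive schedule: each RL stage raises the response cap.
STAGES = [
    {"max_response_tokens": 8_192,  "rl_steps": 300},
    {"max_response_tokens": 16_384, "rl_steps": 300},
    {"max_response_tokens": 32_768, "rl_steps": 300},
]

def repetition_score(token_ids: List[int], n: int = 4) -> float:
    """Score in [0, 1]: fraction of the response emitted after the first repeated n-gram.

    0.0 means no n-gram ever repeats; values near 1.0 mean redundancy appears early.
    """
    if len(token_ids) < n:
        return 0.0
    seen = set()
    for i in range(len(token_ids) - n + 1):
        gram = tuple(token_ids[i:i + n])
        if gram in seen:
            return (len(token_ids) - i) / len(token_ids)
        seen.add(gram)
    return 0.0

def penalized_reward(verifier_reward: float, token_ids: List[int],
                     penalty_weight: float = 0.5) -> float:
    """Combine the binary verifier reward with the repetition penalty (illustrative weighting)."""
    return verifier_reward - penalty_weight * repetition_score(token_ids)
```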
4. Performance Benchmarks and Evaluation
On standard mathematical reasoning benchmarks, MiroMind-M1 achieves strong results:
| Variant | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| MiroMind-SFT-7B | 60.4 | 45.0 | 94.6 |
| MiroMind-M1-RL-7B | 73.4 | 57.8 | 96.7 |
| MiroMind-M1-RL-32B | 77.5 | 65.6 | 96.4 |

Scores are accuracy (%).
In addition to these absolute scores, RL-trained variants exhibit notable token efficiency. Under various length restrictions, MiroMind-M1-RL models solve problems with substantially shorter, logically dense chains of reasoning compared to prior models.
This suggests that CAMPO and RLVR together encourage the model not just to maximize accuracy, but to do so with parsimonious, verifiable logic chains.
5. Open Source Contributions and Reproducibility
A major motivation behind MiroMind-M1 is to advance open science in RLM development by maximizing transparency and reproducibility. The following resources are publicly released:
- Model Checkpoints: Pretrained and fine-tuned variants (SFT and RL) for 7B and 32B models.
- Datasets: The full MiroMind-M1-SFT-719K and MiroMind-M1-RL-62K collections, including verified CoT traces.
- Training and Evaluation Configurations: Source code for data curation, SFT and RLVR training routines including CAMPO, and the enhanced automatic math verifier system.
- Public Access: Resources are available on platforms such as GitHub and Hugging Face, supporting independent verification and extension (see the loading sketch below).
A plausible implication is that this comprehensive release is intended to facilitate community-driven research, enable direct model replication, and allow for comparative study of reasoning training methodologies.
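As a usage sketch, the released checkpoints should be loadable with standard `transformers` tooling. The repository identifier below is an assumption of this sketch; consult the project's GitHub and Hugging Face pages for the exact released names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the repository identifier is assumed, not confirmed here.
model_id = "miromind-ai/MiroMind-M1-RL-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Prove that the sum of two odd integers is even. Reason step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```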
6. Mathematical Reasoning Capabilities
MiroMind-M1 models demonstrate proficiency at solving intricate, multi-step mathematical problems using explicitly generated chain-of-thought solutions. Features include:
- Parsing Complex Problems: The models decompose tasks into logically structured intermediate steps, as reflected in benchmark evaluations which require token-level precision.
- RL Objective Function: Reward and advantage-based optimization is performed over verified solution chains using policy-ratio terms, following the clipped surrogate objective summarized in Section 3.
- Mathematical Answer Verification: The verifier accounts for mathematical equivalence, constants, formats, and numerical tolerance, ensuring correctness beyond surface-level text matching (a simplified equivalence check is sketched below).
This underscores the suitability of MiroMind-M1 for automated theorem proving, competition-style math problem solving, and other domains where detailed logical explanation is indispensable.
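To make the verifier's behaviour concrete, the sketch referenced above checks answers for numeric agreement within a tolerance and falls back to symbolic equivalence via `sympy`. It is a simplified stand-in, not the project's actual verifier; the normalization rules and tolerance value are assumptions.

```python
import math
import sympy

def answers_match(predicted: str, reference: str, rel_tol: float = 1e-6) -> bool:
    """Check mathematical equivalence beyond surface-level string matching.

    Normalize simple formatting, try numeric comparison with tolerance, then fall
    back to symbolic equivalence. The real MiroMind-M1 verifier handles far more formats.
    """
    def normalize(s: str) -> str:
        return s.replace("\\boxed{", "").replace("}", "").replace("$", "").strip()

    p, r = normalize(predicted), normalize(reference)

    # Numeric comparison with relative tolerance (e.g. "1/2" vs "0.5").
    try:
        return math.isclose(float(sympy.sympify(p)), float(sympy.sympify(r)), rel_tol=rel_tol)
    except (sympy.SympifyError, TypeError, ValueError):
        pass

    # Symbolic fallback: equivalent if the difference simplifies to zero.
    try:
        return sympy.simplify(sympy.sympify(p) - sympy.sympify(r)) == 0
    except (sympy.SympifyError, TypeError):
        return False

print(answers_match("\\boxed{1/2}", "0.5"))    # True
print(answers_match("2*(x+1)", "2*x + 2"))     # True
print(answers_match("3.14159", "2.71828"))     # False
```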
7. Significance and Context within Reasoning LLMs
MiroMind-M1 advances the domain of open-source mathematical reasoning by:
- Establishing reproducible baselines for Qwen-2.5-based models.
- Demonstrating that transparency at every stage—data, code, and evaluation—can coexist with state-of-the-art performance.
- Serving as a foundation for further research into efficient, verified, and context-adaptive reasoning models.
A notable aspect is that the training and verification pipeline is sufficiently documented and released to facilitate meaningful ablation, benchmarking, and application by the broader community.
The comprehensive nature of both the training methodology and open source release positions MiroMind-M1 as a primary reference implementation for future developments in mathematical and logical reasoning using LLMs (Li et al., 19 Jul 2025).