- The paper introduces MiniMax-M1, an open-weight reasoning model that scales test-time compute by pairing Lightning Attention with a hybrid Mixture-of-Experts (MoE) architecture, significantly reducing inference FLOPs.
- It details a three-stage training pipeline: continual pre-training, supervised fine-tuning, and reinforcement learning with a novel algorithm (CISPO) that roughly doubles RL training speed.
- The evaluation shows that MiniMax-M1, with generation lengths extended up to 80K tokens, outperforms comparable models in long-context reasoning and on complex tasks.
This paper introduces MiniMax-M1, an open-weight, large-scale hybrid-attention reasoning model designed to scale test-time compute efficiently. The model builds upon the earlier MiniMax-Text-01 and has 456 billion total parameters, 45.9 billion of which are activated per token through a hybrid Mixture-of-Experts (MoE) architecture with 32 experts. A key innovation is the integration of "Lightning Attention," an I/O-aware linear attention mechanism, which lets M1 natively support a context length of 1 million tokens and significantly reduces inference FLOPs at long generation lengths (e.g., roughly 25% of the FLOPs of DeepSeek R1 at a generation length of 100K tokens). This makes MiniMax-M1 well suited to complex tasks that require extensive reasoning over long inputs. The architecture interleaves seven Transnormer blocks with lightning attention followed by one standard transformer block with softmax attention.
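To make the 7:1 interleaving concrete, here is a minimal sketch in Python; the function name and the 8-block period parameter are illustrative assumptions, not the released implementation:

```python
# Illustrative sketch of the hybrid layout described above: every eighth block
# uses standard softmax attention, the remaining seven use lightning (linear)
# attention. The function name and default period are placeholders.
def hybrid_block_pattern(num_blocks: int, softmax_every: int = 8) -> list[str]:
    """Return the attention type used by each block in the stack."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "lightning"
        for i in range(num_blocks)
    ]


if __name__ == "__main__":
    # First 16 blocks: seven lightning-attention blocks, then one softmax block, twice over.
    print(hybrid_block_pattern(16))
```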
The development of MiniMax-M1 involved several stages:
- Continual Pre-training: The MiniMax-Text-01 model was further pre-trained on an additional 7.5T tokens. This reasoning-intensive corpus prioritized STEM, code, books, and reasoning-related data (70% of the mix), with an emphasis on natural question-answer pairs and semantic deduplication. Training used a constant learning rate followed by decay, adjustments to the MoE auxiliary loss, and a staged context-length extension from 32K to 1M tokens to prevent gradient explosion.
- Supervised Fine-Tuning (SFT): SFT was performed to instill specific Chain-of-Thought (CoT) patterns using high-quality examples covering diverse domains like math, coding, and QA, with math and coding samples constituting about 60% of the SFT data. This provided a strong foundation for the subsequent reinforcement learning phase.
- Reinforcement Learning (RL): This was the core stage for developing M1's reasoning capabilities.
To enhance RL efficiency, the paper proposes CISPO (Clipped IS-weight Policy Optimization), a novel RL algorithm.
- Problem with PPO/GRPO: Traditional PPO/GRPO clip token updates. The authors found that low-probability "fork" tokens in reasoning (e.g., "Recheck", "Aha"), which tend to have high importance-sampling (IS) ratios $r_{i,t}$, were clipped out, hindering learning.
- CISPO's Approach: Instead of clipping token updates, CISPO clips the IS weights themselves, $\mathrm{sg}(\hat{r}_{i,t}(\theta))$, directly in the policy-gradient objective, ensuring that all tokens contribute to gradient computation. The objective function is:
$\mathcal{J}_{\text{CISPO}}(\theta) = \mathbb{E}_{(q,a) \sim \mathcal{D},\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \mathrm{sg}\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_{i,t} \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right],$
where $\hat{r}_{i,t}(\theta) = \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon^{\text{IS}}_{\text{low}},\, 1+\epsilon^{\text{IS}}_{\text{high}}\big)$.
CISPO reuses the group-relative advantage $\hat{A}_{i,t}$ from GRPO and a token-level loss. Empirically, CISPO achieved a 2x speedup over DAPO when training Qwen2.5-32B models, as measured on AIME 2024.
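For illustration, here is a hedged PyTorch-style sketch of this objective; the tensor names, default clipping bounds, and masking convention are assumptions rather than the authors' implementation:

```python
# Hedged sketch of the CISPO objective described above (not the authors' code).
# logp_new requires grad; logp_old, advantages, and mask are fixed tensors with
# one entry per generated token (flattened over the G sampled responses).
import torch


def cispo_loss(logp_new: torch.Tensor,
               logp_old: torch.Tensor,
               advantages: torch.Tensor,
               mask: torch.Tensor,
               eps_low: float = 1.0,
               eps_high: float = 2.0) -> torch.Tensor:
    # eps_low / eps_high are placeholder values for the IS-weight clipping range.
    # Importance-sampling ratio r_{i,t} = pi_theta / pi_theta_old per token.
    ratio = torch.exp(logp_new - logp_old)
    # Clip the IS weight itself and stop its gradient (the sg(.) in the paper),
    # so every token still contributes a policy-gradient term.
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # Token-level objective: clipped weight x group-relative advantage x log-prob,
    # averaged over all generated tokens; negate to obtain a loss to minimize.
    pg_term = clipped_ratio * advantages * logp_new
    return -(pg_term * mask).sum() / mask.sum().clamp(min=1.0)
```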
Several challenges specific to scaling RL with the hybrid lightning attention architecture were addressed:
- Computational Precision Mismatch: A discrepancy between token probabilities under the training-mode and inference-mode kernels was found to stall reward growth. It was traced to high-magnitude activations in the LM output head; the fix was to raise the LM output head to FP32 precision, improving the training/inference probability correlation from roughly 0.9 to roughly 0.99 (see the combined sketch after this list).
- Optimizer Hyperparameter Sensitivity: Because gradient magnitudes spanned a wide range (roughly 1e-18 to 1e-5) and gradients of adjacent iterations were weakly correlated, the AdamW hyperparameters were set to β₁ = 0.9, β₂ = 0.95, and ε = 1e-15.
- Early Truncation via Repetition Detection: To handle pathologically long and repetitive responses, an early truncation rule was implemented: generation stops if 3,000 consecutive tokens each have a probability above 0.99.
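A combined sketch of these three mitigations is given below; the attribute name `model.lm_head`, the learning rate, and the helper signatures are assumptions for illustration only:

```python
# Hedged sketch of the three RL-scaling mitigations described above; attribute
# and function names are placeholders, not the MiniMax-M1 codebase.
import torch


def cast_lm_head_fp32(model):
    # Precision fix: keep the LM output head in FP32 so training-mode and
    # inference-mode token probabilities stay aligned.
    model.lm_head = model.lm_head.float()  # assumes the head is `model.lm_head`
    return model


def make_optimizer(params, lr: float = 1e-6):  # lr is an illustrative value
    # Hyperparameters reported for very small, weakly correlated gradients.
    return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), eps=1e-15)


def should_truncate(token_probs: list[float],
                    window: int = 3000,
                    threshold: float = 0.99) -> bool:
    """Early-truncation rule: stop generating once `window` consecutive tokens
    each exceed `threshold` probability, a signature of degenerate repetition."""
    run = 0
    for p in token_probs:
        run = run + 1 if p > threshold else 0
        if run >= window:
            return True
    return False
```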
RL training utilized a diverse set of data and environments:
- Rule-based Verification Tasks:
- Mathematical Reasoning: ~50K high-quality, competition-level problems, filtered for difficulty and uniqueness.
- Logical Reasoning: ~53K samples across 41 tasks (e.g., cipher, Sudoku) generated using the SynLogic framework with controlled difficulty.
- Competitive Programming: 30K problems from online judges, with LLM-generated test cases where needed.
- Software Engineering: Thousands of samples derived from SWE-bench (GitHub issues/PRs), with a containerized sandbox for execution-based rewards.
- General Domain Tasks (Model-based Feedback): 25K complex samples.
- Tasks with Ground Truth: STEM/factual problems where rule-based checking is hard. A Generative Reward Model (GenRM) was used as a verifier.
- Tasks without Ground Truth: Instruction-following, creative writing. Pairwise comparison with a reference answer was used for reward.
- Addressing GenRM Length Bias: Length bias was monitored continuously online during RL training; if the policy began exploiting it, GenRM recalibration was triggered. RL-side techniques such as reward shaping and value clipping were also used (a minimal monitoring sketch follows this list).
- Curriculum Learning: RL training started with reasoning-intensive, rule-based tasks, then gradually incorporated general domain tasks.
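As referenced above, here is a minimal sketch of one way to monitor reward-model length bias online; the trigger statistic (reward-length correlation) and threshold are assumptions, since the paper's exact criterion is not reproduced here:

```python
# Hedged sketch of online length-bias monitoring for a generative reward model.
# The statistic (Pearson correlation) and the threshold are illustrative choices.
import statistics


def length_bias_score(rewards: list[float], lengths: list[int]) -> float:
    """Correlation between GenRM rewards and response lengths in a batch."""
    return statistics.correlation(rewards, [float(n) for n in lengths])


def needs_recalibration(rewards: list[float], lengths: list[int],
                        threshold: float = 0.5) -> bool:
    # If rewards track response length too closely, the policy may be gaming
    # the reward model by writing longer outputs; flag it for recalibration.
    return abs(length_bias_score(rewards, lengths)) > threshold
```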
The RL process was further scaled to extend the maximum generation length from 40K (MiniMax-M1-40k) to 80K tokens (MiniMax-M1-80k). This involved:
- Data Curation: Prioritizing challenging math/coding problems, downsampling synthetic reasoning data that destabilized long-context RL.
- Length Scaling Strategy: Staged window expansion from 40K to 80K tokens, based on perplexity convergence and output length percentiles.
- Addressing Training Instability: During scaling, "pattern collapse" (garbled text in the later parts of long sequences) occurred because negative samples grew longer faster than positive ones. Mitigations included: (1) early stopping of repetitive patterns, (2) combining sample-level and token-level loss normalization (sketched below), and (3) lowering the gradient-clipping threshold and $\epsilon^{\text{IS}}_{\text{high}}$.
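A minimal sketch of one plausible form of the combined normalization in item (2); the interpolation coefficient and the exact mixing used by the authors are assumptions:

```python
# Hedged sketch of mixing token-level and sample-level normalization of a
# per-token RL loss. The 50/50 mix is an illustrative assumption.
import torch


def mixed_normalized_loss(token_loss: torch.Tensor,
                          mask: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """token_loss, mask: [num_samples, max_len]; mask marks valid tokens."""
    # Token-level: average over every generated token in the batch, so long
    # responses contribute proportionally more tokens to the update.
    token_level = (token_loss * mask).sum() / mask.sum().clamp(min=1.0)
    # Sample-level: average within each response first, then across responses,
    # so very long (often negative) samples cannot dominate the gradient.
    per_sample = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    sample_level = per_sample.mean()
    return alpha * token_level + (1.0 - alpha) * sample_level
```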
The full RL training for MiniMax-M1 was completed in 3 weeks on 512 H800 GPUs, costing approximately $534,700.
Evaluation Results:
MiniMax-M1 models (40k and 80k) were evaluated on various benchmarks.
- The 80k model generally outperformed the 40k model, highlighting the benefit of scaling test-time compute.
- MiniMax-M1 demonstrated strong performance, comparable or superior to models such as DeepSeek-R1 and Qwen3-235B, especially in software engineering (56.0% on SWE-bench for M1-80k), agentic tool use (on TAU-bench, M1-80k surpasses Gemini 2.5 Pro on the airline domain), and long-context understanding (56.2% on OpenAI-MRCR at 1M context for M1-80k, surpassing o3 and Claude 4 Opus).
- On AIME 2024, M1-80k achieved 86.0%.
- The paper shows a strong correlation between accuracy gains and increased response length during RL scaling.
MiniMax-M1 models are publicly released, with support in the vLLM and Transformers frameworks, deployment guides, and a commercial API. The work positions MiniMax-M1 as a strong foundation for next-generation LLM agents thanks to its efficient test-time scaling and long-context capabilities.