- The paper introduces MiniMax-M1, an open-weight reasoning model that scales test-time compute by pairing Lightning Attention with a hybrid Mixture-of-Experts (MoE) architecture, significantly reducing inference FLOPs.
- It details a three-stage training pipeline: continual pre-training, supervised fine-tuning, and reinforcement learning with a novel algorithm (CISPO) that roughly doubles RL training speed.
- The evaluation shows that MiniMax-M1, with generation lengths extended up to 80K tokens, outperforms comparable models in long-context reasoning and on complex tasks.
This paper introduces MiniMax-M1, an open-weight, large-scale hybrid-attention reasoning model designed to scale test-time compute efficiently. The model builds upon the earlier MiniMax-Text-01 and has 456 billion total parameters, 45.9 billion of which are activated per token through a hybrid Mixture-of-Experts (MoE) architecture with 32 experts. A key innovation is the integration of "Lightning Attention," an I/O-aware linear attention mechanism, which lets M1 natively support a context length of 1 million tokens and significantly reduces inference FLOPs at long generation lengths (e.g., roughly 25% of the FLOPs of DeepSeek R1 at a generation length of 100K tokens). This makes MiniMax-M1 well suited to complex tasks that require extensive reasoning over long inputs. The architecture interleaves seven Transnormer blocks with lightning attention followed by one standard transformer block with softmax attention.
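To make the 7:1 interleaving concrete, here is a minimal sketch in Python; the function name and the 8-block period parameter are illustrative assumptions, not the released implementation:

```python
# Illustrative sketch of the hybrid layout described above: every eighth block
# uses standard softmax attention, the remaining seven use lightning (linear)
# attention. The function name and default period are placeholders.
def hybrid_block_pattern(num_blocks: int, softmax_every: int = 8) -> list[str]:
    """Return the attention type used by each block in the stack."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "lightning"
        for i in range(num_blocks)
    ]


if __name__ == "__main__":
    # First 16 blocks: seven lightning-attention blocks, then one softmax block, twice over.
    print(hybrid_block_pattern(16))
```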
The development of MiniMax-M1 involved several stages:
- Continual Pre-training: The MiniMax-Text-01 model was further pre-trained on an additional 7.5T tokens. This reasoning-intensive corpus prioritized STEM, code, books, and reasoning-related data (70% of the mix), with an emphasis on natural question-answer pairs and semantic deduplication. Training used a constant learning rate followed by decay, adjustments to the MoE auxiliary loss, and a staged context-length extension from 32K to 1M tokens to prevent gradient explosion.
- Supervised Fine-Tuning (SFT): SFT was performed to instill specific Chain-of-Thought (CoT) patterns using high-quality examples covering diverse domains like math, coding, and QA, with math and coding samples constituting about 60% of the SFT data. This provided a strong foundation for the subsequent reinforcement learning phase.
- Reinforcement Learning (RL): This was the core stage for developing M1's reasoning capabilities.
To enhance RL efficiency, the paper proposes CISPO (Clipped IS-weight Policy Optimization), a novel RL algorithm.
- Problem with PPO/GRPO: Traditional PPO/GRPO clip token updates. The authors found that low-probability "fork" tokens in reasoning (e.g., "Recheck", "Aha"), which tend to have high importance-sampling (IS) ratios $r_{i,t}$, were clipped out, hindering learning.
- CISPO's Approach: Instead of clipping token updates, CISPO clips the IS weights themselves, $\mathrm{sg}(\hat{r}_{i,t}(\theta))$, directly in the policy-gradient objective, ensuring that all tokens contribute to gradient computation. The objective function is:
$\mathcal{J}_{\text{CISPO}}(\theta) = \mathbb{E}_{(q,a) \sim \mathcal{D},\, \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \mathrm{sg}\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_{i,t} \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right],$
where $\hat{r}_{i,t}(\theta) = \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon^{\text{IS}}_{\text{low}},\, 1+\epsilon^{\text{IS}}_{\text{high}}\big)$.
CISPO reuses the group-relative advantage $\hat{A}_{i,t}$ from GRPO and a token-level loss. Empirically, CISPO achieved a 2x speedup over DAPO when training Qwen2.5-32B models, as measured on AIME 2024.
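For illustration, here is a hedged PyTorch-style sketch of this objective; the tensor names, default clipping bounds, and masking convention are assumptions rather than the authors' implementation:

```python
# Hedged sketch of the CISPO objective described above (not the authors' code).
# logp_new requires grad; logp_old, advantages, and mask are fixed tensors with
# one entry per generated token (flattened over the G sampled responses).
import torch


def cispo_loss(logp_new: torch.Tensor,
               logp_old: torch.Tensor,
               advantages: torch.Tensor,
               mask: torch.Tensor,
               eps_low: float = 1.0,
               eps_high: float = 2.0) -> torch.Tensor:
    # eps_low / eps_high are placeholder values for the IS-weight clipping range.
    # Importance-sampling ratio r_{i,t} = pi_theta / pi_theta_old per token.
    ratio = torch.exp(logp_new - logp_old)
    # Clip the IS weight itself and stop its gradient (the sg(.) in the paper),
    # so every token still contributes a policy-gradient term.
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # Token-level objective: clipped weight x group-relative advantage x log-prob,
    # averaged over all generated tokens; negate to obtain a loss to minimize.
    pg_term = clipped_ratio * advantages * logp_new
    return -(pg_term * mask).sum() / mask.sum().clamp(min=1.0)
```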
Several challenges specific to scaling RL with the hybrid lightning attention architecture were addressed:
- Computational Precision Mismatch: A discrepancy between token probabilities under the training-mode and inference-mode kernels was found to stall reward growth. It was traced to high-magnitude activations in the LM output head; the fix was to raise the LM output head to FP32 precision, improving the training/inference probability correlation from roughly 0.9 to roughly 0.99 (see the combined sketch after this list).
- Optimizer Hyperparameter Sensitivity: Because gradient magnitudes spanned a wide range (roughly 1e-18 to 1e-5) and gradients of adjacent iterations were weakly correlated, the AdamW hyperparameters were set to β₁ = 0.9, β₂ = 0.95, and ε = 1e-15.
- Early Truncation via Repetition Detection: To handle pathologically long and repetitive responses, an early truncation rule was implemented: generation stops if 3,000 consecutive tokens each have a probability above 0.99.
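A combined sketch of these three mitigations is given below; the attribute name `model.lm_head`, the learning rate, and the helper signatures are assumptions for illustration only:

```python
# Hedged sketch of the three RL-scaling mitigations described above; attribute
# and function names are placeholders, not the MiniMax-M1 codebase.
import torch


def cast_lm_head_fp32(model):
    # Precision fix: keep the LM output head in FP32 so training-mode and
    # inference-mode token probabilities stay aligned.
    model.lm_head = model.lm_head.float()  # assumes the head is `model.lm_head`
    return model


def make_optimizer(params, lr: float = 1e-6):  # lr is an illustrative value
    # Hyperparameters reported for very small, weakly correlated gradients.
    return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), eps=1e-15)


def should_truncate(token_probs: list[float],
                    window: int = 3000,
                    threshold: float = 0.99) -> bool:
    """Early-truncation rule: stop generating once `window` consecutive tokens
    each exceed `threshold` probability, a signature of degenerate repetition."""
    run = 0
    for p in token_probs:
        run = run + 1 if p > threshold else 0
        if run >= window:
            return True
    return False
```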
RL training utilized a diverse set of data and environments:
- Rule-based Verification Tasks:
- Mathematical Reasoning: ~50K high-quality, competition-level problems, filtered for difficulty and uniqueness.
- Logical Reasoning: ~53K samples across 41 tasks (e.g., cipher, Sudoku) generated using the SynLogic framework with controlled difficulty.
- Competitive Programming: 30K problems from online judges, with LLM-generated test cases where needed.
- Software Engineering: Thousands of samples derived from SWE-bench (GitHub issues/PRs), with a containerized sandbox for execution-based rewards.
- General Domain Tasks (Model-based Feedback): 25K complex samples.
- Tasks with Ground Truth: STEM/factual problems where rule-based checking is hard. A Generative Reward Model (GenRM) was used as a verifier.
- Tasks without Ground Truth: Instruction-following, creative writing. Pairwise comparison with a reference answer was used for reward.
- Addressing GenRM Length Bias: Length bias was monitored continuously online during RL training; if the policy began exploiting it, GenRM recalibration was triggered. RL-side techniques such as reward shaping and value clipping were also used (a minimal monitoring sketch follows this list).
- Curriculum Learning: RL training started with reasoning-intensive, rule-based tasks, then gradually incorporated general domain tasks.
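As referenced above, here is a minimal sketch of one way to monitor reward-model length bias online; the trigger statistic (reward-length correlation) and threshold are assumptions, since the paper's exact criterion is not reproduced here:

```python
# Hedged sketch of online length-bias monitoring for a generative reward model.
# The statistic (Pearson correlation) and the threshold are illustrative choices.
import statistics


def length_bias_score(rewards: list[float], lengths: list[int]) -> float:
    """Correlation between GenRM rewards and response lengths in a batch."""
    return statistics.correlation(rewards, [float(n) for n in lengths])


def needs_recalibration(rewards: list[float], lengths: list[int],
                        threshold: float = 0.5) -> bool:
    # If rewards track response length too closely, the policy may be gaming
    # the reward model by writing longer outputs; flag it for recalibration.
    return abs(length_bias_score(rewards, lengths)) > threshold
```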
The RL process was further scaled to extend the maximum generation length from 40K (MiniMax-M1-40k) to 80K tokens (MiniMax-M1-80k). This involved:
- Data Curation: Prioritizing challenging math/coding problems, downsampling synthetic reasoning data that destabilized long-context RL.
- Length Scaling Strategy: Staged window expansion from 40K to 80K tokens, based on perplexity convergence and output length percentiles.
- Addressing Training Instability: During scaling, "pattern collapse" (garbled text in the later parts of long sequences) occurred because negative samples grew longer faster than positive ones. Mitigations included: (1) early stopping of repetitive patterns, (2) combining sample-level and token-level loss normalization (sketched below), and (3) lowering the gradient-clipping threshold and $\epsilon^{\text{IS}}_{\text{high}}$.
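A minimal sketch of one plausible form of the combined normalization in item (2); the interpolation coefficient and the exact mixing used by the authors are assumptions:

```python
# Hedged sketch of mixing token-level and sample-level normalization of a
# per-token RL loss. The 50/50 mix is an illustrative assumption.
import torch


def mixed_normalized_loss(token_loss: torch.Tensor,
                          mask: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """token_loss, mask: [num_samples, max_len]; mask marks valid tokens."""
    # Token-level: average over every generated token in the batch, so long
    # responses contribute proportionally more tokens to the update.
    token_level = (token_loss * mask).sum() / mask.sum().clamp(min=1.0)
    # Sample-level: average within each response first, then across responses,
    # so very long (often negative) samples cannot dominate the gradient.
    per_sample = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    sample_level = per_sample.mean()
    return alpha * token_level + (1.0 - alpha) * sample_level
```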
The full RL training for MiniMax-M1 was completed in 3 weeks on 512 H800 GPUs, costing approximately $534,700.
Evaluation Results:
MiniMax-M1 models (40k and 80k) were evaluated on various benchmarks.
- The 80k model generally outperformed the 40k model, highlighting the benefit of scaling test-time compute.
- MiniMax-M1 demonstrated strong performance, comparable or superior to models such as DeepSeek-R1 and Qwen3-235B, especially in software engineering (56.0% on SWE-bench for M1-80k), agentic tool use (on TAU-bench, M1-80k surpasses Gemini 2.5 Pro on the airline domain), and long-context understanding (56.2% on OpenAI-MRCR at 1M context for M1-80k, surpassing o3 and Claude 4 Opus).
- On AIME 2024, M1-80k achieved 86.0%.
- The paper shows a strong correlation between accuracy gains and increased response length during RL scaling.
MiniMax-M1 models are publicly released, with support in the vLLM and Transformers frameworks, deployment guides, and a commercial API. The work positions MiniMax-M1 as a strong foundation for next-generation LLM agents thanks to its efficient test-time scaling and long-context capabilities.