
MiniMax-M1 Model

Updated 25 June 2025

MiniMax-M1 is a large-scale, open-weight hybrid-attention reasoning model that advances the state of the art in efficient long-context neural reasoning. It integrates a hybrid Mixture-of-Experts (MoE) architecture with a lightning attention mechanism, enabling both parameter scalability and linear computational complexity with respect to context length. The model is trained end-to-end using reinforcement learning (RL) with a novel algorithm, CISPO, which further improves RL sample and convergence efficiency.

1. Model Architecture: Hybrid MoE and Lightning Attention

MiniMax-M1 employs a hybrid MoE Transformer backbone with 32 experts, yielding a total parameter count of 456 billion, of which 45.9 billion are activated per token. Expert selection is performed per token by a learned top-k gating network that routes each token to a small subset of experts. This structure enables the model to scale parameter size independently of per-token computational cost, thus achieving both specialization and computational tractability.
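
A minimal sketch of this kind of top-k expert routing, in PyTorch for illustration; the layer sizes, the number of selected experts per token, and the dense routing loop are assumptions made for readability, not MiniMax-M1's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer: a learned gate scores all
    experts, each token is routed to its k highest-scoring experts, and only
    those experts are evaluated for that token."""
    def __init__(self, d_model=1024, d_ff=4096, num_experts=32, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (tokens, d_model)
        scores = self.gate(x)                        # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)     # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # accumulate each token's k expert outputs
            idx, w = topk_idx[:, slot], weights[:, slot:slot + 1]
            for e, expert in enumerate(self.experts):  # simple dense loop for clarity, not speed
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

x = torch.randn(8, 1024)
print(TopKMoE()(x).shape)  # torch.Size([8, 1024])
```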

The attention mechanism uses a hybrid stacking of self-attention blocks: one standard softmax attention block is interleaved for every seven lightning attention ("TransNormer"-style) blocks. Lightning attention operates in linear time with respect to sequence length, achieved via a kernel-based formulation:

$$\text{Att}(Q,K,V) = \phi(Q)\left[\phi(K)^\top V\right]$$

where $\phi(\cdot)$ is a feature map (typically an efficient softmax approximation), and the sum is computed incrementally along the sequence for efficient hardware utilization.
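
A minimal sketch of this incremental, kernelized computation in causal form. The $\mathrm{elu}+1$ feature map is an illustrative stand-in for the model's actual kernel, and MiniMax-M1's lightning attention processes blocks of tokens with additional normalization that this toy version omits:

```python
import torch
import torch.nn.functional as F

def linear_attention_causal(q, k, v):
    """Computes Att(Q,K,V) = phi(Q)[phi(K)^T V] incrementally: a running
    (d x d_v) state accumulates phi(k_t) v_t^T, so each step costs O(d * d_v)
    and a length-T sequence costs O(T * d * d_v) instead of O(T^2 * d)."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1          # illustrative feature map phi(.)
    T, d = q.shape
    d_v = v.shape[-1]
    state = torch.zeros(d, d_v)
    out = torch.empty(T, d_v)
    for t in range(T):                                 # real kernels tile this loop for hardware efficiency
        state = state + torch.outer(phi_k[t], v[t])    # S_t = S_{t-1} + phi(k_t) v_t^T
        out[t] = phi_q[t] @ state                      # o_t = phi(q_t) S_t
    return out

q, k, v = (torch.randn(2048, 64) for _ in range(3))
print(linear_attention_causal(q, k, v).shape)  # torch.Size([2048, 64])
```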

By blending softmax and linear (lightning) attention, MiniMax-M1 achieves both retention of modeling capacity and drastic reduction in attention compute for long sequences; this hybrid design underpins the model’s ability to scale effectively to million-token contexts.
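
A toy illustration of the interleaving pattern; only the 7:1 ratio comes from the text, and the block count here is arbitrary:

```python
def hybrid_layer_schedule(num_blocks, period=8):
    """Every `period`-th block uses full softmax attention; the rest use
    linear (lightning) attention."""
    return ["softmax" if (i + 1) % period == 0 else "lightning" for i in range(num_blocks)]

schedule = hybrid_layer_schedule(16)
print(schedule.count("lightning"), schedule.count("softmax"))  # 14 2
```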

2. Scalability: Parameters, Context Length, and Compute

MiniMax-M1 is engineered for extreme scaling:

  • Parameters: Total of 456B, with 45.9B MoE-activated per token.
  • Native context window: Up to 1 million tokens, an 8× increase over DeepSeek-R1 and far exceeding the capacities of models such as Qwen3-235B.
  • Inference efficiency: Because lightning attention scales linearly with sequence length, MiniMax-M1's inference FLOPs grow far more slowly with generation length than those of purely softmax-attention models. At 64K tokens it uses less than 50% of DeepSeek-R1's FLOPs; at 100K tokens, roughly 25% (a rough illustration of this scaling follows the list below).
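
As a back-of-the-envelope illustration of why the hybrid layout helps at long lengths, the sketch below compares per-layer attention FLOPs for a softmax-only stack against a 7:1 hybrid stack. The head count and head dimension are invented for the example, and the estimate ignores FFN/MoE compute, so it does not reproduce the cross-model figures quoted above:

```python
def attention_flops(seq_len, d_head=128, heads=64, softmax_every=8):
    """Rough attention cost averaged over blocks: softmax blocks cost ~T^2 * d
    per head, linear (lightning) blocks ~T * d^2 per head.
    softmax_every=None models a softmax-only stack; 8 models the 7:1 hybrid."""
    softmax_cost = heads * seq_len ** 2 * d_head
    linear_cost = heads * seq_len * d_head ** 2
    if softmax_every is None:
        return softmax_cost
    return softmax_cost / softmax_every + linear_cost * (softmax_every - 1) / softmax_every

for T in (64_000, 100_000, 1_000_000):
    ratio = attention_flops(T) / attention_flops(T, softmax_every=None)
    print(f"{T:>9} tokens: hybrid attention uses ~{ratio:.1%} of softmax-only attention FLOPs")
```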

These properties are particularly advantageous for tasks requiring extensive long-term memory, deep reasoning, or multi-step computation, such as codebase analysis, document review, and multi-turn planning.

3. RL Training Pipeline and the CISPO Algorithm

MiniMax-M1 is trained through a multi-stage RL pipeline:

  1. Continual pretraining: an additional 7.5T tokens to enhance reasoning and long-context performance.
  2. Supervised fine-tuning: the model is trained on a large set of chain-of-thought examples.
  3. Large-scale RL: Training on problem domains including mathematics, logic, software engineering, and real-world tool interaction.

Central to RL efficiency is the CISPO (Clipped Importance Sampling Policy Optimization) algorithm. Unlike PPO and its ratio-clipping variants, which may hamper learning in models that must explore rare but critical action paths (e.g., "fork" tokens in code or long reasoning chains), CISPO clips the importance sampling weights directly:

$$\hat{r}_{i,t}(\theta) = \operatorname{clip}\!\left(r_{i,t}(\theta),\; 1-\epsilon^{\mathrm{IS}}_{\mathrm{low}},\; 1+\epsilon^{\mathrm{IS}}_{\mathrm{high}}\right)$$

The RL objective becomes

$$\mathcal{J}_{\mathrm{CISPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\!\left(\hat{r}_{i,t}(\theta)\right) \hat{A}_{i,t}\, \log \pi_\theta\!\left(o_{i,t} \mid q,\, o_{i,<t}\right) \right]$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient and $\hat{A}_{i,t}$ is the (relative) advantage estimator. This approach preserves all token-level gradients (avoiding the vanishing-gradient issue on rare branches), stabilizes learning, and accelerates convergence, demonstrated empirically by a 2× speedup over DAPO in controlled Qwen2.5-32B experiments.
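
A minimal sketch of this objective for a single sampled response (the full objective averages token terms over a group of G responses), assuming new-policy log-probabilities, old-policy log-probabilities, and relative advantages are already computed; the clipping bounds and PyTorch framing are illustrative choices, not the paper's hyperparameters:

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=5.0):
    """CISPO for one response: clip the importance-sampling weight itself and
    apply stop-gradient to it, so every token keeps a policy-gradient term
    (nothing is zeroed out, unlike PPO-style ratio clipping).
    logp_new  : (T,) log pi_theta(o_t | q, o_<t), requires grad
    logp_old  : (T,) log-probs under the behavior (old) policy
    advantages: (T,) relative advantage estimates A_hat_t"""
    ratio = torch.exp(logp_new - logp_old)                              # r_t(theta)
    r_hat = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()  # sg(clip(r_t))
    # Negate so that minimizing the loss maximizes the CISPO objective.
    return -(r_hat * advantages * logp_new).mean()

# Toy usage with random stand-ins for real rollout statistics.
T = 16
logp_new = torch.randn(T, requires_grad=True)
logp_old = (logp_new + 0.1 * torch.randn(T)).detach()
advantages = torch.randn(T)
loss = cispo_loss(logp_new, logp_old, advantages)
loss.backward()
print(loss.item(), logp_new.grad.shape)
```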

The model's full RL training was completed within three weeks on 512 H800 GPUs, at a total rental cost of approximately $534,700.

4. Model Versions and Thinking Budgets

MiniMax-M1 is released in two configurations:

| Model Variant | Output Token Budget | Training Relationship |
|---|---|---|
| MiniMax-M1-40K | 40,000 | Intermediate checkpoint |
| MiniMax-M1-80K | 80,000 | Trained from 40K, extended |

"Thinking budget" denotes the maximum number of model-generated output tokens, which bounds the length of agentic reasoning or problem decomposition that the model can accomplish in a single session. The 80K version demonstrates superior performance on long-form and high-complexity problems relative to the 40K version, as evidenced in scaling experiments.

5. Benchmark Results

MiniMax-M1 achieves state-of-the-art or highly competitive results on a broad suite of benchmarks:

  • Mathematical reasoning: 86.0% accuracy on AIME 2024 (second among open-weight models, close to DeepSeek-R1-0528 at 91.4%).
  • Software engineering (SWE-bench Verified): 56.0% (slightly below DeepSeek-R1-0528 at 57.6%, significantly ahead of Qwen3-235B and others).
  • Long-context tasks: 73.4% on OpenAI-MRCR (128k) and 61.5% on LongBench-v2, both leading among public and open-weight models (and surpassing several proprietary offerings).
  • Tool/agent benchmarks: 62.0% on TAU-bench airline (higher than Gemini 2.5 Pro and other open-weight competitors).
  • Factuality and chain-of-thought: performance is competitive with the best open-source models and within close range of top proprietary systems.

The model exhibits notable specialization in complex software engineering tasks, tool integration, agentic workflows, and real-world code understanding, facilitated by its context and reasoning capacities.

6. Practical Applications

MiniMax-M1 is architecturally positioned for applications requiring large-context reasoning, extended agent trajectories, and integration with real-world tool environments, such as:

  • Complex software engineering (bug repair, code comprehension, codebase navigation)
  • Agentic tool-use tasks (planners, scientific assistants, software agents)
  • Deep chain-of-thought question answering and tutoring
  • Multi-turn workflow automation over ultra-long contexts
  • Analysis and generation in long documents (legal, financial, research)
  • Autonomous research assistant systems requiring high "thinking budgets"

These capabilities are directly supported by the model's hybrid attention layout, MoE scaling, and tailored RL fine-tuning.

7. Summary Table: MiniMax-M1 Model Features

| Aspect | Detail |
|---|---|
| Architecture | 456B hybrid MoE, 32 experts, interleaved softmax/lightning attention |
| Active Params per Token | 45.9B |
| Native Context Length | 1,000,000 tokens |
| Thinking Budget (output) | 40K and 80K tokens |
| RL Algorithm | CISPO (clips IS weights, not token updates) |
| RL Compute | 3 weeks on 512 H800 GPUs ($534,700 rental) |
| Top Benchmarks | SWE-bench, AIME 2024, LongBench-v2, TAU-bench, MRCR |
| Core Efficiency Feature | Lightning attention (linear compute scaling) |
| Applications | Software engineering, agentic tasks, document analysis, tool use |

8. Significance and Availability

MiniMax-M1 establishes a new standard for open-weight, large-scale reasoning models by enabling efficient deployment under very large context regimes. Its hybrid attention and MoE structures, along with the CISPO RL algorithm, collectively support highly efficient scaling of test-time compute for agentic reasoning, software engineering, and multi-turn long-context applications. Both model weights and documentation are publicly released at https://github.com/MiniMax-AI/MiniMax-M1, facilitating downstream research and development.