
MiniMax-M1 Model

Last updated: June 18, 2025

Introduction

The MiniMax-M1 model represents a substantial advance in efficient, large-scale, open-weight language modeling. It is designed to process extremely long input contexts and support complex reasoning tasks while controlling computational cost, addressing two core challenges in the deployment and practical use of contemporary LLMs. MiniMax-M1 builds on prior work by combining a hybrid Mixture-of-Experts (MoE) transformer architecture with a scalable, efficient "lightning attention" mechanism, and introduces a novel reinforcement learning (RL) fine-tuning approach, CISPO, to further improve training efficiency and reasoning depth. All statements in this article are sourced exclusively from the primary published research (MiniMax et al., 16 Jun 2025).

Background and Motivation

MiniMax-M1 is described as the first open-weight, large-scale hybrid-attention reasoning model. Its two main goals are:

  • Efficient scaling of compute for both training and inference.
  • Natively supporting very long input contexts (up to 1 million tokens).

As of its release, no other open-weight LLM supports a 1M-token context window. MiniMax-M1 is a direct successor to MiniMax-Text-01 and marks a major increase in model and context scale: it uses a total of 456 billion parameters, with only 45.9 billion parameters activated per token at inference, thanks to its MoE structure. The model supports direct deployment in two configurations, with 40K and 80K maximum output ("thinking budget") tokens, enabling both research and production use at varying compute budgets (Section 2, (MiniMax et al., 16 Jun 2025)).

Model Architecture: Hybrid Mixture-of-Experts and Lightning Attention

Hybrid Mixture-of-Experts (MoE) Architecture

MiniMax-M1 employs a transformer-based MoE architecture with 32 experts, where each input token is routed through a learned subset of experts. This results in substantial parameter savings: while the total model parameter count is 456B, only 45.9B parameters are involved per token. This approach makes it feasible to scale models to extreme sizes without a proportional increase in inference or training cost ((MiniMax et al., 16 Jun 2025), §3).
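As a concrete illustration, top-k expert routing can be sketched as follows. This is a minimal NumPy sketch under assumed details (top-2 gating, toy dimensions, dense per-token loop); the paper does not specify the router at this level, and MiniMax-M1's actual gating is learned as part of the full model.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    Illustrative sketch only: MiniMax-M1 uses 32 experts with a learned
    router; the top-k value and gating details here are assumptions.
    """
    logits = x @ gate_w                            # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the chosen experts
    sel = np.take_along_axis(logits, top, axis=-1)
    # Softmax over the selected experts' scores only
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                    # combine expert outputs per token
        for k in range(top_k):
            e = top[t, k]
            out[t] += weights[t, k] * experts[e](x[t])
    return out

# Toy usage: 4 tokens, model dim 8, 32 linear "experts"
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
gate_w = rng.standard_normal((8, 32))
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((8, 8)))
           for _ in range(32)]
y = moe_forward(x, gate_w, experts)
print(y.shape)
```

Only `top_k` of the 32 experts run per token, which is the mechanism behind activating 45.9B of 456B parameters.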

Lightning Attention Mechanism

Traditional transformer attention scales with the square of the sequence length ($\mathcal{O}(L^2)$). MiniMax-M1 implements "lightning attention", a linear-complexity attention mechanism ($\mathcal{O}(L)$), in most layers. Every eighth block is a standard transformer layer with softmax attention, preserving model capacity and expressivity. Lightning attention is implemented via the formula:

$$\text{Attention}(Q, K, V) = \phi(Q)\left(\phi(K)^T V\right)$$

where $\phi$ is a learned feature map ((MiniMax et al., 16 Jun 2025), §3). This hybrid-block sequence, inspired by prior research (Qin et al., 2024), allows M1 to process up to 1 million context tokens efficiently.
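The efficiency gain comes from associativity: computing $\phi(K)^T V$ first produces a fixed-size summary independent of sequence length, so cost grows linearly in $L$. A minimal sketch, assuming a simple stand-in feature map (the paper's lightning attention uses its own feature map and a blockwise kernel implementation):

```python
import numpy as np

def linear_attention(Q, K, V, phi=np.exp):
    """Linear-complexity attention in the form phi(Q) (phi(K)^T V).

    phi(K).T @ V is a (d x d_v) summary whose size does not depend on the
    sequence length L, so the cost is O(L) rather than O(L^2). phi = exp is
    only a stand-in feature map for this sketch.
    """
    return phi(Q) @ (phi(K).T @ V)   # associativity is the whole trick

# Toy check against the quadratic-order grouping (phi(Q) phi(K)^T) V
rng = np.random.default_rng(1)
L, d = 16, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = linear_attention(Q, K, V)
naive = (np.exp(Q) @ np.exp(K).T) @ V   # materializes the L x L matrix
print(out.shape)
```

Both groupings give the same result; only the linear-order one avoids materializing the $L \times L$ attention matrix.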

Efficiency and Parameter Activation

By leveraging the MoE strategy and the lightning attention architecture, MiniMax-M1 dramatically reduces active FLOPs compared to similarly large dense transformers. For instance, at 100K generated tokens, its inference compute requirements are only about 25% of those for DeepSeek-R1 ((MiniMax et al., 16 Jun 2025), Table 3).
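To see where such savings come from, a back-of-envelope model of attention-only cost is instructive. The layer count, head dimension, and cost constants below are illustrative assumptions, not the paper's accounting; the point is that with only one softmax layer in eight, the quadratic term shrinks by roughly 8x at long sequence lengths:

```python
# Back-of-envelope: attention-only cost at sequence length L for a stack
# where 1 in 8 layers uses O(L^2) softmax attention and the rest use O(L)
# linear attention. n_layers and d are illustrative, not MiniMax-M1's.
def attn_cost(L, n_layers=80, d=128):
    quad = L * L * d                 # per softmax-attention layer
    lin = L * d * d                  # per linear-attention layer
    n_soft = n_layers // 8
    hybrid = n_soft * quad + (n_layers - n_soft) * lin
    dense = n_layers * quad          # all-softmax baseline
    return hybrid / dense            # fraction of dense-attention cost

print(f"{attn_cost(100_000):.3f}")
```

At long lengths the ratio approaches 1/8 from the remaining softmax layers; the reported ~25% figure also folds in MoE activation and other per-layer costs.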

RL Fine-Tuning: CISPO Algorithm

MiniMax-M1 introduces CISPO (Clipped IS-weight Policy Optimization) as a novel RL fine-tuning approach that directly addresses inefficiencies found in previous RL methods. Prior techniques such as PPO and GRPO perform token-level gradient clipping, which may nullify useful learning signals for tokens that are rare but critical for complex reasoning (e.g., "However", "Recheck"). CISPO, in contrast, clips the importance sampling (IS) weights rather than the token updates, ensuring that crucial low-probability "fork" tokens retain the gradient information needed for multi-step reasoning:

$$\hat{r}_{i,t}(\theta) = \text{clip}\big(r_{i,t}(\theta),\ 1-\epsilon^{IS}_{\mathrm{low}},\ 1+\epsilon^{IS}_{\mathrm{high}}\big)$$

$$\mathcal{J}_{\text{CISPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \mathrm{sg}\big(\hat{r}_{i,t}(\theta)\big)\, \hat{A}_{i,t}\, \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \right]$$

((MiniMax et al., 16 Jun 2025), §3.2, eqs. (3) and (4)). In experiments with Qwen2.5-32B, CISPO achieves a twofold speedup over DAPO, reaching equivalent performance in half the training steps (see Figure 3, (MiniMax et al., 16 Jun 2025)).
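The two equations above can be sketched as a loss function. This is a conceptual NumPy sketch: the stop-gradient $\mathrm{sg}(\cdot)$ is meaningful only under autodiff (here the clipped ratio is simply a constant coefficient), and the clipping thresholds and toy values are assumptions, not the paper's hyperparameters.

```python
import numpy as np

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Sketch of the CISPO objective over one batch of sampled tokens.

    Clips the importance-sampling ratio r = pi_theta / pi_old (not the token
    update itself) and treats the clipped ratio as a constant, so every
    token, including low-probability 'fork' tokens, still contributes a
    gradient through log pi_theta. Thresholds here are illustrative.
    """
    r = np.exp(logp_new - logp_old)                 # IS weights r_{i,t}
    r_hat = np.clip(r, 1.0 - eps_low, 1.0 + eps_high)  # clipped, then sg(.)
    # Negated objective (a loss to minimize); under autodiff the gradient
    # would flow only through logp_new, matching the sg(r_hat) term.
    return -(r_hat * advantages * logp_new).mean()

# Toy usage with made-up token log-probs and advantages
logp_new = np.log(np.array([0.05, 0.4, 0.2]))
logp_old = np.log(np.array([0.10, 0.5, 0.1]))
adv = np.array([1.0, -0.5, 2.0])
loss = cispo_loss(logp_new, logp_old, adv)
print(loss)
```

Contrast with PPO/GRPO: there, a token whose ratio leaves the clip range is dropped from the update entirely; here it merely has its weight capped.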

Practical Scalability and Training Cost

A major feature of MiniMax-M1 is the efficient scaling of its training. Full RL training (including 1M context support) was completed in three weeks on 512 NVIDIA H800 GPUs, at a total rental cost of approximately $534,700 ((MiniMax et al., 16 Jun 2025), §5). This efficiency is particularly notable given the model's scale and context length.

Empirical Results and Benchmark Performance

Comprehensive evaluation places MiniMax-M1 among the leading open-weight LLMs across standard benchmarks:

| Task | DeepSeek-R1-0528 | Qwen3-235B-A22B | MiniMax-M1-40k | MiniMax-M1-80k |
|---|---|---|---|---|
| AIME 2024 | 91.4 | 85.7 | 83.3 | 86.0 |
| LiveCodeBench (coding) | 73.1 | 65.9 | 62.3 | 65.0 |
| SWE-bench Verified (software eng.) | 57.6 | 34.4 | 55.6 | 56.0 |
| TAU-bench (tool/agent use) | 53.5 | 34.7 | 60.0 | 62.0 |
| OpenAI-MRCR (128k context) | 51.5 | 27.7 | 76.1 | 73.4 |

((MiniMax et al., 16 Jun 2025), Table 2). On long-context benchmarks such as OpenAI-MRCR and LongBench-v2, MiniMax-M1 is the only open-weight model to natively process inputs of 1M tokens; it outperforms all open-weight competitors as well as OpenAI o3, ranking just behind Gemini 2.5 Pro. Across software engineering and agentic tool-use tasks, it demonstrates robust and often superior performance.

Scaling analysis further shows that the 80k "thinking budget" model consistently outperforms the 40k variant on complex and long-context tasks (Section 5, (MiniMax et al., 16 Jun 2025)).

Core Applications

MiniMax-M1 is evaluated and released with direct applicability to complex real-world tasks:

  • Software Engineering: Trained and assessed in sandboxed RL environments using real GitHub repositories (SWE-bench), focusing on bug localization, code repair, and test generation with execution-based rewards ((MiniMax et al., 16 Jun 2025), §4).
  • Long-Document and Codebase Analysis: The 1M-token context supports the ingestion of entire code repositories, extensive legal documents, or research articles in a single sequence.
  • Agent Tool Use and Multi-turn Dialogue: Its architecture supports a large "working memory" for agentic workflows where context retention and multi-step reasoning are critical.
  • General Reasoning: Demonstrates competitive results on tasks requiring math, logic, and open-ended generation.

Deployment is supported via model release on GitHub and HuggingFace, with compatibility for both vLLM and HuggingFace Transformers frameworks. A commercial API endpoint is also available ((MiniMax et al., 16 Jun 2025), §6).

Discussion and Future Trajectory

Architectural Innovations

The hybrid attention mechanism, in which linear-complexity attention and periodic softmax attention are combined, addresses a principal limitation in scaling context length efficiently. This approach is likely to influence future model designs, offering a balanced trade-off between efficiency and modeling power ((MiniMax et al., 16 Jun 2025), §3, Table 3).

RL Fine-Tuning Advances

By shifting from token-level gradient clipping to IS-weight clipping, CISPO improves stability and enables the learning of complex, multi-step reasoning patterns necessary for sophisticated agentic tasks ((MiniMax et al., 16 Jun 2025), §3.2).

Context Length and Scaling

Releasing models at different maximum output budgets (40k and 80k) provides a practical mechanism for scaling reasoning depth, making MiniMax-M1 suitable for both research settings and production at scale ((MiniMax et al., 16 Jun 2025), §2, §5).

Open-weight LLM Benchmarking

With its public release, large scale, documented performance, and efficient training methodology, MiniMax-M1 sets a reference standard for future open-weight LLM development.


Speculative Note

While MiniMax-M1 delivers state-of-the-art performance on software engineering, long-context, and agentic tool-use tasks, the source does not include independent evaluation of its instruction-following abilities, safety, or reliability in general-purpose dialogue outside these contexts. Claims about performance in multimodal or retrieval-augmented settings are not present in the release data and require further study ([citation needed]).


Conclusion

MiniMax-M1 integrates a hybrid Mixture-of-Experts transformer, efficient lightning attention for extreme context sizes, and a novel RL training algorithm (CISPO). Its design enables efficient scaling in both compute and context, achieving strong benchmark results in long-context, agentic, and software engineering domains (MiniMax et al., 16 Jun 2025). The model is available for public download in 40k and 80k thinking-budget versions, supporting community research and deployment at scale.