The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

Published 26 May 2026 in cs.AI, cs.CL, and cs.LG | (2605.26494v1)

Abstract: We introduce the MiniMax-M2 series, a family of Mixture-of-Experts LLMs built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

Abstract PDF Upgrade to Chat

Authors (206)

First 10 authors:

Summary

The paper presents a 229.9B-parameter decoder-only Transformer that activates only 9.8B parameters per token, matching larger systems in real-world benchmarks.
It employs a multi-domain, agent-driven post-training pipeline and interleaved thinking for efficient code synthesis, tool utilization, and autonomous ML engineering.
The integration of dynamic expert routing, prefix tree merging, and mixed-domain RL yields scalable, cost-efficient intelligence with robust self-evolution capabilities.

MiniMax-M2 Series: Efficient Agentic Intelligence via Mini Activations

Architectural Innovations

The MiniMax-M2 series leverages Mixture-of-Experts (MoE) architecture to reconcile high-capacity modeling with economical compute cost. The M2 backbone is a 229.9B-parameter, 62-layer decoder-only Transformer utilizing only 9.8B activated parameters per token. Each feed-forward layer contains 256 fine-grained experts, with sigmoid gating and expert-specific bias terms, enabling eight simultaneous activations per token and smooth routing dynamics. Attention employs full multi-head, GQA-style mechanism with RoPE, intentionally eschewing efficiency-centric hybrids due to observed degradation in multi-hop reasoning and long-context retrieval performance. The model sustains a 192K-token native context window, made feasible by a robust infrastructure stack and progressive long-context training.

Figure 1: Multi-Token Prediction (MTP) module architecture used in M2.

Multi-Token Prediction (MTP) is integrated throughout pre-training and RL, furnishing joint next- $K$ token prediction during both forward and speculative-decoding inference, accelerating throughput and enhancing agentic trajectory modeling.

Data Pipeline Engineering

The series is characterized by multi-domain, agent-driven post-training pipelines:

Agentic Coding: Automated SWE pipelines ingest real-world GitHub data, constructing verified Docker environments and task-specific reward functions grounded in test-case execution. The pipeline is diversified via augmentation—bug injection, code review, and SWE-Test conversion—maximizing problem and method space coverage.
Figure 2: The agentic coding data pipelines for SWE and AppDev tasks.
Application Development: Expert-in-the-loop query synthesis provides high-variety, production-aligned AppDev tasks. System prompts encode best practices, and the Agent-as-a-Verifier (AaaV) paradigm enables automated rejection sampling using sandboxed interaction, validating execution, interaction, and aesthetics.
Terminal-Gym: Stack Overflow-derived terminal tasks are synthesized with multi-stage environment and test generation, query evolution, and difficulty calibration, supplying verifiable agentic terminal scenarios for robust evaluation.
Agentic Cowork: Multi-tool, workspace-grounded agentic trajectories are distilled from diverse domains: deep-search and open-web research, office artifact creation, financial spreadsheet manipulation, and slide generation. Verification signals are tailored to artifact format, reinforced by pairwise comparison and multi-axis rubrics.
Reasoning: Deep scaling is performed across queries and responses, with OOD generalization monitored via solution diversity profiling and mixture ratio calibration. Rubric-driven QA ensures correctness and coverage at scale.

RL System and Infrastructure

MiniMax-M2 RL is executed on Forge, an agent-native system decoupling agent execution from training/inference, supporting both white-box and black-box agent architectures.

Figure 3: Overview of the Forge RL system. Three decoupled modules---Agent Side, Training/Inference Side, and the middleware abstraction layer (Gateway Server, Data Pool)---communicate through standardized interfaces, allowing the agent and the training loop to scale independently.

Critical algorithmic advancements include:

Windowed FIFO scheduling: Interpolates FIFO and greedy rollout scheduling for throughput-consistent sample distributions.
Figure 4: Windowed FIFO scheduling. The training scheduler only fetches completed trajectories within a sliding window of size $W$ over the generation queue: within the window, completion order is free (mitigating head-of-line blocking); across window boundaries, strict FIFO is enforced (preserving distributional consistency).
Prefix tree merging: Enables mathematically equivalent loss computation with up to 40x training speedup by sharing prefix computations among agentic trajectory branches.
Figure 5: Prefix tree merging. Completions that share a common prefix within a training batch are merged into a tree; the shared prefix is computed exactly once in the forward pass and the computation branches into individual response segments. The tree is deconstructed before the loss is computed independently per sample, so the procedure is mathematically equivalent to independent-sample training while eliminating redundant prefix recomputation.
Mixed-domain RL: Maintains broad knowledge and reasoning capacity alongside task-specific skill by joint optimization and progressive difficulty/context length scheduling.

Agentic Mechanisms: Interleaved Thinking and Self-Evolution

MiniMax-M2 operationalizes interleaved thinking as a trajectory protocol, alternating reasoning and tool-invocation blocks with persistent reasoning state, unlike front-loaded or stateless strategies.

Figure 6: The Plan-Act-Reflect loop of interleaved thinking in M2.

This method enables structured plan-act-reflect loops, supporting iterative self-correction and consistent planning across extended horizons.

M2.7 demonstrates early-stage self-evolution: it autonomously triages failed runs, edits agent scaffolds, and conducts multi-round self-upgrade cycles in ML engineering tasks.

Figure 7: (A) The Model Iteration System used to drive M2.7's autonomous execution. (B) The dual-loop workflow used by our RL team.

Performance and Capability Progression

Despite activating only $\sim$ 10\,B parameters per token, M2.7 achieves performance at or near parity with frontier closed-weight systems activating an order of magnitude more. Numerical highlights include:

SWE-bench Pro: 56.2
Terminal-Bench 2.0: 57.0
BrowseComp: 77.8
GDPval-AA: 50.0
AIME 2026: 94.2
GPQA-Diamond: 89.8
MLE Bench Lite: 66.6 medal rate
Figure 8: Performance of MiniMax-M2.7 versus closed-weight frontier baselines across agentic coding, agentic cowork, and reasoning knowledge benchmarks. With only $\sim$ 10\,B activated parameters, MiniMax-M2.7 remains competitive with substantially larger and more compute-intensive systems.

Capability accrual across M2, M2.5, M2.7 checkpoints tracks investments in agentic data, RL system engineering, and self-evolution, with pronounced improvements in deep-search, tool-use, and autonomous ML engineering.

Figure 9: Capability progression of the MiniMax-M2 series across eleven benchmarks. All eleven benchmarks improve across the three checkpoints, with the largest gains concentrated in deep-search, tool-use, and autonomous ML-engineering domains where data pipelines added new task families.

Case Study: Self-Evolution in Autonomous ML Engineering

In controlled MLE Bench Lite trials, M2.7 operates as an autonomous ML engineer, iterating over internal feedback to yield a 66.6\% medal rate, including multiple gold/silver medals, equaling Gemini 3.1 Pro.

Figure 10: Medal rate of M2.7 on MLE Bench Lite across iterative trials.

This empirically establishes autonomous scaffold improvement and iterative self-correction, closing the loop on agentic self-evolution in practice.

Implications and Future Directions

The MiniMax-M2 series concretely demonstrates that highly-scaled sparse MoE architectures, when combined with rigorous agentic trajectory pipelines, robust RL infrastructure, and persistent interleaved thinking, deliver frontier real-world intelligence with a fraction of the per-step compute. Theoretical implications suggest further exploration into expert specialization, dynamic gating strategies, and large-scale sparse RL. Practically, the system architecture supports deployment scenarios previously hindered by cost and latency, particularly in agentic code synthesis and multi-tool workflows. The observed self-evolution capacities presage a future where autonomous model-driven infrastructure iteration and improvement will be foundational to scaling intelligent systems. Subsequent M2.x releases are expected to expand data diversity, increase RL system complexity, and refine recursive self-improvement mechanisms, potentially advancing both sample efficiency and autonomous agent robustness.

Conclusion

The MiniMax-M2 series fuses mini activation design with agent-centered RL and self-evolution, achieving frontier-scale agentic intelligence. Its architectural, data, and system innovations validate the thesis that a sparse, dynamically-routed model, deeply integrated with agentic environments and agent-specific reward signals, can match closed-weight capacity models in real-world performance regimes. The implications for scalable, efficient agent deployment and continued autonomous improvement are substantial, and future research should systematically deepen each axis—data, RL, and self-evolution—in tandem (2605.26494).

Markdown Report Issue