MiniMax-01 Series: Scalable Multi-Modal Models
- MiniMax-01 Series is a line of large-scale open-weight models leveraging lightning attention and MoE architectures for efficient multi-million token processing.
- It integrates specialized modules for text, vision-language, and high-reasoning tasks to deliver state-of-the-art performance on diverse benchmarks.
- The series employs hybrid transformer backbones, staged training, and parallelism strategies to optimize computational efficiency and scaling in context processing.
The MiniMax-01 Series refers to a line of large-scale foundation models and associated methodologies, characterized primarily by the integration of efficient "lightning" linear attention mechanisms with Mixture-of-Experts (MoE) architectures to enable high-capacity, long-context neural computation. The series incorporates models for text (MiniMax-Text-01), vision-language (MiniMax-VL-01), and large-scale reasoning (MiniMax-M1), and includes algorithmic and systems-level innovations for scaling context windows to millions of tokens at low computational and latency costs. These models are publicly released as open-weight solutions and are designed for parity with or superiority over contemporary closed-source and open-source models on standard reasoning and multimodal benchmarks (MiniMax et al., 14 Jan 2025, MiniMax et al., 16 Jun 2025).
1. Architecture and Design Principles
At the core of the MiniMax-01 Series is a hybrid transformer backbone combining:
- Lightning Attention: An implementation of linear attention, based on the TransNormer variant with SiLU activations and gating, which computes attention as
$$O = \big(\phi(Q)\,[\phi(K)^{\top}\phi(V)]\big) \odot \sigma(G)$$
with $\phi(\cdot)$ denoting element-wise SiLU, $Q/K/V$ as queries/keys/values, $G$ as gate, and $\sigma$ as sigmoid. This operator enables strict linear complexity in sequence length $n$, supporting context windows up to 1 million tokens in training and extrapolation up to 4 million tokens at inference (MiniMax et al., 14 Jan 2025).
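- Mixture-of-Experts (MoE): Each transformer block replaces its standard multilayer perceptron with a top-2 gating MoE, deploying $E = 32$ experts of roughly 14B parameters each, with per-token routing via softmax-projected scores:
$$h_t = \sum_{i=1}^{E} \mathrm{Softmax}_i\big(\mathrm{TopK}(x_t W_g)\big)\,\mathrm{FFN}_i(x_t)$$
where $x_t$ is the token representation, $W_g$ the router weights, and $\mathrm{TopK}$ keeps the two largest scores and masks the rest, yielding a total model size of 456B parameters, of which 45.9B are active per token.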
- Parallel and Efficient Training: The series exploits a multi-axis parallelism scheme: pipeline (PP), data (DP), tensor (TP), and, for MoE layers, specialist axes for expert and expert-data parallelism. Flow control, computation–communication overlap, and all-gather routines are optimized for long-context throughput.
- Context Window Expansion: Rotary position embeddings (RoPE) are used with extended base frequencies to support multi-million token contexts. A staged pretraining approach progressively increases window length (128K → 512K → 1M tokens), ensuring gradient stability and strong performance on long-context benchmarks (MiniMax et al., 14 Jan 2025).
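The top-2 routing described in the Mixture-of-Experts item above can be illustrated with a minimal, standard-library sketch. It assumes the gate weights are a softmax over the two selected router logits; the helper names (`top_k_route`, `moe_forward`) and the toy experts are hypothetical and are not taken from the MiniMax code base.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1 for each token.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(4)]

logits = [0.1, 2.0, -1.0, 2.0]      # router scores for one token
routes = top_k_route(logits, k=2)   # experts 1 and 3 tie for the top score
y = moe_forward([1.0, -2.0], logits, experts)
```

Only the selected experts are evaluated, which is why the active parameter count per token is a small fraction of the total.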
2. Core Model Instances and Modalities
Text-Modality (MiniMax-Text-01):
Supports autoregressive language modeling for text-only data. Achieves context windows of 1M tokens during training and 4M at inference. Benchmarks show parity with GPT-4o and Claude-3.5-Sonnet at 0–32K context, and leading performance (e.g., 0.910 RULER@1M tokens) on long-context tasks.
Vision-Language Extension (MiniMax-VL-01):
Extends the text model with a ViT-L/14 (303M) vision encoder and a 2-layer projection adapter, supporting image patch and thumbnail embeddings. The LLM decoder processes joint vision-text sequences. Four-stage multimodal training covers modality alignment, vision understanding, user experience tuning, and preference optimization via DPO. The VL variant attains SOTA or near-SOTA scores on MMMU, ChartQA, DocVQA, OCRBench, and cross-modal benchmarks (MiniMax et al., 14 Jan 2025).
High-Reasoning Variant (MiniMax-M1):
Introduces a hybrid attention stack (7x lightning + 1x softmax), native 1M context, and efficient off-policy RL optimization (see Section 4) (MiniMax et al., 16 Jun 2025).
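As a small illustration of the 7:1 hybrid stack, the sketch below lists which attention type each layer would use. The function name `attention_layout` and the 16-layer depth are hypothetical choices for the example; the source does not state the layer count here.

```python
def attention_layout(num_layers, lightning_per_softmax=7):
    """Label each layer as 'lightning' or 'softmax'.

    Assumption: one softmax-attention layer follows every
    `lightning_per_softmax` lightning-attention layers.
    """
    period = lightning_per_softmax + 1
    return ["softmax" if (i + 1) % period == 0 else "lightning"
            for i in range(num_layers)]

# Hypothetical 16-layer stack: layers 7 and 15 (0-indexed) use softmax attention.
layout = attention_layout(16)
softmax_positions = [i for i, kind in enumerate(layout) if kind == "softmax"]
```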
3. Attention Scaling: Lightning Attention
Lightning attention, as instantiated in MiniMax-01, addresses the quadratic complexity bottleneck of softmax-based attention by applying a feature map (e.g., SiLU) and accumulating key-value products as a sequential prefix sum. The computation runs in tiled SRAM kernels with I/O-aware techniques to maximize Model FLOPs Utilization (>75% on NVIDIA H20). The approach achieves:
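- Time complexity: $O(n d^2)$, linear in sequence length $n$ for head dimension $d$,
- Memory: a recurrent key-value state of size $O(d^2)$, independent of $n$,
- Prefill latency for 1M tokens: 10–20× lower than in quadratic attention models,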
- Up to 32× longer context window at similar or better inference cost relative to dense transformers (MiniMax et al., 14 Jan 2025).
Periodic softmax attention blocks (one after every 7 lightning blocks) preserve global retrieval abilities, while ring-attention and LASP+ enable scalable variable-length sequence support with minimal communication overhead.
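The prefix-sum idea behind linear attention can be checked on a toy example. The sketch below is a plain-Python illustration, not the tiled GPU kernel: it omits normalization, gating, any decay factors, and multi-head handling, and the function names are hypothetical. It compares the recurrent form, which keeps a running $d \times d$ state, against the masked quadratic form.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def apply_silu(rows):
    return [[silu(x) for x in row] for row in rows]

def causal_linear_attention(Q, K, V):
    """Recurrent form: O(n * d^2) time with an O(d^2) running state."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of k_s^T v_s for s <= t
    out = []
    for q, k, v in zip(Q, K, V):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += k[a] * v[b]
        out.append([sum(q[a] * state[a][b] for a in range(d_k))
                    for b in range(d_v)])
    return out

def causal_quadratic_reference(Q, K, V):
    """Masked form [(Q K^T) * M] V: O(n^2 * d) time, used only as a check."""
    out = []
    for t, q in enumerate(Q):
        row = [0.0] * len(V[0])
        for s in range(t + 1):  # causal mask: attend to positions s <= t
            score = sum(qa * ka for qa, ka in zip(q, K[s]))
            row = [r + score * vb for r, vb in zip(row, V[s])]
        out.append(row)
    return out

# Toy inputs (arbitrary values), passed through the SiLU feature map.
Q = apply_silu([[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]])
K = apply_silu([[1.0, 0.4], [-0.7, 0.9], [0.2, -0.5]])
V = apply_silu([[0.3, 1.2], [0.8, -0.6], [-1.1, 0.5]])

fast = causal_linear_attention(Q, K, V)
slow = causal_quadratic_reference(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
```

Both functions return the same outputs up to floating-point rounding; only the recurrent form avoids work that grows quadratically with sequence length.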
4. Algorithmic Innovation: RL with CISPO
MiniMax-M1 introduces CISPO (Clipped Importance-Sampling Policy Optimization), a reinforcement learning scheme designed to enhance efficiency in long-horizon, chain-of-thought, and compositional reasoning:
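- Objective: Clipping is applied to the IS weight $r_{i,t}(\theta)$ (the per-token ratio of new-policy to old-policy probabilities), rather than to the surrogate objective or token gradients as in PPO/GRPO/DAPO, ensuring that all tokens, including rare decision forks, contribute to the update.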
- No KL or value function regularization is present.
- Empirically, CISPO achieves 2× faster RL convergence than DAPO on zero-RL math benchmarks and enables the RL training phase to complete in 3 weeks (512 H800 GPUs, $534,700 rental) compared with 6 weeks projected for dense alternatives.
This framework underpins the "thinking-budget" partitioning (40K/80K) in model variants, optimizing reasoning depth for complex tasks (MiniMax et al., 16 Jun 2025).
5. Benchmarks and Comparative Performance
The MiniMax-01 Series models demonstrate leading or parity performance with state-of-the-art models across tasks:
- Short-context: MiniMax-Text-01 matches/exceeds GPT-4o and Claude-3.5-Sonnet on MMLU (88.5), ComplexQA, and code reasoning (HumanEval, MBPP).
- Long-context: MiniMax-Text-01 leads RULER (0.910@1M tokens), LongBench-v2 (56.5 overall), and MR-NIAH (roughly 95% recall @1M).
- Vision-Language: MiniMax-VL-01 attains 68.5 MMMU, 91.7 ChartQA, and 96.4 DocVQA.
- Software engineering and tool use: MiniMax-M1-80K matches DeepSeek-R1 on SWE-bench (56.0%), leads on TAU-bench (62.0%), and outperforms Qwen3-235B by wide margins.
- Scalability: Lightning attention reduces inference FLOPs to about 25% of DeepSeek-R1's at a generation length of 100K tokens.
The table below summarizes selected context and output capacities:
| Model | Max Input | Max Output |
|---|---|---|
| DeepSeek-R1 (0528) | 128K | 64K |
| Qwen3-235B | 128K | 32K |
| MiniMax-M1-80K | 1M | 80K |
The models are publicly released at https://github.com/MiniMax-AI and evaluated against both standard (MMLU, LongBench) and in-house user-centric benchmarks (MiniMax et al., 14 Jan 2025, MiniMax et al., 16 Jun 2025).
6. Training Regime, Data Pipeline, and Cost
Pretraining uses a staged schedule with deliberate data curation and reward-based cleaning, critical batch doubling, and context ramps. Tokenization is byte-level BPE (vocab 200K, upsampled for multilinguality). Estimated resource usage includes 46 days on 2000 H800 GPUs for 1M window context, with efficient inference (1M tokens on 8×H800, 8-bit quantization) (MiniMax et al., 14 Jan 2025).
For MiniMax-M1, RL training is conducted entirely on 512 H800 GPUs, with staged thinking-budget increments. The cost structure is explicitly documented ($534,700 for 3 weeks), leveraging the architectural and RL efficiency introduced by lightning attention and CISPO.
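To make the CISPO clipping from Section 4 concrete, the sketch below compares the per-token coefficient that multiplies the gradient of the log-probability under a PPO-style clipped surrogate and under an IS-weight-clipped objective. It is a simplified illustration under stated assumptions: the clipped weight is treated as a constant (stop-gradient), the clipping bounds of 0.2 are hypothetical, and group normalization of advantages is left out.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def ppo_grad_coeff(ratio, advantage, eps=0.2):
    """Coefficient on grad log pi under the PPO clipped surrogate
    min(r * A, clip(r) * A). When the clipped branch is the active
    minimum it is constant in the policy, so the token gets no gradient.
    """
    unclipped = ratio * advantage
    clipped = clip(ratio, 1 - eps, 1 + eps) * advantage
    if clipped < unclipped:
        return 0.0
    return unclipped

def cispo_grad_coeff(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad log pi when the IS weight itself is clipped
    and detached: every token keeps a bounded, non-zero weight.
    eps_low / eps_high are hypothetical values for this example.
    """
    return clip(ratio, 1 - eps_low, 1 + eps_high) * advantage

# A token whose probability rose sharply under the new policy (ratio 1.5)
# and that carries a positive advantage.
dropped = ppo_grad_coeff(1.5, 1.0)   # PPO-style clipping removes its gradient
kept = cispo_grad_coeff(1.5, 1.0)    # IS-weight clipping keeps it, capped at 1.2

# Inside the trust region both schemes agree.
same_ppo = ppo_grad_coeff(1.0, 2.0)
same_cispo = cispo_grad_coeff(1.0, 2.0)
```

The contrast shows why tokens with large probability shifts, such as rare decision forks, still contribute to the update when only the IS weight is clipped.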
7. Significance and Prospects
The MiniMax-01 Series establishes a new standard for open-weight, high-capacity neural architectures, demonstrating that:
- Lightning (linear) attention can be instantiated at scale to handle multi-million token contexts without compromising on core benchmarks.
- MoE integration at this scale is tractable when paired with careful token routing, global load balancing, and expert-parallel communication.
- Hybrid block designs, staged data/positional expansion, and bespoke RL algorithms (CISPO) yield resource-efficient, robust reasoning agents.
- Empirically, these models extend the frontier in long-context reasoning, vision-language integration, and agentic tool-use, while remaining cost-competitive and publicly accessible (MiniMax et al., 14 Jan 2025, MiniMax et al., 16 Jun 2025).
The architectural, algorithmic, and scaling principles found in the MiniMax-01 Series are likely to serve as foundational templates for future multi-modal, long-context, and highly efficient foundation models.