Create a Video View Paper

MiniMax Sparse Attention: Orders-of-Magnitude Speedups for Ultra-Long Context LLMs

This lightning talk explains MiniMax Sparse Attention (MSA), a principled sparse attention mechanism that enables large language models to handle ultra-long contexts—up to millions of tokens—with minimal performance loss. We'll explore how MSA's dual-branch architecture couples a lightweight Index Branch with standard attention to achieve 28× reductions in attention FLOPs and 14× wall-clock speedups, all while matching dense-attention baselines across diverse benchmarks. The presentation covers the co-designed training procedure, efficient GPU implementation, and empirical validation at 109B-parameter scale, demonstrating a practical path to scalable long-context inference on commodity hardware.

Script

Transformer attention hits a wall at ultra-long contexts: quadratic complexity makes a million-token conversation computationally ruinous. MiniMax Sparse Attention shatters that barrier, delivering 28-fold reductions in attention operations while preserving performance across benchmarks.

The architecture is elegantly minimal: an Index Branch—just two projection matrices—scores all key-value blocks and selects the top 16 for each query group. The Main Branch then runs exact softmax attention, but only on those 2,048 tokens instead of the full million.

Training the non-differentiable selection requires a clever trick: the Index Branch learns to match the Main Branch's attention distribution through a KL alignment loss. Crucially, gradients from this auxiliary objective are blocked from flowing back into the backbone—without that firewall, the model degrades on short-context tasks.

Hardware efficiency comes from block-level regularity. By selecting at block granularity and iterating over key-value tiles instead of queries, MSA packs computations into tensor cores and achieves 14-fold prefill speedups and 7-fold decoding speedups over dense attention—measured wall-clock time, not just theory.

At 109 billion parameters, both from-scratch sparse training and dense-to-sparse conversion match full-attention performance across general reasoning, math, code, and multimodal benchmarks. The KL alignment loss is essential: models trained without it collapse on long-context retrieval despite looking fine on standard evaluations.

MiniMax Sparse Attention opens a practical path to million-token inference on commodity hardware, making ultra-long context viable for agentic systems and persistent memory without sacrificing the quality frontier. To dive deeper into this work and generate your own research videos, visit EmergentMind.com.