MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313v1)
Abstract: We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-LLM, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
- MiniMax (2 papers)
- Aonian Li (2 papers)
- Bangwei Gong (2 papers)
- Bo Yang (427 papers)
- Boji Shan (3 papers)
- Chang Liu (864 papers)
- Cheng Zhu (19 papers)
- Chunhao Zhang (5 papers)
- Congchao Guo (3 papers)
- Da Chen (42 papers)
- Dong Li (429 papers)
- Enwei Jiao (2 papers)
- Gengxin Li (4 papers)
- Guojun Zhang (43 papers)
- Haohai Sun (3 papers)
- Houze Dong (1 paper)
- Jiadai Zhu (3 papers)
- Jiaqi Zhuang (2 papers)
- Jiayuan Song (2 papers)
- Jin Zhu (35 papers)
Summary
The paper introduces the MiniMax-01 series of models, including MiniMax-Text-01 and MiniMax-VL-01, emphasizing their proficiency in handling extended contexts through lightning attention and efficient scaling techniques. These models integrate a Mixture of Experts (MoE) architecture, featuring 32 experts and a total of 456 billion parameters, with 45.9 billion parameters activated per token. The paper details the custom-designed parallel strategy alongside optimized computation-communication overlap methods tailored for both MoE and lightning attention, which facilitates training and inference on models of this scale across contexts extending to millions of tokens.
The MiniMax-Text-01 model supports context windows of up to 1 million tokens during training, which the authors claim can be extrapolated to 4 million tokens during inference. The vision-LLM, MiniMax-VL-01, is constructed via continued training with 512 billion vision-language tokens. Benchmark results, including in-house evaluations, show that the MiniMax-01 series performance is on par with state-of-the-art models such as GPT-4o and Claude-3.5-Sonnet, while offering a 20-32 times longer context window.
The paper addresses the limitations of current models, whose context windows typically range from 32K to 256K tokens, which may not suffice for tasks involving professional books, large programming projects, or extensive in-context learning. It notes that prior context window expansions primarily leveraged more powerful GPUs and I/O-aware softmax attention implementations. The paper cites the quadratic computational complexity of the transformer architecture as a barrier to further context window expansion, leading to exploration of methods to reduce attention mechanism complexity, such as sparse attention, linear attention, long convolutions, state space models, and linear RNNs.
The authors' approach is to match the performance of leading commercial models while significantly extending the context window, which requires balancing network architecture, data, and computation. After experimentation, the authors chose lightning attention, an I/O-aware implementation of a linear attention variant. The architecture employs a hybrid design, with one transformer block using softmax attention following every seven TransNormer blocks using lightning attention.
To manage the computational demands of processing over 1 million tokens, the model's total parameters were determined based on what the authors claim is a practical constraint: the ability to process this length on a single machine with up to 8 GPUs and 640GB memory using 8-bit quantization. MoE was implemented to maximize parameter and computation capacity.
The paper addresses the need to redesign the training and inference frameworks to accommodate the integration of lightning attention, softmax attention, and MoE. The all-to-all communication in MoE is implemented using expert parallelism (EP) and expert tensor parallelism (ETP). Varlen ring attention is designed to reduce computational redundancy, and Linear Attention Sequence Parallelism (LASP) is improved to better utilize device parallelism. Custom CUDA kernels tailored for lightning attention inference are also implemented, achieving over 75% Model Flops Utilization (MFU) on Nvidia H20 GPUs.
The pre-training of MiniMax-Text-01 involved curating a diverse corpus using data cleaning, reward-based quality enhancement, and data mixture balancing validated through repetition-aware testing. A three-stage training procedure extends the context window to one million tokens. The alignment phase incentivizes model capabilities through tuned reward dimensions and multi-stage training. Augmenting the LLM with visual capabilities creates MiniMax-VL-01, which undergoes additional training with 512 billion vision-language tokens, utilizing a four-stage training process.
The contributions of the paper include:
- A model rivaling top-tier closed-source models with support for context inputs up to 4 million tokens and strong performance in long-context evaluations.
- A large-scale implementation of linear attention, with algorithm design and engineering optimizations.
- Practical insights and methodology covering models, datasets, evaluations, and algorithms.
- Public release of weights and a cost-effective API.
The model architecture integrates linear attention and softmax attention mechanisms, using MoE with a global router for load balancing. The final MiniMax-Text-01 architecture has a transformer block with softmax attention after every seven TransNormer blocks with linear attention, totaling 80 layers. Each attention module has 64 heads with a head dimension of 128. The softmax attention layers use Group Query Attention (GQA) with a group size of 8. Rotary Position Embedding (RoPE) is applied to half of the attention head dimension, with a base frequency of 10,000. The hidden size is 6144, and each layer incorporates 32 experts with a top-2 routing strategy. The feed-forward network within each expert has a hidden dimension of 9216. In total, MiniMax-Text-01 uses 456 billion parameters, with 45.9 billion activated for each token.
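For concreteness, these hyper-parameters can be collected into a configuration sketch; the class and field names below are hypothetical and do not come from the released code, only the numeric values are taken from the paper.

```python
# Illustrative configuration sketch of the MiniMax-Text-01 layout described above.
# Names are hypothetical; only the numeric values come from the paper.
from dataclasses import dataclass


@dataclass
class HybridConfig:
    num_layers: int = 80          # total transformer blocks
    softmax_every: int = 8        # one softmax-attention block after every seven lightning blocks
    num_heads: int = 64
    head_dim: int = 128
    gqa_group_size: int = 8       # group query attention in the softmax layers
    rope_fraction: float = 0.5    # RoPE applied to half of the head dimension
    rope_base: float = 10_000.0
    hidden_size: int = 6144
    num_experts: int = 32
    top_k: int = 2
    expert_ffn_dim: int = 9216

    def layer_types(self):
        # seven lightning-attention blocks followed by one softmax-attention block, repeated
        return ["softmax" if (i + 1) % self.softmax_every == 0 else "lightning"
                for i in range(self.num_layers)]


assert HybridConfig().layer_types().count("softmax") == 10  # 80 / 8 softmax-attention layers
```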
The paper discusses the use of MoE to enhance scalability and efficiency compared to dense models. For an input token $x_t$, the corresponding output hidden state $h_t$ is calculated as:
$h_t = \sum_{i=1}^{E} \text{Softmax}_i\big(\text{TopK}(x_t \cdot W_g)\big) \cdot \text{FFN}_i(x_t),$
where $E$ is the total number of experts, $W_g$ is the gate weight, $\text{FFN}_i$ is the $i$-th expert, and $\text{TopK}(\cdot)$ preserves the top $k$ scores among all $E$ experts while setting the remaining scores to $-\infty$.
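A minimal PyTorch-style sketch of this routing rule (a generic top-k MoE forward pass, not the authors' implementation) might look as follows:

```python
# Generic top-k MoE forward pass following the equation above (illustrative, not the authors' code).
import torch
import torch.nn.functional as F


def moe_forward(x, gate_w, experts, k=2):
    """x: (tokens, hidden); gate_w: (hidden, E); experts: list of E feed-forward modules."""
    logits = x @ gate_w                              # router scores x_t . W_g, shape (tokens, E)
    topk_vals, topk_idx = logits.topk(k, dim=-1)     # keep the top-k scores; the rest act as -inf
    gates = F.softmax(topk_vals, dim=-1)             # softmax over the surviving scores
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e
            if mask.any():
                out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```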
The training of MoE-based LLMs is categorized into token-drop and dropless approaches; the paper adopts the token-drop strategy. To mitigate routing collapse, a global routing strategy is incorporated into the GShard auxiliary loss for better load balancing.
The auxiliary loss is defined as $L_{\text{aux}} = \alpha_{\text{aux}} \cdot \frac{1}{E} \sum_{i=1}^{E} f_i \cdot m_i$, where $\alpha_{\text{aux}}$ is the coefficient of the auxiliary loss, $f_i$ denotes the fraction of tokens assigned to the $i$-th expert, and $m_i$ is the average routing probability of expert $i$. A global token dispatching strategy is implemented across EP groups, using an allgather communication step to synchronize the number of tokens awaiting processing by each expert.
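As a hedged sketch, this auxiliary loss can be computed from the routing probabilities and token assignments roughly as below; the coefficient value is a placeholder, not the one used in the paper.

```python
# Auxiliary load-balancing loss L_aux = alpha * (1/E) * sum_i f_i * m_i (illustrative sketch).
import torch


def aux_balance_loss(router_probs, expert_assignment, num_experts, alpha=0.01):
    """router_probs: (tokens, E) routing probabilities; expert_assignment: (tokens,) chosen expert ids.
    alpha is a placeholder coefficient, not the value used in the paper."""
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(expert_assignment, minlength=num_experts).float() / expert_assignment.numel()
    # m_i: average routing probability of expert i
    m = router_probs.mean(dim=0)
    return alpha * (f * m).sum() / num_experts
```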
Linear attention transforms quadratic computational complexity into linear complexity using the "right product kernel trick." As an example, the NormAttention mechanism can be written as:
$O = \text{Norm}\big((QK^\top)V\big),$
where $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key, and value matrices, respectively, with $n$ the sequence length and $d$ the feature dimension. This can be transformed into its linear variant:
$O = \text{Norm}\big(Q(K^\top V)\big).$
The linear formulation enables efficient recurrent prediction with a training complexity of $O(nd^2)$.
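A small NumPy check (non-causal case, with the Norm omitted) makes the complexity argument concrete: the left product materializes an $n \times n$ attention matrix, while the right product only ever forms a $d \times d$ state.

```python
# Left product vs. right product ("kernel trick") for the non-causal case; Norm omitted.
import numpy as np

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))

left = (Q @ K.T) @ V      # builds an n x n attention matrix: O(n^2 d)
right = Q @ (K.T @ V)     # builds a d x d state instead:     O(n d^2)

assert np.allclose(left, right)  # identical result when no causal mask is applied
```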
Lightning attention is an I/O-aware, optimized implementation of TransNormer. It addresses the main efficiency bottleneck of existing linear attention mechanisms: the slow cumsum operation required for causal language modeling. It uses a tiling technique that circumvents the cumsum operation: the attention calculation is divided into intra-block and inter-block computations, with the left product used for intra-block operations and the right product for inter-block operations.
The forward pass in lightning attention is defined as:
$O = \big[(QK^\top) \odot M\big]V,$
where $M_{ts} = 1$ if $t \ge s$ and $0$ otherwise. The right product operation can be computed with the recursive formula:
$kv_0 = 0, \quad kv_t = kv_{t-1} + k_t v_t^\top, \quad o_t^\top = q_t^\top kv_t.$
The matrices Q,K,V are partitioned into two distinct blocks along the row dimension:
$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \quad X_1 \in \mathbb{R}^{m \times d}, \quad X_2 \in \mathbb{R}^{(n-m) \times d}, \quad X \in \{Q, K, V\}.$
The intra-block term $\big[(Q_1 K_1^\top) \odot M\big]V_1$ uses the left product, while the inter-block term $Q_1 KV_0$ uses the right product. The second block can be handled with the same strategy:
$kv_{m+t} = kv_m + \sum_{j=m+1}^{m+t} k_j v_j^\top, \quad t = 1, \ldots, n-m, \qquad o_{m+t}^\top = q_{m+t}^\top\, kv_{m+t},$
$O_2 = Q_2\, kv_m + \big[(Q_2 K_2^\top) \odot M\big]V_2 \triangleq Q_2\, KV_1 + \big[(Q_2 K_2^\top) \odot M\big]V_2.$
To compute the second block, the equation KV1=kvm is used, which can be computed by:
$KV_1 = KV_0 + \sum_{j=1}^{m} k_j v_j^\top = KV_0 + K_1^\top V_1, \quad \text{where } KV_0 = kv_0.$
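Putting the pieces together, the tiled forward pass can be sketched in plain NumPy as below; this is an illustration of the block decomposition, not the fused CUDA kernel.

```python
# Tiled causal linear attention: masked left product inside each block,
# right product through the running KV state across blocks (illustrative sketch).
import numpy as np


def lightning_forward(Q, K, V, block_size=128):
    n, d = Q.shape
    O = np.zeros_like(V)
    kv = np.zeros((d, d))                                # KV_0 = 0
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        q, k, v = Q[start:end], K[start:end], V[start:end]
        O[start:end] = q @ kv                            # inter-block term (right product)
        mask = np.tril(np.ones((end - start, end - start)))
        O[start:end] += ((q @ k.T) * mask) @ v           # intra-block term (masked left product)
        kv += k.T @ v                                    # KV_{b+1} = KV_b + K_b^T V_b
    return O


# Check against the naive masked left product O = [(QK^T) ⊙ M] V
n, d = 512, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))
naive = ((Q @ K.T) * np.tril(np.ones((n, n)))) @ V
assert np.allclose(lightning_forward(Q, K, V), naive)
```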
Scaling law experiments evaluated the scalability of lightning attention in comparison to softmax attention and verified performance on downstream tasks. A hybrid approach (Hybrid-lightning) substituted lightning attention with softmax attention at intervals of every eight layers to enhance retrieval performance.
A summary of model parameters and FLOPs is given in Table 1, together with the relationships between loss ($L$), optimal model size ($N_{\text{opt}}$), and optimal dataset size ($D_{\text{opt}}$) as functions of computational budget ($C$). The time complexity of lightning attention is $O(nd^2 + nBd)$, where $B$ is the block size. The scaling-law experiments, downstream performance, and speed comparisons led to the conclusion that while pure linear attention models are computationally efficient, they are not suitable for LLMs because of their inability to perform retrieval. The hybrid model matches and even surpasses softmax attention in both retrieval and extrapolation tasks.
The paper provides the following formulation of softmax attention: $O = \text{Softmax}\big(QK^\top/\sqrt{d}\big)\,V.$
This can be rewritten into a linear recurrent form as:
$s_t^0 = 0, \quad s_t^j = s_t^{j-1} + \exp\!\big(q_t k_j^\top/\sqrt{d}\big), \quad o_t^j = \frac{s_t^{j-1}}{s_t^j}\, o_t^{j-1} + \Big(1 - \frac{s_t^{j-1}}{s_t^j}\Big) v_j, \quad o_t = o_t^t, \quad j = 1, \ldots, t.$
The linear recurrent form of lightning attention is: $kv_0 = 0, \quad kv_j = kv_{j-1} + k_j v_j^\top, \quad o_j = kv_j^\top q_j, \quad j = 1, \ldots, t.$
The capacity of the recurrent state is $O(d)$ for softmax attention and $O(d^2/h)$ for lightning attention, where $h$ denotes the number of heads.
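The contrast between the two recurrences can be checked in code: the softmax recurrence keeps only a running scalar normalizer and a running output vector per query, while the lightning recurrence carries a $d \times d$ kv state (per head in the multi-head case). The snippet below verifies the softmax recurrence above against the standard formulation (illustrative only).

```python
# Recurrent form of causal softmax attention from the equations above (illustrative check).
import numpy as np


def softmax_attention_recurrent(Q, K, V):
    n, d = Q.shape
    scale = np.sqrt(d)
    O = np.zeros_like(V)
    for t in range(n):
        s, o = 0.0, np.zeros(d)                      # running normalizer and running output
        for j in range(t + 1):
            s_new = s + np.exp(Q[t] @ K[j] / scale)
            o = (s / s_new) * o + (1 - s / s_new) * V[j]
            s = s_new
        O[t] = o
    return O


n, d = 64, 16
Q, K, V = (np.random.randn(n, d) for _ in range(3))
scores = Q @ K.T / np.sqrt(d)
scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
assert np.allclose(softmax_attention_recurrent(Q, K, V), probs @ V)
```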
Module ablation experiments validated the module choices within the MoE architecture at a larger scale: hybrid-lightning attention versus softmax attention, and Pre-Layer Normalization (PreNorm) versus Post-Layer Normalization (PostNorm). PostNorm consistently outperformed PreNorm across all evaluated metrics, and DeepNorm is adopted to make PostNorm training more stable.
To determine optimal parameter allocations, the authors formulated the following optimization problem: $\min_{P_{\text{all}}, P_{\text{act}}} L(P_{\text{all}}, P_{\text{act}}, T) \quad \text{subject to} \quad C_{\text{compute}}(P_{\text{all}}, P_{\text{act}}, T) < C \quad \text{and} \quad P_{\text{all}} < 500\text{B}$, where $L$ denotes the loss, $P_{\text{all}}$ and $P_{\text{act}}$ represent the total and activated parameter counts respectively, $T$ is the number of training tokens, $C_{\text{compute}}$ denotes the computational cost, and $C$ signifies the budget constraint. The following formula was also proposed:
$L(P_{\text{act}}, T \mid E) = d + a\, P_{\text{act}}^{\alpha} + b\, T^{\beta} + c\, (P_{\text{act}} T)^{\gamma}$, where $L(P_{\text{act}}, T \mid E)$ represents the loss conditioned on the number of experts, while $a$, $b$, $c$, $d$, $\alpha$, $\beta$, and $\gamma$ are parameters to be fitted in relation to the number of experts.
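As a hedged illustration of how such a parametric form could be fitted per expert count, the sketch below uses synthetic placeholder observations and a generic least-squares minimizer; none of the numbers correspond to the paper's fitted values.

```python
# Fitting L(P_act, T | E) = d + a*P_act^alpha + b*T^beta + c*(P_act*T)^gamma
# on placeholder data (illustrative sketch; not the paper's data or fitted values).
import numpy as np
from scipy.optimize import minimize


def loss_surface(theta, P_act, T):
    d, a, b, c, alpha, beta, gamma = theta
    return d + a * P_act**alpha + b * T**beta + c * (P_act * T)**gamma


# Synthetic stand-ins for measured (activated params, training tokens, loss) triples.
rng = np.random.default_rng(0)
P_act = rng.uniform(1e9, 5e10, size=32)
T = rng.uniform(1e10, 1e12, size=32)
true_theta = (1.7, 400.0, 300.0, 200.0, -0.3, -0.28, -0.16)   # arbitrary, for illustration only
L_obs = loss_surface(true_theta, P_act, T)

fit = minimize(lambda th: np.mean((loss_surface(th, P_act, T) - L_obs) ** 2),
               x0=(1.5, 100.0, 100.0, 100.0, -0.2, -0.2, -0.1),
               method="Nelder-Mead", options={"maxiter": 50_000, "xatol": 1e-10, "fatol": 1e-12})
fitted_theta = fit.x   # one such fit would be produced for each expert count E
```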
The paper discusses the computational optimization strategies for training and inference, noting three challenges: mitigating all-to-all (a2a) communication overhead during MoE training, distributing tokens within a large context window across GPUs, and managing real-world batched inputs with variable sequence lengths and prefix caching for lightning attention.
A token-grouping-based overlap scheme is implemented to reduce communication overhead. An ETP (Expert Tensor Parallel) process group is designed to manage weight partitioning of experts, and an EDP (Expert Data Parallel) process group is designed to encapsulate the data parallelism of identical experts. An EP-ETP overlap strategy is designed to maximize network and computational resource utilization.
For long context optimization, "data-packing" is used to concatenate different samples end-to-end along the sequence dimension. Varlen Ring Attention is implemented to apply the ring attention algorithm directly to the entire sequence after data-packing. The LASP (Linear Attention Sequence Parallelism) algorithm is improved to create LASP+ which transforms serial computation into parallelized computation to eliminate dependencies during the computation process. StridedBatchedMatmul operations are managed to ensure high performance and versatility across diverse hardware architectures.
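A minimal sketch of the data-packing step is shown below; the `cu_seqlens` boundary convention is borrowed from common varlen attention interfaces and is an assumption here, not necessarily the authors' exact format.

```python
# Pack variable-length samples end-to-end along the sequence dimension and record
# cumulative boundaries so attention can be restricted to each sample (illustrative sketch).
import numpy as np


def pack_samples(samples):
    packed = np.concatenate(samples, axis=0)                     # (total_tokens, hidden)
    cu_seqlens = np.cumsum([0] + [s.shape[0] for s in samples])  # sample boundaries
    return packed, cu_seqlens


samples = [np.random.randn(n, 64) for n in (5, 3, 7)]
packed, cu_seqlens = pack_samples(samples)
assert packed.shape == (15, 64) and cu_seqlens.tolist() == [0, 5, 8, 15]
```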
For lightning attention inference, the paper discusses four optimization strategies: batched kernel fusion, separated prefill and decoding execution, multi-level padding, and strided batched matmul extension.
The pre-training corpus for MiniMax-Text-01 is a comprehensive and meticulously curated dataset drawing on diverse sources, including academic literature, books, web content, and programming code, with corpus quality improved through data quality enhancement, data formatting optimization, and data mixture investigation. The model parameters are initialized using the Xavier initialization method, and the AdamW optimizer is employed.
The paper uses a three-stage training procedure to systematically upsample long-context data across diverse length ranges, preserving the distributional characteristics of critical domains to keep short-context evaluation performances steady.
The post-training framework enhances the model's general performance, long-context capability, and real-world applicability. The training process uses Supervised Fine-Tuning (SFT) and Offline and Online Reinforcement Learning (RL). Model safety is ensured through data mining techniques and a harmless reward model. A multi-stage training methodology enhances the model's capacity to process extended contexts while maintaining optimal performance on shorter sequences.
The reward model framework evaluates responses across correctness, truthfulness, helpfulness, and harmlessness. The SFT dataset construction involves a multi-stage process utilizing domain-specific expert models trained through iterative SFT and RL cycles. The offline RL phase uses Direct Preference Optimization (DPO), and online RL is implemented to further improve model performance, particularly in mathematical reasoning tasks, using a modified Group Relative Policy Optimization (GRPO) approach.
The safety alignment of the model is addressed throughout both the SFT and RL stages.
The model demonstrates its strengths on academic benchmarks such as MMLU and MMLU-Pro, SimpleQA and C-SimpleQA, GPQA, and DROP, as well as in mathematics and coding. Its long-context abilities are evaluated through Long-Context Retrieval, Long-Context Understanding, and Long In-Context Learning benchmarks. A case study based on user interactions in the Hailuo AI end-to-end evaluation showed performance improving from 58% to 71.5% after integrating search tools.
The paper discusses how the vision-LLM, MiniMax-VL-01, was developed by integrating an image encoder and an image adapter into the MiniMax-Text-01 model. To pre-train the vision encoder, the authors curated a substantial image-caption dataset by aggregating and filtering data from internet sources and trained the Vision Transformer (ViT) on 694 million unique image-caption pairs. Description data serves as a robust resource for modality alignment and for enhancing understanding in further training. To train MiniMax-VL-01, the authors constructed a comprehensive and diverse instruction-based dataset by synthesizing an extensive range of QA pairs involving visual inputs. A four-stage training strategy was used to progressively develop comprehensive multimodal understanding capabilities while retaining the model's language skills.
The MiniMax-VL-01 architecture consists of a Vision Transformer (ViT) with 303 million parameters, a two-layer MLP projector initialized randomly, and the MiniMax-Text-01 model. A dynamic resolution strategy is implemented by resizing the input image according to a predefined grid configuration list, ranging from 336×336 to 2016×2016, while maintaining a standard thumbnail at a resolution of 336×336. The resized images are subsequently partitioned into non-overlapping patches, each measuring 336×336.
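A rough sketch of such a dynamic-resolution pipeline is given below; the grid-selection heuristic and the nearest-neighbour resize are illustrative assumptions, not the authors' exact procedure.

```python
# Dynamic resolution sketch: resize to a 336-multiple grid, split into 336x336 patches,
# and keep a 336x336 thumbnail. Grid selection and resizing are illustrative choices.
import numpy as np

PATCH = 336
GRIDS = [(PATCH * r, PATCH * c) for r in range(1, 7) for c in range(1, 7)]  # 336x336 .. 2016x2016


def nn_resize(img, h, w):
    ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
    xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
    return img[ys][:, xs]


def dynamic_patches(image):
    h, w = image.shape[:2]
    # hypothetical heuristic: closest aspect ratio first, then closest area
    gh, gw = min(GRIDS, key=lambda g: (abs(g[0] / g[1] - h / w), abs(g[0] * g[1] - h * w)))
    resized = nn_resize(image, gh, gw)
    patches = [resized[i:i + PATCH, j:j + PATCH]
               for i in range(0, gh, PATCH) for j in range(0, gw, PATCH)]
    thumbnail = nn_resize(image, PATCH, PATCH)
    return patches, thumbnail


patches, thumb = dynamic_patches(np.zeros((900, 1200, 3)))
assert all(p.shape[:2] == (PATCH, PATCH) for p in patches) and thumb.shape[:2] == (PATCH, PATCH)
```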
The vision encoder is a lightweight ViT-L/14 which has been trained from scratch. The architecture is particularly effective in capturing intricate visual details and the complex interrelationships within images. Contrastive learning was used to enhance the alignment between corresponding image-caption pairs while diminishing the alignment between non-corresponding pairs.
Related Papers
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (2023)
- Scalable and Efficient MoE Training for Multitask Multilingual Models (2021)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024)
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (2024)
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance (2024)
HackerNews
- MiniMax-01: Scaling Foundation Models with Lightning Attention (3 points, 1 comment)
- MiniMax-01: Scaling Foundation Models with Lightning Attention. "our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window" (116 points, 17 comments)
- [2501.08313] MiniMax-01: Scaling Foundation Models with Lightning Attention (58 points, 32 comments)