- The paper introduces Mixture of Block Attention (MoBA), a novel attention mechanism that applies Mixture of Experts principles to partition context into blocks and dynamically select relevant blocks for efficient long-sequence processing in large language models.
- Evaluated on a Llama 3.1 8B model adapted to a 1M-token context (Llama-8B-1M-MoBA), the mechanism achieves performance comparable to full attention on downstream tasks while demonstrating significant efficiency gains: up to 6.5x faster prefilling at 1M tokens and 16x faster attention computation at 10M tokens.
- Scaling law experiments show MoBA scales comparably to full attention, and hybrid training approaches combining MoBA with full attention can further improve fine-tuning performance.
The paper introduces Mixture of Block Attention (MoBA), a novel attention mechanism for efficient processing of long sequences in LLMs. MoBA applies the Mixture of Experts (MoE) principle to the attention mechanism, partitioning the context into blocks and dynamically selecting relevant blocks for each query token. This approach contrasts with existing methods that either impose predefined structural biases or use linear approximations of the attention mechanism. The authors adhere to a "less structure" principle, allowing the model to autonomously determine where to attend.
The authors detail the MoBA architecture, beginning with a review of standard attention in Transformers, where a single query token q ∈ R^{1×d} attends to N key and value tokens K, V ∈ R^{N×d}, respectively, and the standard attention is computed as:
\begin{equation}
\mathrm{Attn}(q, K, V) = \mathrm{Softmax}\left(q K^\top\right) V,
\end{equation}
where d denotes the dimension of a single attention head.
In contrast, MoBA enables each query token to attend only to a subset of keys and values:
\begin{equation}
\mathrm{MoBA}(q, K, V) = \mathrm{Softmax}\left(q K[I]^\top\right) V[I],
\end{equation}
where I ⊆ [N] is the index set of the selected keys and values.
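As a concrete illustration, the following is a minimal sketch (not the paper's implementation) of this sparse attention for a single query, assuming the index set I has already been determined; the 1/√d scale is omitted to match the equations above.

```python
import torch

def moba_attend(q, K, V, I):
    """q: (1, d) query; K, V: (N, d) keys/values; I: LongTensor of selected key/value indices."""
    K_I, V_I = K[I], V[I]                    # gather only the selected keys/values
    scores = q @ K_I.T                       # (1, |I|); the 1/sqrt(d) scale is omitted as in the text
    weights = torch.softmax(scores, dim=-1)  # softmax restricted to the selected set
    return weights @ V_I                     # (1, d) attention output
```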
The key innovation involves block partitioning and selection. The full context of length N is divided into n blocks, each of size B = N/n. The i-th block spans the range:
\begin{equation}
I_i = \left[(i-1)\times B+1, i \times B\right].
\end{equation}
A top-k gating mechanism, inspired by MoE, is employed to enable each query to selectively focus on tokens from different blocks:
\begin{equation}
I = \bigcup_{g_i > 0} I_i.
\end{equation}
The model uses the gate values g_i to select the most relevant blocks for each query token. The gate value for the i-th block is computed by:
\begin{equation}
g_i = \begin{cases}
1 & s_i \in \mathrm{Topk}\left(\{s_j \mid j\in [n]\}, k\right) \\
0 & \text{otherwise}
\end{cases},
\end{equation}
where Topk(·, k) denotes the set containing the k highest of the affinity scores calculated for each block. The score s_i is computed as the inner product between q and the mean pooling of K[I_i] along the sequence dimension:
\begin{equation}
s_i = \left\langle q, \mathrm{mean\_pool}\left(K[I_i]\right)\right\rangle.
\end{equation}
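To make the gating concrete, here is a minimal sketch of block partitioning, mean-pooled affinity scores, and top-k selection for a single query, assuming n divides N evenly and ignoring the causal constraints discussed next; function and variable names are illustrative.

```python
import torch

def select_blocks(q, K, n_blocks, k):
    """q: (1, d) query; K: (N, d) keys. Returns the selected key/value index set I."""
    N, d = K.shape
    B = N // n_blocks                                  # block size B = N / n
    centroids = K.view(n_blocks, B, d).mean(dim=1)     # mean_pool(K[I_i]) for each block, shape (n, d)
    s = (q @ centroids.T).squeeze(0)                   # affinity scores s_i, shape (n,)
    top = torch.topk(s, k).indices                     # blocks with gate g_i = 1
    I = torch.cat([torch.arange(i * B, (i + 1) * B) for i in top.tolist()])
    return I.sort().values                             # union of the selected block ranges
```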
To maintain causality in autoregressive LLMs, MoBA incorporates two specific designs: (1) no attention to future blocks, ensuring a query token cannot be routed to any future blocks, and (2) current block attention with causal masking, where each token is routed to its respective current block, and a causal mask is applied during the current block attention.
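A hedged sketch of how these two constraints could enter the routing for the query at position t is given below; details such as whether the current block counts toward the k selected blocks are assumptions made for illustration, not taken from the paper.

```python
import torch

def causal_select_blocks(q, K, t, n_blocks, k):
    """Block routing for the query at position t (0-indexed). q: (1, d); K: (N, d)."""
    N, d = K.shape
    B = N // n_blocks
    cur = t // B                                       # index of the query's current block
    centroids = K.view(n_blocks, B, d).mean(dim=1)
    s = (q @ centroids.T).squeeze(0)
    s[cur:] = float("-inf")                            # (1) gating never routes to current or future blocks
    k_hist = min(k - 1, cur)                           # assumption: the current block occupies one of the k slots
    hist = torch.topk(s, k_hist).indices if k_hist > 0 else torch.empty(0, dtype=torch.long)
    I_hist = (torch.cat([torch.arange(i * B, (i + 1) * B) for i in hist.tolist()])
              if len(hist) > 0 else torch.empty(0, dtype=torch.long))
    I_cur = torch.arange(cur * B, t + 1)               # (2) current block, causally truncated at position t
    return torch.cat([I_hist, I_cur]).sort().values
```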
The implementation leverages optimization techniques from FlashAttention and MoE. The algorithm proceeds in five steps: (1) determine the assignment of query tokens to key/value (KV) blocks; (2) arrange the ordering of query tokens based on their assigned KV blocks; (3) compute attention outputs for each KV block and its assigned query tokens using FlashAttention with variable lengths; (4) rearrange the attention outputs back to their original ordering; and (5) combine the corresponding attention outputs using online softmax.
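The online-softmax combination in the final step can be illustrated with a short sketch; the rescaling trick is the same one used in FlashAttention, and the names and shapes here are illustrative rather than the paper's code.

```python
import torch

def merge_partials(o1, lse1, o2, lse2):
    """o1, o2: (num_q, d) partial attention outputs; lse1, lse2: (num_q,) log-sum-exp of their scores."""
    lse = torch.logaddexp(lse1, lse2)         # combined normalizer in log space
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)  # rescale the first partial output
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)  # rescale the second partial output
    return w1 * o1 + w2 * o2, lse             # numerically stable combined output and its log-sum-exp
```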
The authors conduct scaling law experiments and ablation studies to validate key design choices. Scalability is assessed by comparing the validation loss of LLMs trained using either full attention or MoBA. Following the Chinchilla scaling law, five LLMs of varying sizes are trained. The validation loss curves for MoBA and full attention display similar scaling trends, with MoBA achieving comparable scaling performance despite its sparse attention pattern.
To assess long-context capability, the LM loss of trailing tokens is evaluated. As the sequence length increases from 8k to 32k, MoBA exhibits a marginally higher last-block LM loss than full attention, but the gap progressively narrows.
Ablation studies on block granularity reveal that MoBA's performance is significantly affected by block granularity, with finer-grained segmentation enhancing performance.
The flexibility of MoBA as a substitute for full attention is explored through hybrid training approaches. A two-stage recipe, involving MoBA training followed by full attention training, achieves a loss nearly identical to that of full attention. Layer-wise hybrid strategies, switching the last several Transformer layers from MoBA to full attention during supervised fine-tuning (SFT), are shown to reduce SFT loss.
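As an illustration of the layer-wise hybrid strategy, a minimal configuration sketch follows; the number of full-attention layers and the helper name are hypothetical.

```python
def choose_attention(layer_idx, num_layers, num_full_layers=3):
    """Return which attention variant a given Transformer layer uses during SFT."""
    return "full" if layer_idx >= num_layers - num_full_layers else "moba"

# e.g. for a 32-layer model: layers 0..28 keep MoBA, layers 29..31 switch to full attention
attention_plan = [choose_attention(i, num_layers=32) for i in range(32)]
```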
The authors evaluate MoBA across a variety of real-world downstream tasks, using the Llama 3.1 8B Base Model as the starting point. The model, termed Llama-8B-1M-MoBA, undergoes long-context pre-training with context lengths gradually increasing to 1M tokens. Results on benchmarks, such as AGIEval, BBH, CEval, GSM8K, HellaSWAG, and RULER, demonstrate that Llama-8B-1M-MoBA exhibits a performance highly comparable to that of Llama-8B-1M-Full.
Efficiency is examined by comparing the forward pass time of the attention layer in Llama-8B-1M-MoBA and Llama-8B-1M-Full. MoBA is shown to be more efficient than full attention across all context lengths, achieving a speedup ratio of up to 6.5x when prefilling 1M tokens. Scaling the context length to 10 million tokens demonstrates MoBA's superior efficiency compared to standard Flash Attention. At 10M tokens, MoBA achieves a 16x reduction in attention computation time.
The related work section discusses static sparse patterns, dynamic sparse patterns, training-free sparse attention, and models using alternate architectures beyond traditional attention.
In conclusion, MoBA enhances the efficiency and scalability of LLMs for long-context tasks. Future work may explore optimizations of MoBA's block-selection strategies, investigate its application to other modalities, and study its potential for improving generalization in complex reasoning tasks.