Overview of "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression"
The paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" addresses the efficiency challenges LLMs face on long sequences. Sparse attention reduces memory and computation, but existing methods typically impose a single uniform sparse pattern across all attention heads and input lengths. That uniformity ignores the diverse attention behaviors of individual heads and cannot adapt to varying accuracy-latency requirements, leaving much of the potential of sparsification unrealized.
Key Contributions
The authors propose Mixture of Attention (MoA), a method that automatically calibrates a distinct sparse attention configuration for each attention head and layer of an LLM. Its central contributions are as follows:
- Heterogeneous Elastic Rules: Rather than a single uniform pattern, MoA assigns each attention head its own elastic rule that scales the sparse attention span with the input sequence length. Some heads widen toward a global view as inputs grow, while others stay focused on local context (a minimal sketch of such a rule appears after this list).
- Construction of a Calibration Dataset: The authors stress that compression must be calibrated on appropriate data. They build a calibration set with long-range dependencies and use the dense model's own outputs, rather than human-written responses, as the reference, which profiles the effect of compression on the LLM more faithfully.
- Automatic Optimization Pipeline: MoA profiles the influence of individual attention values on prediction loss, then searches over candidate configurations to find, for each head, the elastic rule that minimizes loss under a predefined density budget (a toy version of this selection step is sketched after this list).
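To make the elastic-rule idea concrete, the sketch below assumes each head's span is a linear function of input length, span_h(N) = alpha_h + beta_h * N, combined with a sliding-window causal mask. The rule form, parameter values, and mask construction are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a heterogeneous elastic span rule (illustrative, not the
# authors' reference implementation). Assumption: each head's attention span
# is a linear function of input length, span_h(N) = alpha_h + beta_h * N.
import torch


def elastic_span(alpha: float, beta: float, seq_len: int) -> int:
    """Attention span of one head at a given input length, clamped to [1, seq_len]."""
    return max(1, min(seq_len, int(alpha + beta * seq_len)))


def sliding_window_mask(seq_len: int, span: int) -> torch.Tensor:
    """Causal mask in which each query attends only to its last `span` keys."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions
    k = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (k <= q) & ((q - k) < span)      # [seq_len, seq_len] boolean mask


# Two hypothetical heads: one stays local, one widens with input length.
for n in (1024, 8192):
    local_span = elastic_span(alpha=64, beta=0.0, seq_len=n)
    global_span = elastic_span(alpha=64, beta=0.5, seq_len=n)
    mask = sliding_window_mask(n, global_span)
    density = mask.sum().item() / mask.numel()
    print(f"N={n}: local head span={local_span}, global head span={global_span}, "
          f"global-head mask density={density:.2f}")
```

A span rule of this shape also bounds how much of the KV cache each head needs to keep during decoding, which is one way heterogeneous spans can translate into memory and throughput savings.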
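The selection step can be pictured as a budgeted search over candidate rules. The toy sketch below assumes profiling has already been done, i.e., that comparing sparse outputs against the dense model's own responses on the long-dependency calibration data has produced, for each head, a list of candidate rules with an associated loss and density. A simple greedy allocation then spends the density budget where it buys back the most loss; the paper formulates this as an optimization over elastic rules across lengths, so this greedy stand-in is only illustrative.

```python
# Toy sketch of choosing one rule per head under an average-density budget.
# `profiled` maps head -> list of (loss, density) for candidate rules, sorted
# from sparsest to densest. The profiling itself is not shown, and this greedy
# allocation stands in for the paper's actual optimization.

def select_rules(profiled: dict, density_budget: float) -> dict:
    """Pick a candidate index per head so the mean density stays within budget."""
    n_heads = len(profiled)
    choice = {h: 0 for h in profiled}            # start every head at its sparsest rule
    budget = density_budget * n_heads            # total density allowed across heads

    def total_density():
        return sum(profiled[h][choice[h]][1] for h in profiled)

    improved = True
    while improved:
        improved = False
        best = None
        for h, cands in profiled.items():
            i = choice[h]
            if i + 1 >= len(cands):
                continue
            d_loss = cands[i][0] - cands[i + 1][0]   # loss saved by densifying this head
            d_dens = cands[i + 1][1] - cands[i][1]   # extra density it costs
            if d_dens <= 0 or total_density() + d_dens > budget:
                continue
            gain = d_loss / d_dens
            if best is None or gain > best[0]:
                best = (gain, h)
        if best is not None and best[0] > 0:
            choice[best[1]] += 1
            improved = True
    return choice


# Hypothetical profiling results for two heads, three candidate rules each.
profiled = {
    "head_0": [(0.90, 0.10), (0.40, 0.30), (0.35, 0.60)],
    "head_1": [(0.50, 0.10), (0.45, 0.30), (0.44, 0.60)],
}
print(select_rules(profiled, density_budget=0.3))  # density goes to the head that benefits most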
Results and Implications
Experiments show that MoA extends the effective context length of models by up to 3.9 times and improves retrieval accuracy by 1.5 to 7.1 times over uniform sparse baselines. Degradation relative to the dense model remains modest: MoA reduces the maximum relative performance drop while delivering up to 1.4 times GPU memory savings and 6.6-8.2 times higher decode throughput than FlashAttention2.
Theoretical and Practical Implications
The implications of MoA are noteworthy for both performance and resource efficiency. The work shows that a heterogeneous approach lets LLMs allocate attention where it matters most, mitigating the efficiency trade-offs of long-context processing. From a practical standpoint, the methodology offers a transparent and reproducible way to reconfigure an existing LLM for efficiency without retraining it.
Prospective Developments
Future work could explore non-linear scaling rules and dynamic attention schemes to further improve context handling across diverse inputs. Extending the profiling approach to other model components, such as weight matrices, could also inform complementary compression techniques like quantization or pruning, broadening the scope of large-model efficiency optimization.
Overall, MoA represents a sophisticated step towards more efficient and versatile LLMs, offering a promising strategy for engaging with the complexity and variability inherent in natural language processing tasks.