
Training-Free Activation Sparsity in Large Language Models (2408.14690v2)

Published 26 Aug 2024 in cs.CL and cs.AI

Abstract: Activation sparsity can enable practical inference speedups in LLMs by reducing the compute and memory-movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across Llama-2, Llama-3, and Mistral families, with sizes varying from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity. TEAL is compatible with weight quantization, enabling further efficiency gains.

Summary

  • The paper introduces TEAL, a training-free technique that leverages magnitude-based activation sparsity to accelerate inference in LLMs.
  • It uses layer-wise thresholds and a greedy allocation algorithm to prune the 40–50% of activations with the lowest magnitudes while maintaining near-baseline performance.
  • TEAL achieves 1.53× to 1.8× speed-ups in decoding and integrates effectively with quantization methods for added efficiency.

Training-Free Activation Sparsity in LLMs

The paper "Training-Free Activation Sparsity in LLMs" introduces TEAL, a method aimed at leveraging magnitude-based activation sparsity to enhance the efficiency of LLMs without the need for additional training. This research is motivated by the computational challenges posed by LLMs due to their significant parameter count, especially during inference. The proposed approach demonstrates promising improvements in terms of activation sparsity, leading to marked speed-ups in wall-clock decoding times while maintaining minimal degradation in model performance.

Background and Motivation

Modern LLMs, like the LLaMA and Mistral families, often face a memory-bound inference phase due to the extensive parameter sizes, which necessitates efficient memory management and computational strategies. Prior research has predominantly focused on weight quantization and sparsification, providing substantial improvements in speed. Activation sparsity, where non-salient activations are zeroed out to avoid unnecessary computations, has emerged as an effective strategy but often relies on training or finetuning models with ReLU-based layers. Such methods have limited applicability to newer models employing advanced activation functions like SwiGLU, which are not naturally sparse.

The TEAL Approach

TEAL stands for "Training-Free Activation Sparsity in LLMs." The method applies magnitude-based activation sparsity across all hidden states in a model, identifying and zeroing out low-magnitude activations, thus reducing the computational burden. This is accomplished without any additional training requirements, making it an attractive option for enhancing the efficiency of existing models. Specifically, TEAL achieves model-wide sparsity levels of 40-50% with minimal performance degradation in models such as LLaMA-2, LLaMA-3, and Mistral, across various model sizes ranging from 7 billion to 70 billion parameters.
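
As a concrete illustration, the sketch below (PyTorch, with illustrative shapes and an arbitrary threshold, not the authors' code) shows what magnitude-based activation sparsification amounts to: entries of a hidden state whose magnitude falls below a cutoff are zeroed before the matrix multiplication, so their contributions are skipped.

```python
# Minimal sketch of magnitude-based activation sparsity (illustrative, not TEAL's code).
import torch

def sparsify_activations(x: torch.Tensor, threshold: float) -> torch.Tensor:
    """Zero out entries with |x| below the threshold, leaving the rest untouched."""
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

# Toy usage: a hidden state entering a linear projection.
hidden = torch.randn(1, 4096)         # (batch, hidden_dim)
weight = torch.randn(11008, 4096)     # e.g. an MLP up-projection
sparse_hidden = sparsify_activations(hidden, threshold=0.7)
output = sparse_hidden @ weight.T     # zeroed entries contribute nothing
print(f"activation sparsity: {(sparse_hidden == 0).float().mean():.2%}")
```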

Key to TEAL's design is a detailed analysis of the distributional properties of LLM activations, which shows that they are typically zero-mean and unimodal, well modeled by Gaussian or Laplacian distributions. This observation motivates layer-dependent magnitude thresholds that prune low-magnitude activations without any retraining.
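
A simple way to turn a target sparsity level into a per-layer threshold is to take the corresponding quantile of activation magnitudes on a small calibration set; the snippet below sketches this, using synthetic Laplacian activations in place of real calibration data and names chosen purely for illustration.

```python
# Hedged sketch of layer-wise threshold calibration (function and variable names are
# illustrative). For a target sparsity p, the cutoff is the p-th quantile of |x| over
# a small calibration set, so roughly a fraction p of entries falls below it.
import torch

def calibrate_threshold(calib_activations: torch.Tensor, target_sparsity: float) -> float:
    """Return the magnitude cutoff below which `target_sparsity` of entries fall."""
    return torch.quantile(calib_activations.abs().flatten(), target_sparsity).item()

# Toy calibration set: Laplacian-like activations for one layer's input.
calib = torch.distributions.Laplace(0.0, 1.0).sample((1024, 4096))
threshold = calibrate_threshold(calib, target_sparsity=0.5)
print(f"threshold for ~50% sparsity: {threshold:.3f}")
```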

Methodology

A significant part of TEAL's novelty lies in its greedy algorithm for allocating sparsity across each Transformer block. The approach initializes layer-level sparsities to zero and incrementally raises them, guided by the resulting activation error, until the target model-wide sparsity is reached. This block-wise optimization keeps sparsity levels balanced across the model and avoids overly aggressive pruning in any single component.
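
The following sketch captures the flavor of such a greedy allocation, assuming access to a per-layer activation-error oracle measured on calibration data; the step size, error model, and layer names are illustrative, and the actual TEAL procedure may differ in its details.

```python
# Hedged sketch of greedy, block-wise sparsity allocation. It assumes access to an
# `activation_error(layer, sparsity)` oracle (e.g. an L2 error on calibration data);
# the step size, error model, and layer names below are illustrative only.
from typing import Callable

def greedy_allocate(
    layers: list[str],
    target_avg_sparsity: float,
    activation_error: Callable[[str, float], float],
    step: float = 0.05,
) -> dict[str, float]:
    """Start every layer at 0% sparsity and repeatedly bump the layer whose
    increase hurts least, until the block-average sparsity reaches the target."""
    sparsity = {name: 0.0 for name in layers}
    while sum(sparsity.values()) / len(layers) < target_avg_sparsity:
        # Pick the layer whose next increment would add the least activation error.
        best = min(
            (name for name in layers if sparsity[name] + step <= 1.0),
            key=lambda name: activation_error(name, sparsity[name] + step),
        )
        sparsity[best] += step
    return sparsity

# Toy usage with a made-up error model (quadratic in sparsity, layer-dependent scale).
layers = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
toy_error = lambda name, s: (1.0 + layers.index(name) * 0.1) * s ** 2
print(greedy_allocate(layers, target_avg_sparsity=0.5, activation_error=toy_error))
```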

Furthermore, the method is complemented by specialized kernels for sparse matrix-vector (GEMV) products, with optimizations such as coalesced memory access, selective weight loading, and increased parallelism via SplitK work decomposition. These kernel-level improvements are what translate activation sparsity into practical speed-ups during inference.
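
To see why activation sparsity saves compute and memory movement at all, the sketch below emulates the core idea of a sparse GEMV in plain PyTorch: only the weight columns matching nonzero activations are gathered and multiplied. The real kernels implement this at the GPU level (coalesced loads, SplitK parallelism), which this illustration does not attempt to reproduce.

```python
# Hedged sketch of the idea a sparse GEMV kernel exploits: with a sparse activation
# vector, only the weight columns matching nonzero entries need to be read and used.
# This dense PyTorch version only illustrates which work gets skipped.
import torch

def sparse_gemv(weight: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Compute weight @ x while touching only the columns where x is nonzero."""
    idx = x.nonzero(as_tuple=True)[0]   # indices of surviving activations
    return weight[:, idx] @ x[idx]      # gather and multiply only those columns

# Toy check against the dense product.
W = torch.randn(4096, 11008)
x = torch.randn(11008)
x[x.abs() < 0.8] = 0.0                  # zero out low-magnitude entries
assert torch.allclose(sparse_gemv(W, x), W @ x, atol=1e-3)
print(f"kept {(x != 0).float().mean():.0%} of activations")
```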

Results

The effectiveness of TEAL is underscored by both evaluation metrics and end-to-end inference speed-ups:

  • Accuracy: Across various LLMs and sparsity configurations, TEAL demonstrates minimal degradation at 25% sparsity (with performance near baseline on most tasks). Even at 40-50% sparsity, the model maintains acceptable performance levels, contrasting sharply with other methods that show significant degradation at similar sparsity levels.
  • Speed-up: TEAL achieves substantial wall-clock speed-ups, particularly in single-batch decoding, demonstrating up to 1.53× and 1.8× speed-ups at 40% and 50% sparsity, respectively, on different GPU architectures (A6000 and A100).
  • Compatibility with Quantization: TEAL is shown to work synergistically with various quantization methods, indicating additive gains in efficiency and demonstrating its robustness to further compression techniques.

Analysis and Implications

A thorough analysis of TEAL's effectiveness reveals key insights into the behavior of activation sparsity in neural networks. The method compares favorably against other sparsification techniques, such as CATS and ReLUfication, by avoiding extreme sparsity in any single component and keeping sparsity balanced across the model. The paper also examines prefill sparsification, stressing the importance of keeping the earliest token positions at full precision because of the attention-sink phenomenon.
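
As an illustration of that point, a prefill-time routine could simply leave the first few token positions dense before applying the usual magnitude thresholding; the sketch below does exactly that, with a hypothetical cutoff of four dense positions that is not a value taken from the paper.

```python
# Illustrative sketch of prefill sparsification that leaves the first few token
# positions dense; `n_dense` is a hypothetical knob, not a value from the paper.
import torch

def sparsify_prefill(hidden: torch.Tensor, threshold: float, n_dense: int = 4) -> torch.Tensor:
    """hidden: (seq_len, hidden_dim). Sparsify every position except the first `n_dense`."""
    out = hidden.clone()
    tail = out[n_dense:]
    out[n_dense:] = torch.where(tail.abs() >= threshold, tail, torch.zeros_like(tail))
    return out

prefill = torch.randn(128, 4096)
sparse_prefill = sparsify_prefill(prefill, threshold=0.7)
print((sparse_prefill[:4] == 0).any().item())   # False: sink-token positions stay dense
```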

Future Directions

The research opens several avenues for further exploration. Future work could refine the block-wise optimization algorithm, making it more adaptive to different model architectures and activation functions. Additionally, exploring the integration of TEAL with emerging hardware accelerators optimized for sparse operations could further amplify its efficiency gains. Another promising direction is to extend sparsity methods to a broader range of downstream applications, potentially adopting more sophisticated sparsification criteria that consider both activation magnitude and contextual relevance.

Conclusion

TEAL represents a significant advancement in the effort to make LLMs more efficient without sacrificing performance. By eliminating the need for training and finetuning, it offers a practical, scalable solution for deploying state-of-the-art models in resource-constrained environments. The insights gained from this work contribute to a deeper understanding of activation sparsity mechanisms and pave the way for more sophisticated, efficient inference techniques in the future.
