GRIFFIN: An Efficient Training-Free Mixture of Experts Approach for LLM Generation
Introduction
The advent of transformer-based LLMs has ushered in a new era across domains such as natural language understanding and generation due to their effectiveness. However, this effectiveness comes with substantial computational and memory requirements, driven primarily by their massive model sizes. Notably, feedforward (FF) blocks, which constitute up to two-thirds of the parameters in these models, contribute significantly to these bottlenecks. To address this, prior work has tried to exploit sparsity within FF blocks through techniques such as pruning and mixtures of experts (MoEs). However, these techniques either require intensive training, offer limited flexibility across architectures, or both.
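To see why FF blocks dominate the parameter budget, consider a rough per-layer tally for a Llama-2-7B-style block. The dimensions below are the published model sizes, and the snippet is only a back-of-the-envelope illustration, not a figure from the paper.

```python
# Rough per-layer parameter count for a Llama-2-7B-style transformer block.
# Dimensions assumed for illustration: hidden size d = 4096, FF inner size d_ff = 11008.
d, d_ff = 4096, 11008

attn_params = 4 * d * d    # Q, K, V, and output projections
ff_params = 3 * d * d_ff   # gate, up, and down projections of a gated (SwiGLU) FF block

total = attn_params + ff_params
print(f"FF share of block parameters: {ff_params / total:.0%}")  # prints roughly 67%
```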
Key Contribution
This paper introduces GRIFFIN (Gating by Repetition In Feedforward Intermediate Neurons), a novel, training-free MoE technique that exploits the inherent structured sparsity in the FF activation patterns of LLMs across a sequence, a phenomenon the authors term "flocking". Remarkably, GRIFFIN exploits this structure without degrading performance on a spectrum of tasks while substantially reducing computational overhead. Specifically, with just 50% of the FF parameters, it maintains the original model's performance on various classification and generation tasks and improves latency (e.g., a 1.25× speed-up for Llama 2 13B on an NVIDIA L40).
Background and Motivation
Current approaches that exploit sparsity to improve the efficiency of FF blocks face significant challenges. Pruning methods, for instance, reduce model size but do not necessarily translate into faster computation. MoEs, on the other hand, preserve the original performance more effectively but require learning a gating function for expert selection, which can be computationally expensive or impractical for pre-trained models, particularly those with non-ReLU activations.
Observing Flocking
The authors present a detailed exploration of the flocking phenomenon, which forms the foundation of GRIFFIN. Flocking refers to the consistency of sparsity patterns across FF activations within a sequence: the same neurons tend to be highly active for every token in the sequence. The paper shows that this patterned sparsity appears in the relative magnitudes of activations rather than their absolute values. Surprisingly, the pattern persists across models with varying architectures and activation functions, indicating its ubiquity in LLMs.
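As one way to make this observation concrete, the sketch below computes how much each token's set of most active FF neurons overlaps with every other token's in the same sequence. The top-k masking and Jaccard overlap are illustrative choices made for this summary, not necessarily the paper's exact measurement protocol.

```python
import torch

def flocking_overlap(acts: torch.Tensor, k_frac: float = 0.5) -> float:
    """Mean pairwise Jaccard overlap of per-token top-k active FF neurons.

    acts: (seq_len, d_ff) post-activation values of one FF block for one sequence.
    k_frac: fraction of neurons treated as "active" per token.
    High overlap across tokens is the flocking behavior described above.
    """
    seq_len, d_ff = acts.shape
    k = int(k_frac * d_ff)
    # Mark each token's top-k neurons by absolute activation magnitude.
    idx = acts.abs().topk(k, dim=-1).indices
    masks = torch.zeros_like(acts, dtype=torch.bool).scatter_(-1, idx, True)
    # Pairwise intersection sizes between tokens' active sets, then Jaccard.
    inter = masks.float() @ masks.float().T          # (seq_len, seq_len)
    union = 2 * k - inter
    jaccard = inter / union
    off_diag = jaccard[~torch.eye(seq_len, dtype=torch.bool)]
    return off_diag.mean().item()
```

A sequence exhibiting strong flocking yields an overlap well above the baseline one would get from independently drawn top-k sets.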
The GRIFFIN Algorithm
GRIFFIN capitalizes on flocking by selecting FF experts at the sequence level for efficient generation across LLMs. Experts are chosen based on the sequence's prompt, which precedes the generation phase, allowing efficient and accurate expert determination. Through this method, GRIFFIN addresses the prevailing challenges in leveraging FF block sparsity: the need for training, the complexity of gating functions, and the poor applicability of prior methods to models with different activation functions.
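The following is a minimal sketch of sequence-level expert selection in this spirit. It assumes a gated FF block (e.g., SwiGLU) with separate gate, up, and down projection weights, and the scoring statistic (per-token normalization followed by an L2 aggregate over the prompt) is an illustrative stand-in rather than the paper's exact formula.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_ff_experts(prompt_acts: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    """Score FF neurons from the prompt and return the indices to keep for generation.

    prompt_acts: (prompt_len, d_ff) post-activation values of one FF block on the prompt.
    Per-token normalization bases the scoring on *relative* magnitudes, in line with
    the flocking observation; the aggregate below is an assumed, illustrative choice.
    """
    rel = F.normalize(prompt_acts.abs(), p=2, dim=-1)  # relative magnitude per token
    scores = rel.norm(p=2, dim=0)                      # aggregate evidence over the prompt
    k = int(keep_frac * prompt_acts.shape[-1])
    return scores.topk(k).indices

@torch.no_grad()
def prune_gated_ff(w_gate, w_up, w_down, keep_idx):
    """Slice a gated FF block down to the selected neurons (nn.Linear weight layout).

    w_gate, w_up: (d_ff, d_model) weights; keep the rows of the selected neurons.
    w_down:       (d_model, d_ff) weights; keep the matching columns.
    """
    return w_gate[keep_idx, :], w_up[keep_idx, :], w_down[:, keep_idx]
```

In this sketch, the prompt itself is still processed with the dense FF block so that the scores reflect every neuron; only the generation phase runs with the pruned matrices, which is where the latency savings come from.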
Experimental Validation
The paper conducts comprehensive experiments to validate GRIFFIN's effectiveness, covering a variety of models, including Llama 2, Gemma, Mistral, and OPT, across multiple generation and classification tasks. The results highlight GRIFFIN's ability to retain nearly the same performance as the original models while removing up to 50% of the FF parameters during generation. Moreover, the method delivers a marked improvement in latency without any training or fine-tuning, a significant advance over previous approaches.
Implications and Future Prospects
The implications of GRIFFIN extend beyond computational efficiency. By demonstrating the presence of flocking across various models and the feasibility of exploiting it without performance loss, the work opens new avenues for designing inherently efficient LLM architectures. It also suggests that sparsity patterns within FF blocks could inform the robustness and interpretability of LLMs. A promising direction for future research is applying GRIFFIN to deploy LLMs on resource-constrained devices, broadening their accessibility and utility.
Conclusion
This paper presents a significant step forward in the pursuit of computationally efficient LLMs. Through GRIFFIN, it shows that the natural sparsity patterns within FF blocks, dubbed flocking, can be leveraged for substantial efficiency gains without intensive retraining or fine-tuning. This approach not only challenges existing methodologies for optimizing computational efficiency in LLMs but also lays the groundwork for future work on exploiting sparsity for AI efficiency.