GRIFFIN: An Efficient Training-Free Mixture of Experts Approach for LLM Generation
Introduction
The advent of transformer-based LLMs has ushered in a new era across domains such as natural language understanding and generation due to their effectiveness. However, this effectiveness comes with substantial computational and memory requirements, driven primarily by their massive model sizes. Notably, feedforward (FF) blocks, which constitute up to two-thirds of the parameters in these models, contribute significantly to these bottlenecks. To address this, prior work has tried to exploit sparsity within FF blocks through techniques such as pruning and mixtures of experts (MoEs). However, these techniques either require intensive training, offer limited flexibility across architectures, or both.
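To see why FF blocks dominate the parameter budget, consider a rough per-layer tally for a Llama-2-7B-style block. The dimensions below are the published model sizes, and the snippet is only a back-of-the-envelope illustration, not a figure from the paper.

```python
# Rough per-layer parameter count for a Llama-2-7B-style transformer block.
# Dimensions assumed for illustration: hidden size d = 4096, FF inner size d_ff = 11008.
d, d_ff = 4096, 11008

attn_params = 4 * d * d    # Q, K, V, and output projections
ff_params = 3 * d * d_ff   # gate, up, and down projections of a gated (SwiGLU) FF block

total = attn_params + ff_params
print(f"FF share of block parameters: {ff_params / total:.0%}")  # prints roughly 67%
```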
Key Contribution
This paper introduces GRIFFIN (Gating by Repetition In Feedforward Intermediate Neurons), a novel, training-free MoE technique that exploits the inherent structured sparsity in the FF activation patterns of LLMs across a sequence, a phenomenon the authors term "flocking". Remarkably, GRIFFIN exploits this structure without degrading performance on a spectrum of tasks while substantially reducing computational overhead. Specifically, with just 50% of the FF parameters, it maintains the original model's performance on various classification and generation tasks and improves latency (e.g., a 1.25× speed-up for Llama 2 13B on an NVIDIA L40).
Background and Motivation
Current approaches that exploit sparsity to improve the efficiency of FF blocks face significant challenges. Pruning methods, for instance, reduce model size but do not necessarily translate into faster computation. MoEs, on the other hand, preserve the original performance more effectively but require learning a gating function for expert selection, which can be computationally expensive or impractical for pre-trained models, particularly those with non-ReLU activations.
Observing Flocking
The authors present a detailed exploration of the flocking phenomenon, which forms the foundation of GRIFFIN. Flocking refers to the consistency of sparsity patterns across FF activations within a sequence: the same neurons tend to be highly active for every token in the sequence. The paper shows that this patterned sparsity appears in the relative magnitudes of activations rather than their absolute values. Surprisingly, the pattern persists across models with varying architectures and activation functions, indicating its ubiquity in LLMs.
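As one way to make this observation concrete, the sketch below computes how much each token's set of most active FF neurons overlaps with every other token's in the same sequence. The top-k masking and Jaccard overlap are illustrative choices made for this summary, not necessarily the paper's exact measurement protocol.

```python
import torch

def flocking_overlap(acts: torch.Tensor, k_frac: float = 0.5) -> float:
    """Mean pairwise Jaccard overlap of per-token top-k active FF neurons.

    acts: (seq_len, d_ff) post-activation values of one FF block for one sequence.
    k_frac: fraction of neurons treated as "active" per token.
    High overlap across tokens is the flocking behavior described above.
    """
    seq_len, d_ff = acts.shape
    k = int(k_frac * d_ff)
    # Mark each token's top-k neurons by absolute activation magnitude.
    idx = acts.abs().topk(k, dim=-1).indices
    masks = torch.zeros_like(acts, dtype=torch.bool).scatter_(-1, idx, True)
    # Pairwise intersection sizes between tokens' active sets, then Jaccard.
    inter = masks.float() @ masks.float().T          # (seq_len, seq_len)
    union = 2 * k - inter
    jaccard = inter / union
    off_diag = jaccard[~torch.eye(seq_len, dtype=torch.bool)]
    return off_diag.mean().item()
```

A sequence exhibiting strong flocking yields an overlap well above the baseline one would get from independently drawn top-k sets.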
The GRIFFIN Algorithm
GRIFFIN capitalizes on flocking by selecting FF experts at the sequence level for efficient generation across LLMs. Experts are chosen based on the sequence's prompt, which precedes the generation phase, allowing efficient and accurate expert determination. Through this method, GRIFFIN addresses the prevailing challenges in leveraging FF block sparsity: the need for training, the complexity of gating functions, and the poor applicability of prior methods to models with different activation functions.
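The following is a minimal sketch of sequence-level expert selection in this spirit. It assumes a gated FF block (e.g., SwiGLU) with separate gate, up, and down projection weights, and the scoring statistic (per-token normalization followed by an L2 aggregate over the prompt) is an illustrative stand-in rather than the paper's exact formula.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_ff_experts(prompt_acts: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    """Score FF neurons from the prompt and return the indices to keep for generation.

    prompt_acts: (prompt_len, d_ff) post-activation values of one FF block on the prompt.
    Per-token normalization bases the scoring on *relative* magnitudes, in line with
    the flocking observation; the aggregate below is an assumed, illustrative choice.
    """
    rel = F.normalize(prompt_acts.abs(), p=2, dim=-1)  # relative magnitude per token
    scores = rel.norm(p=2, dim=0)                      # aggregate evidence over the prompt
    k = int(keep_frac * prompt_acts.shape[-1])
    return scores.topk(k).indices

@torch.no_grad()
def prune_gated_ff(w_gate, w_up, w_down, keep_idx):
    """Slice a gated FF block down to the selected neurons (nn.Linear weight layout).

    w_gate, w_up: (d_ff, d_model) weights; keep the rows of the selected neurons.
    w_down:       (d_model, d_ff) weights; keep the matching columns.
    """
    return w_gate[keep_idx, :], w_up[keep_idx, :], w_down[:, keep_idx]
```

In this sketch, the prompt itself is still processed with the dense FF block so that the scores reflect every neuron; only the generation phase runs with the pruned matrices, which is where the latency savings come from.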
Experimental Validation
The paper conducts comprehensive experiments to validate GRIFFIN's effectiveness, covering a variety of models, including Llama 2, Gemma, Mistral, and OPT, across multiple generation and classification tasks. The results highlight GRIFFIN's ability to retain nearly the same performance as the original models while removing up to 50% of the FF parameters during generation. Moreover, the method delivers a marked improvement in latency without any training or fine-tuning, a significant advance over previous approaches.
Implications and Future Prospects
The implications of GRIFFIN extend beyond computational efficiency. By demonstrating the presence of flocking across various models and the feasibility of exploiting it without performance loss, the work opens new avenues for designing inherently efficient LLM architectures. It also suggests that sparsity patterns within FF blocks could inform the robustness and interpretability of LLMs. A promising direction for future research is applying GRIFFIN to deploy LLMs on resource-constrained devices, broadening their accessibility and utility.
Conclusion
This paper presents a significant step forward in the pursuit of computationally efficient LLMs. Through GRIFFIN, it shows that the natural sparsity patterns within FF blocks, dubbed flocking, can be leveraged for substantial efficiency gains without intensive retraining or fine-tuning. This approach not only challenges existing methodologies for optimizing computational efficiency in LLMs but also lays the groundwork for future work on exploiting sparsity for AI efficiency.