
Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation (2404.01365v3)

Published 1 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: With the development of transformer-based LLMs, they have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free and calibration-free method that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method's simplicity, we show with 50% of the FF parameters, GRIFFIN maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.29$\times$ and 1.25$\times$ speed-ups in Gemma 7B and Llama 2 13B, respectively, on an NVIDIA L40). Code is available at https://github.com/hdong920/GRIFFIN.

Authors: Harry Dong, Beidi Chen, Yuejie Chi

Summary

GRIFFIN: An Efficient Training-Free Mixture of Experts Approach for LLM Generation

Introduction

The advent of transformer-based LLMs has ushered in a new era across domains such as natural language understanding and generation, owing to their remarkable effectiveness. However, this effectiveness comes with substantial computational and storage requirements, driven primarily by massive model sizes. Feedforward (FF) blocks, which constitute up to two-thirds of a model's parameters, are a major contributor to these bottlenecks. To address this, prior work has attempted to exploit sparsity within FF blocks through techniques such as pruning and mixtures of experts (MoEs). Nonetheless, these techniques either require intensive training, exhibit limited flexibility across architectures, or both.
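To see why FF blocks dominate the parameter count, a back-of-the-envelope calculation helps. The snippet below is a rough illustration using approximate Llama 2 13B dimensions, assumed here for illustration rather than taken from the paper:

# Rough per-layer parameter count for a gated-FF transformer layer,
# using approximate Llama 2 13B dimensions as an assumed example.
d_model = 5120          # hidden size
d_ff = 13824            # FF intermediate size
attn_params = 4 * d_model * d_model   # Q, K, V, O projections
ff_params = 3 * d_model * d_ff        # gate, up, and down projections

share = ff_params / (attn_params + ff_params)
print(f"FF share of layer parameters: {share:.2f}")  # roughly 0.67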

Key Contribution

This paper introduces GRIFFIN (Gating by Repetition In Feedforward Intermediate Neurons), a novel, training-free and calibration-free MoE-style technique that exploits inherent structured sparsity in the FF activation patterns of LLMs within a sequence, a phenomenon the authors term "flocking". GRIFFIN achieves this with little to no performance degradation across a spectrum of tasks while substantially reducing computational overhead. Specifically, with just 50% of the FF parameters, it maintains the original model's performance on various classification and generation tasks while improving latency (e.g., 1.29× and 1.25× speed-ups for Gemma 7B and Llama 2 13B, respectively, on an NVIDIA L40).

Background and Motivation

Current practices for exploiting sparsity to make FF blocks more efficient face significant challenges. Pruning methods, despite reducing model size, do not necessarily translate into faster inference. MoEs, while preserving original performance more effectively, require the model to learn a gating function for expert selection, which can be computationally expensive or impractical for pre-trained models, particularly those with non-ReLU activations.

Observing Flocking

The authors present a detailed exploration of the flocking phenomenon, which underpins GRIFFIN. Flocking refers to the consistency of sparsity patterns across FF activations within a sequence: the relative magnitudes of activations, rather than their absolute values, exhibit this patterned sparsity. Notably, the pattern persists across models with varying architectures and activation functions, including non-ReLU ones, indicating its ubiquity in LLMs.
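As a concrete illustration, one simple way to probe flocking is to measure how strongly the per-token top-k FF neurons, ranked by relative magnitude, agree with a single sequence-level top-k set. The sketch below is an assumed diagnostic, not the paper's exact measurement; acts is a [seq_len, d_ff] tensor of FF activations captured from one layer for one sequence.

import torch

def flocking_overlap(acts: torch.Tensor, k_frac: float = 0.5) -> float:
    # Normalize per token so we compare relative, not absolute, magnitudes.
    rel = acts.abs() / (acts.abs().norm(dim=-1, keepdim=True) + 1e-8)
    k = int(k_frac * acts.shape[1])
    seq_top = set(rel.sum(dim=0).topk(k).indices.tolist())  # sequence-level top neurons
    token_top = rel.topk(k, dim=-1).indices                 # [seq_len, k] per-token top neurons
    hits = sum(len(seq_top & set(row.tolist())) for row in token_top)
    return hits / (k * acts.shape[0])  # mean overlap in [0, 1]; values near 1 indicate flocking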

The GRIFFIN Algorithm

GRIFFIN capitalizes on flocking by selecting FF experts at the sequence level for efficient generation. Experts are chosen from the prompt that precedes the generation phase, allowing efficient and accurate expert determination without a learned gating function. In this way, GRIFFIN sidesteps the main obstacles to exploiting FF block sparsity: the need for training, the complexity of gating functions, and the restriction to models with particular activation functions.
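A minimal sketch of this idea follows, under assumptions about the scoring statistic and a SwiGLU-style weight layout (gate/up weights of shape [d_ff, d_model], down weights of shape [d_model, d_ff]); it is illustrative rather than the paper's exact procedure.

import torch

def select_ff_experts(prompt_acts: torch.Tensor, keep_frac: float = 0.5) -> torch.Tensor:
    # prompt_acts: [prompt_len, d_ff] FF activations gathered while processing the prompt.
    # Aggregate relative activation magnitudes over the prompt and keep the top neurons.
    rel = prompt_acts.abs() / (prompt_acts.abs().norm(dim=-1, keepdim=True) + 1e-8)
    k = int(keep_frac * prompt_acts.shape[1])
    return rel.sum(dim=0).topk(k).indices  # indices of retained FF neurons ("experts")

def prune_gated_ff(w_gate, w_up, w_down, idx):
    # Slice a gated FF block (e.g., down(act(gate(x)) * up(x))) to the selected
    # neurons; the reduced matrices are used only during the generation phase.
    return w_gate[idx, :], w_up[idx, :], w_down[:, idx]

Because the selection is made once per sequence from the prompt, generation proceeds with smaller dense matrix multiplies, which is where the latency gains come from, while the full FF weights remain untouched and can be reused for the next prompt.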

Experimental Validation

The paper conducts comprehensive experiments to validate GRIFFIN's effectiveness on a variety of models, including Llama 2, Gemma, Mistral, and OPT, across multiple generation and classification tasks. The results show that GRIFFIN retains nearly the same performance as the original models while using only 50% of the FF parameters. Moreover, the method improves latency without any training or fine-tuning, a significant advance over previous approaches.

Implications and Future Prospects

The implications of GRIFFIN extend beyond just computational efficiency. By demonstrating the presence of flocking across various models and the feasibility of exploiting this phenomenon without performance loss, it opens up new avenues for designing inherently efficient LLM architectures. Furthermore, this work suggests potential in exploring sparsity patterns within FF blocks for robustness and interpretability of LLMs. Moving forward, a promising area of research could be investigating the applications of GRIFFIN in enabling the deployment of LLMs on resource-constrained devices, thereby broadening their accessibility and utility.

Conclusion

This paper presents a significant step toward computationally efficient LLMs. Through GRIFFIN, it shows that natural sparsity patterns within FF blocks, dubbed flocking, can be leveraged for substantial efficiency gains without intensive retraining or fine-tuning. This approach both challenges existing methodologies for optimizing LLM inference and lays the groundwork for future work on exploiting sparsity for AI efficiency.