HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference (2402.09360v1)

Published 14 Feb 2024 in cs.LG and cs.AI

Abstract: Autoregressive decoding with generative LLMs on accelerators (GPUs/TPUs) is often memory-bound, where most of the time is spent on transferring model parameters from high bandwidth memory (HBM) to cache. On the other hand, recent works show that LLMs can maintain quality with significant sparsity/redundancy in the feedforward (FFN) layers by appropriately training the model to operate on a top-$k$ fraction of rows/columns (where $k \approx 0.05$), thereby suggesting a way to reduce the transfer of model parameters, and hence latency. However, exploiting this sparsity for improving latency is hindered by the fact that identifying top rows/columns is data-dependent and is usually performed using full matrix operations, severely limiting potential gains. To address these issues, we introduce HiRE (High Recall Approximate Top-$k$ Estimation). HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.

Summary

  • The paper introduces HiRE, a method that approximates top-$k$ selection with high recall to reduce LLM inference latency.
  • It employs a compression scheme with low-rank approximations to focus computation on the key FFN and softmax components while preserving output quality.
  • The distributed DA-TOP-$k$ operator scales performance across multiple devices, achieving up to a 2.27x speedup in real-world deployments.

High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference

The paper "HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference" addresses a critical challenge in deploying autoregressive LLMs: the substantial latency of the inference phase. This latency predominantly arises from transferring large model parameters from high-bandwidth memory (HBM) to cache, a process that is often memory-bound on standard accelerators such as GPUs and TPUs.
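
To see why decoding is memory-bound, a rough bandwidth calculation helps. The numbers below are illustrative assumptions (bf16 weights, approximate TPUv5e HBM bandwidth), not figures from the paper:

```python
# Back-of-envelope sketch: if every decode step streams all weights from HBM,
# memory bandwidth alone lower-bounds per-token latency (illustrative numbers).
params = 1e9            # ~1B-parameter model, matching the paper's experimental scale
bytes_per_param = 2     # bf16 weights (assumption)
hbm_bandwidth = 819e9   # bytes/sec, approximate TPUv5e HBM bandwidth (assumption)

per_token_ms = params * bytes_per_param / hbm_bandwidth * 1e3
print(f"bandwidth-bound lower bound: {per_token_ms:.2f} ms per token")
# Loading only a small top-k fraction of FFN rows/columns shrinks this dominant term.
```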

Contributions of the Paper

The authors propose a novel approach called HiRE, which stands for High Recall Approximate Top-$k$ Estimation. HiRE consists of two main components:

  1. Compression Scheme: This enables the prediction of top-$k$ rows or columns with a high recall rate, thereby limiting full computation to these predicted subsets.
  2. DA-TOP-$k$ Operator: A distributed, approximate top-$k$ operator for multi-device environments that enables efficient selection across multiple accelerators.

Key Ideas and Methodology

The paper builds on the inherent sparsity and redundancy in the feedforward (FFN) and softmax layers of LLMs: models can be trained to operate efficiently on only a small top fraction of these components. HiRE leverages this by performing approximate top-$k$ estimation to restrict computation to the components that matter. Specifically, it uses low-rank approximations and quantization to cheaply predict the top-$k$ indices, followed by exact computation within this subset, preserving output quality.
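
As a concrete illustration, here is a minimal JAX sketch of such a compression scheme for the FFN layer. It is a reconstruction from the description above, not the authors' implementation; the low-rank factors U and V, the function name, and the use of ReLU are assumptions:

```python
import jax
import jax.numpy as jnp

def hire_ffn_sketch(x, W_in, W_out, U, V, k):
    """Approximate top-k FFN: cheap prediction of active units, exact compute on them.

    x: (d_model,) activations; W_in: (d_model, d_ff); W_out: (d_ff, d_model);
    U: (d_model, r) and V: (r, d_ff): precomputed low-rank proxy for W_in (r << d_model).
    """
    # (i) Cheap scoring pass over all d_ff hidden units via the low-rank proxy,
    #     costing O(d_model * r + r * d_ff) instead of O(d_model * d_ff).
    approx_scores = (x @ U) @ V                  # (d_ff,)
    # Keep the k highest-scoring units; in practice k carries some slack so that
    # the truly active units are recovered with high recall.
    _, idx = jax.lax.top_k(approx_scores, k)
    # (ii) Exact FFN computation restricted to the predicted subset.
    h = jax.nn.relu(x @ W_in[:, idx])            # (k,)
    return h @ W_out[idx, :]                     # (d_model,)
```

The same pattern applies to the softmax layer, where only the predicted subset of output-embedding rows needs to be multiplied against the final hidden state.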

For large models sharded across multiple devices, HiRE introduces DA-TOP-$k$, which reduces communication overhead by performing approximate top-$k$ selection locally on each device and then aggregating only the selected entries, rather than gathering the full output on a single device.
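
A toy sketch of this communication pattern follows; it is again an assumption-laden reconstruction rather than the paper's code, and the equal per-device budget is a simplification:

```python
import jax
import jax.numpy as jnp

def da_top_k_sketch(logit_shards, k):
    """logit_shards: per-device 1-D slices of the output logits; k: global budget."""
    per_device_k = k // len(logit_shards)           # simplification: k divides evenly
    values, indices, offset = [], [], 0
    for shard in logit_shards:                      # in a real system, each iteration runs on its own device
        v, i = jax.lax.top_k(shard, per_device_k)   # local top-k over the local shard only
        values.append(v)
        indices.append(i + offset)                  # map local positions to global ids
        offset += shard.shape[0]
    # Only len(logit_shards) * per_device_k entries cross the interconnect,
    # instead of the full logit vector.
    return jnp.concatenate(values), jnp.concatenate(indices)

# Example: two "devices", each holding half of a 16-entry logit vector.
shards = [jnp.arange(8.0), jnp.arange(8.0, 16.0)]
print(da_top_k_sketch(shards, 4))   # per shard, the global ids of its 2 largest entries
```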

Empirical Results

  1. Latency Improvement: HiRE achieves a 1.47x speedup in inference latency for a one-billion-parameter model on a single TPUv5e device, without degrading pretraining or downstream task performance.
  2. Accuracy Retention: Despite the approximations, HiRE maintains quality that nearly matches full computation, indicating that high-recall approximation preserves the accuracy of the softmax and FFN outputs.
  3. Scalability with DA-TOP-$k$: Across a cluster of TPU devices, DA-TOP-$k$ further increases the speedup to 2.27x, demonstrating the scalability and efficiency of the distributed approximation in real-world deployment environments.

Implications and Future Directions

The research highlights the potential of exploiting sparsity within LLM architectures to significantly enhance inference efficiency. This efficiency gain has practical implications in reducing the computation cost and energy consumption associated with large-scale model deployment, thereby supporting more sustainable AI practices. Theoretically, this work presents a compelling case for focusing future model architecture designs on inherently sparse computations.

Future research could extend these findings to the attention layers of LLMs or explore more sophisticated training mechanisms that naturally induce sparsity across the model's components. Additionally, investigating the interplay between model compression techniques and HiRE could yield further reductions in parameter size and computational overhead.

In summary, the paper provides a detailed methodology and empirical evaluation of a technique poised to redefine efficient inference in the deployment of LLMs, paving the way for broader accessibility and integration of AI technologies in resource-constrained settings.
