
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (2312.11514v3)

Published 12 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

Introduction to the Study

Advances in natural language processing have been propelled by the development of LLMs, highly capable models for understanding and generating human-like text; notable examples include GPT-3, OPT, and PaLM. These models are extremely parameter-dense, and their size strains the storage and computational capabilities of many devices, particularly those with constrained DRAM capacity. This paper introduces a method that enables LLM inference on such devices by leveraging flash memory, which typically offers far greater capacity than DRAM, without loading the whole model into DRAM at once.

Flash Memory & LLM Inference

The core challenge stems from the mismatch between flash memory and DRAM: flash offers far greater capacity but much lower bandwidth and higher latency, while DRAM is fast but scarce. Traditionally, running an LLM requires loading the entire model into fast DRAM, which is infeasible for very large models on hardware with limited DRAM capacity. The authors circumvent this limitation by reading only the necessary model parameters from flash memory during inference. Guided by an inference cost model that accounts for flash characteristics, their approach rests on two principles: reducing the volume of data transferred from flash and reading data in larger, more contiguous blocks, which is how flash memory performs best.
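
To make the second principle concrete, here is a minimal back-of-the-envelope sketch of such a cost model, not the paper's calibrated version: the per-read overhead, the peak bandwidth, and the constant governing how effective bandwidth grows with chunk size are all illustrative assumptions.

```python
# A minimal sketch of a flash-read cost model (illustrative constants, not the
# paper's measurements): total time = per-read overhead * number of reads +
# bytes / effective bandwidth, where effective bandwidth rises with chunk size.

def flash_read_time(total_bytes: float, chunk_bytes: float,
                    read_latency_s: float = 100e-6,            # assumed per-read overhead
                    peak_bw_bytes_per_s: float = 3e9) -> float:  # assumed peak bandwidth
    """Estimate seconds to read total_bytes from flash in chunks of chunk_bytes."""
    num_reads = total_bytes / chunk_bytes
    # Effective bandwidth saturates toward the peak as chunks get larger.
    effective_bw = peak_bw_bytes_per_s * chunk_bytes / (chunk_bytes + 256 * 1024)
    return num_reads * read_latency_s + total_bytes / effective_bw

# Reading 1 GiB of weights in 4 KiB chunks vs. 512 KiB chunks:
print(flash_read_time(2**30, 4 * 2**10))    # many small reads: tens of seconds
print(flash_read_time(2**30, 512 * 2**10))  # few large reads: well under a second
```

Under these assumptions, the same gigabyte of weights takes orders of magnitude longer to fetch as many small random reads than as a few large contiguous ones, which is exactly the behavior the paper's two techniques are designed to exploit.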

Load From Flash

The authors describe a "windowing" technique that loads only the parameters associated with the most recent tokens, reusing previously activated neurons and thereby reducing the number of I/O requests to flash memory. They also introduce "row-column bundling," which stores associated matrix rows and columns together so they can be fetched in larger contiguous reads, matching the sequential-access strengths of flash. Combined with the sparsity of activations within the model's feed-forward layers, these strategies substantially reduce the data that must be loaded from flash: only the parameters that are non-zero, or predicted to be non-zero, are brought into DRAM, minimizing memory traffic.
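
As a rough illustration of how windowing and row-column bundling might fit together, here is a simplified Python sketch; it is not the authors' implementation. The class name, the window length of 5 tokens, the in-memory dictionary standing in for flash, and the assumption that the set of active neurons is supplied externally (the paper predicts it with a small predictor) are all illustrative.

```python
import numpy as np

WINDOW = 5  # reuse neurons activated within the last 5 tokens (illustrative choice)

class FlashFFNLayer:
    """Toy feed-forward layer whose per-neuron weights live in 'flash'."""

    def __init__(self, d_model: int, flash_store: dict):
        self.d_model = d_model
        # flash_store stands in for flash memory; each neuron id maps to its
        # bundled (up-projection column, down-projection row), kept together
        # so a single contiguous read would fetch both.
        self.flash = flash_store
        self.resident = {}   # neuron id -> bundle currently held in DRAM
        self.history = []    # active-neuron sets for the most recent tokens

    def load_for_token(self, active: set):
        """Windowing: fetch only newly needed neurons, evict stale ones."""
        self.history.append(active)
        if len(self.history) > WINDOW:
            self.history.pop(0)
        needed = set().union(*self.history)
        for nid in list(self.resident):
            if nid not in needed:                 # fell out of the sliding window
                del self.resident[nid]
        for nid in active - self.resident.keys():
            self.resident[nid] = self.flash[nid]  # one bundled read per new neuron

    def forward(self, x: np.ndarray, active: set) -> np.ndarray:
        self.load_for_token(active)
        out = np.zeros(self.d_model)
        for nid in active:                        # only active (non-zero) neurons
            up_col, down_row = self.resident[nid]
            out += max(0.0, float(x @ up_col)) * down_row   # ReLU, then down-project
        return out

# Toy usage with random weights and hand-picked active sets.
rng = np.random.default_rng(0)
store = {i: (rng.standard_normal(16), rng.standard_normal(16)) for i in range(64)}
layer = FlashFFNLayer(d_model=16, flash_store=store)
y = layer.forward(rng.standard_normal(16), active={3, 17, 42})
```

In this sketch, only neurons that fall out of the window trigger eviction and only newly active neurons trigger reads, so the incremental flash traffic per token is a small fraction of the layer's weights.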

Significant Findings

Using these techniques, the paper shows that it is possible to run LLMs up to twice the size of the available DRAM with substantial speed gains: inference is 4 to 5 times faster on CPU and 20 to 25 times faster on GPU relative to naive loading strategies. These results enable more efficient use of LLMs on a broader range of devices, widening the scope of potential applications, and the paper stands as an example of why hardware constraints deserve first-class consideration in the design of resource-intensive machine learning algorithms.

Conclusion and Future Implications

This work opens the door to running LLMs effectively on devices previously deemed unsuitable due to memory constraints. That broadens access to state-of-the-art AI capabilities and invites further research into optimizing such models for widespread adoption across platforms. The intersection of hardware-aware algorithm design and machine learning showcased here is likely to remain a crucial area of focus as models continue to grow in scale and capability.

Authors (8)
  1. Keivan Alizadeh
  2. Iman Mirzadeh
  3. Dmitry Belenko
  4. Karen Khatamifard
  5. Minsik Cho
  6. Carlo C Del Mundo
  7. Mohammad Rastegari
  8. Mehrdad Farajtabar