AI and Memory Wall (2403.14123v1)

Published 21 Mar 2024 in cs.LG, cs.AR, and cs.DC

Abstract: The availability of unprecedented amounts of unsupervised training data, along with neural scaling laws, has driven a dramatic surge in model size and compute requirements for training and serving LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has scaled at 3.0x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have scaled at only 1.6x and 1.4x every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign of model architecture, training, and deployment strategies to overcome this memory limitation.

Summary

  • The paper identifies memory bandwidth as the dominant emerging bottleneck, showing that peak compute has grown far faster than DRAM and interconnect bandwidth.
  • The paper shows that Transformer encoder models enjoy much higher arithmetic intensity than decoder models, which softens the impact of the memory wall.
  • The paper proposes strategies such as optimized training algorithms, efficient deployment methods, and new hardware designs to overcome memory constraints.

Addressing the Memory Wall in AI: A Comprehensive Study on Transformer Models

Introduction

Recent trends in the development and deployment of LLMs and AI applications have spotlighted a critical bottleneck in their performance: the memory wall. This refers to the growing gap between the computational power of hardware and the bandwidth of memory systems, including DRAM (Dynamic Random-Access Memory) and interconnect bandwidth. A detailed analysis reveals that while peak hardware FLOPS (Floating-Point Operations Per Second) have increased substantially, memory bandwidth has not kept pace, posing significant challenges for efficiently training and serving AI models.

The Memory Wall Problem

The memory wall represents a fundamental constraint on AI model performance, encompassing memory capacity, bandwidth, and latency. The problem is multi-faceted, affecting data transfer across the memory hierarchy and between processors. The trajectory of server-grade AI hardware over the past two decades underscores this constraint: while peak hardware FLOPS have risen by a factor of roughly 60,000, DRAM and interconnect bandwidth have grown by only about 100x and 30x, respectively, over the same period. This discrepancy makes memory, particularly intra/inter-chip data transfer, the primary performance bottleneck for AI applications.
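To make the gap concrete, the per-2-year rates quoted in the abstract can be compounded over the 20-year window (ten 2-year periods). The short Python calculation below uses only those published rates and recovers the ~60,000x figure along with the derived bandwidth growth factors:

```python
# Compound the per-2-year scaling rates from the abstract over a
# 20-year window, i.e. ten 2-year periods.
periods = 20 / 2

flops_growth = 3.0 ** periods         # peak server FLOPS: 3.0x every 2 years
dram_growth = 1.6 ** periods          # DRAM bandwidth: 1.6x every 2 years
interconnect_growth = 1.4 ** periods  # interconnect bandwidth: 1.4x every 2 years

print(f"FLOPS:          {flops_growth:,.0f}x")         # ~59,049x (~60,000x)
print(f"DRAM BW:        {dram_growth:,.0f}x")          # ~110x
print(f"Interconnect:   {interconnect_growth:,.0f}x")  # ~29x
print(f"FLOPS/DRAM gap: {flops_growth / dram_growth:,.0f}x")  # ~537x
```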

Case Study on Transformer Models

A case study of Transformer models, covering both encoder (e.g., BERT) and decoder (e.g., GPT) architectures, offers insight into how the memory wall shapes AI performance. The key metric is arithmetic intensity: the number of FLOPs performed per byte loaded from memory. Encoder models, dominated by matrix-matrix operations, achieve high arithmetic intensity and are less affected by the memory wall; decoder models, which generate one token at a time and therefore rely on matrix-vector operations, exhibit far lower arithmetic intensity. This analysis highlights the need for optimized model architectures and deployment strategies to navigate the memory wall, as the sketch below illustrates.
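The following is a minimal sketch of the arithmetic-intensity argument, assuming fp16 (2-byte) operands, no cache reuse, and illustrative layer dimensions not taken from the paper. An encoder-style matrix-matrix product amortizes each loaded weight over many tokens; a decoder's per-token matrix-vector product does not:

```python
# Arithmetic intensity (FLOPs per byte) of the matmul shapes that dominate
# Transformer inference, assuming fp16 (2 bytes/element) and no cache reuse.

BYTES = 2  # fp16

def matmul_intensity(m: int, n: int, k: int) -> float:
    """Intensity of an (m x k) @ (k x n) matmul: 2mnk FLOPs over the
    bytes moved for both inputs and the output."""
    flops = 2 * m * n * k
    bytes_moved = BYTES * (m * k + k * n + m * n)
    return flops / bytes_moved

d = 4096  # illustrative hidden dimension

# Encoder-style: a whole 512-token sequence hits the weight matrix at once.
print(f"encoder (512 x {d} @ {d} x {d}): {matmul_intensity(512, d, d):.0f} FLOPs/byte")

# Decoder-style: autoregressive generation processes one token per step,
# so each step degenerates to a matrix-vector product.
print(f"decoder (  1 x {d} @ {d} x {d}): {matmul_intensity(1, d, d):.2f} FLOPs/byte")
```

Under these assumptions the encoder-style matmul reaches roughly 410 FLOPs per byte while the decoder-style matvec stays near 1, which is why autoregressive decoding is bandwidth-bound on modern accelerators.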

Strategies to Overcome the Memory Wall

Efficient Training Algorithms

Improving training efficiency involves reducing memory requirements and minimizing the need for extensive hyperparameter tuning. Approaches such as second-order stochastic optimization methods and memory optimizations such as activation rematerialization hold promise for addressing these challenges. Making training algorithms robust to low-precision arithmetic can further improve hardware utilization.
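As one concrete instance of activation rematerialization, PyTorch ships a checkpointing utility that discards intermediate activations during the forward pass and recomputes them during backward, trading FLOPs for memory. The sketch below is a minimal illustration; the toy MLP and its dimensions are assumptions, not the paper's setup:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Stack of blocks whose internal activations are rematerialized
    during backward instead of being kept alive through the forward pass."""

    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are not stored; they are
            # recomputed when gradients flow back through it.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(32, 1024, requires_grad=True))
out.sum().backward()  # triggers recomputation of each block's forward
```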

Efficient Model Deployment

Strategies for efficient model deployment focus on reducing model size and data movement. Techniques such as quantization, pruning, and the development of smaller, more capable LLMs directly cut the bytes that must be streamed from memory per inference step, which is decisive for large-scale serving.
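As a concrete illustration of why quantization helps on the serving side, the sketch below implements symmetric per-channel absmax int8 weight quantization in NumPy. This is one of the simplest post-training schemes, chosen for brevity; it is not any of the specific methods the paper discusses, and the weight matrix is synthetic:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel absmax quantization to int8.
    Returns the int8 weights and per-channel dequantization scales."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)  # synthetic weights

q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"int8 storage: {q.nbytes / w.nbytes:.0%} of fp32, mean abs error {err:.4f}")
# 4x fewer bytes streamed per weight directly eases the bandwidth
# bottleneck for memory-bound decoder inference.
```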

Rethinking AI Hardware Design

Addressing the memory wall requires not only software and algorithmic innovations but also a reevaluation of AI hardware design. By balancing compute capabilities with memory bandwidth and adopting more sophisticated memory hierarchies, it is possible to design hardware better suited to the demands of current and future AI applications.
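This compute/bandwidth balance is captured by the classic roofline model (Williams et al.): attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The sketch below uses illustrative, assumed hardware numbers rather than any specific accelerator's specifications:

```python
def roofline(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Attainable throughput (FLOP/s) under the roofline model: capped by
    either peak compute or memory bandwidth times arithmetic intensity."""
    return min(peak_flops, mem_bw * intensity)

# Illustrative accelerator: 300 TFLOP/s peak, 2 TB/s of memory bandwidth.
PEAK, BW = 300e12, 2e12
ridge = PEAK / BW  # intensity needed to become compute-bound: 150 FLOPs/byte

for name, intensity in [("decoder matvec", 1.0), ("encoder matmul", 410.0)]:
    attained = roofline(intensity, PEAK, BW)
    print(f"{name:>15}: {attained / PEAK:6.2%} of peak ({attained / 1e12:.1f} TFLOP/s)")
```

On these assumed numbers, a decoder-style workload at ~1 FLOP/byte attains under 1% of peak, while an encoder-style workload at ~410 FLOPs/byte is compute-bound; raising memory bandwidth relative to peak FLOPS shifts the ridge point left and lifts the memory-bound regime.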

Conclusion

The accelerating divergence between computational power and memory bandwidth, coupled with the exponential growth in AI model sizes, necessitates a holistic approach to the memory wall: innovation across model design, training and deployment strategies, and hardware development. As the field continues to evolve, overcoming the memory wall will be critical for unlocking new levels of performance and efficiency in AI applications.
