A Survey on Hardware Accelerators for Large Language Models (2401.09890v1)

Published 18 Jan 2024 in cs.AR, cs.CL, and cs.LG

Abstract: LLMs have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. As the demand for more sophisticated LLMs continues to grow, there is a pressing need to address the computational challenges associated with their scale and complexity. This paper presents a comprehensive survey on hardware accelerators designed to enhance the performance and energy efficiency of LLMs. By examining a diverse range of accelerators, including GPUs, FPGAs, and custom-designed architectures, we explore the landscape of hardware solutions tailored to meet the unique computational demands of LLMs. The survey encompasses an in-depth analysis of architecture, performance metrics, and energy efficiency considerations, providing valuable insights for researchers, engineers, and decision-makers aiming to optimize the deployment of LLMs in real-world applications.
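
The survey compares accelerators chiefly along two axes: throughput and energy efficiency. As a minimal illustrative sketch (not taken from the paper; all names and numbers below are hypothetical placeholders), the snippet shows how such metrics, tokens per second and tokens per joule, can be derived from measured runtime and average power for different hardware platforms.

```python
# Illustrative sketch only: comparing hypothetical accelerator results on the
# two axes the survey analyzes, throughput (tokens/s) and energy efficiency
# (tokens/J). The figures below are made-up placeholders, not paper data.

from dataclasses import dataclass


@dataclass
class AcceleratorResult:
    name: str            # e.g. "GPU baseline", "FPGA overlay", "custom ASIC"
    tokens_generated: int
    runtime_s: float     # wall-clock inference time in seconds
    avg_power_w: float   # average board power draw in watts

    @property
    def throughput(self) -> float:
        """Tokens generated per second."""
        return self.tokens_generated / self.runtime_s

    @property
    def energy_efficiency(self) -> float:
        """Tokens generated per joule of energy consumed."""
        return self.tokens_generated / (self.avg_power_w * self.runtime_s)


# Hypothetical measurements purely for illustration.
results = [
    AcceleratorResult("GPU baseline", tokens_generated=10_000,
                      runtime_s=20.0, avg_power_w=300.0),
    AcceleratorResult("FPGA overlay", tokens_generated=10_000,
                      runtime_s=35.0, avg_power_w=60.0),
]

for r in results:
    print(f"{r.name}: {r.throughput:.1f} tokens/s, "
          f"{r.energy_efficiency:.2f} tokens/J")
```

Note that a platform with lower raw throughput (here the hypothetical FPGA overlay) can still come out ahead on tokens per joule, which is the trade-off the survey's energy-efficiency analysis highlights.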
