A Survey on Hardware Accelerators for Large Language Models (2401.09890v1)
Abstract: Large language models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. As the demand for more sophisticated LLMs continues to grow, there is a pressing need to address the computational challenges associated with their scale and complexity. This paper presents a comprehensive survey on hardware accelerators designed to enhance the performance and energy efficiency of LLMs. By examining a diverse range of accelerators, including GPUs, FPGAs, and custom-designed architectures, we explore the landscape of hardware solutions tailored to meet the unique computational demands of LLMs. The survey encompasses an in-depth analysis of architecture, performance metrics, and energy efficiency considerations, providing valuable insights for researchers, engineers, and decision-makers aiming to optimize the deployment of LLMs in real-world applications.
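Many of the attention-focused accelerators cited below (e.g., A³, ELSA, SpAtten, Sanger) target the scaled dot-product attention kernel, and the survey compares designs on throughput and energy-efficiency metrics such as GOPS/W. As a quick, illustrative sketch (not taken from the survey), the snippet below shows that kernel in NumPy together with a toy GOPS/W estimate; the latency and power figures are assumed for illustration only, not measured results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single attention head."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)               # (seq, seq) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq, d) weighted values

def gops_per_watt(seq_len, d_model, latency_s, power_w):
    """Toy efficiency metric: attention matmul FLOPs / (latency * power), in GOPS/W."""
    flops = 4 * seq_len * seq_len * d_model       # two matmuls of ~2*n*n*d FLOPs each
    return flops / (latency_s * power_w) / 1e9

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d = 128, 64
    Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (128, 64)
    # Hypothetical accelerator figures: 1 ms latency at 10 W for this kernel.
    print(f"{gops_per_watt(seq, d, latency_s=1e-3, power_w=10.0):.2f} GOPS/W")
```

The works examined in the survey include the following: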
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory. arXiv:2312.11514 [cs.CL]
- Transformer-OPU: An FPGA-based Overlay Processor for Transformer Networks. In 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 221–221. https://doi.org/10.1109/FCCM57271.2023.00049
- Peter Belcak and Roger Wattenhofer. 2023. Exponentially Faster Language Modelling. arXiv:2311.10770 [cs.CL]
- Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318 [cs.CL]
- Accelerating Transformer Networks through Recomposing Softmax Layers. In 2022 IEEE International Symposium on Workload Characterization (IISWC). 92–103. https://doi.org/10.1109/IISWC55918.2022.00018
- A Comprehensive Performance Study of Large Language Models on Novel AI Accelerators. arXiv:2310.04607 [cs.PF]
- ATT: A Fault-Tolerant ReRAM Accelerator for Attention-based Neural Networks. In 2020 IEEE 38th International Conference on Computer Design (ICCD). IEEE Computer Society, Los Alamitos, CA, USA, 213–221. https://doi.org/10.1109/ICCD50377.2020.00047
- A³: Accelerating Attention Mechanisms in Neural Networks with Approximation. arXiv:2002.10941 [cs.DC]
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 692–705. https://doi.org/10.1109/ISCA52012.2021.00060
- Bobby He and Thomas Hofmann. 2023. Simplifying Transformer Blocks. arXiv:2311.01906 [cs.LG]
- DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation. arXiv:2209.10797 [eess.SY]
- Hardware-friendly compression and hardware acceleration for transformer: A survey. Electronic Research Archive 30, 10 (2022), 3755–3785. https://doi.org/10.3934/era.2022192
- A Fast and Flexible FPGA-Based Accelerator for Natural Language Processing Neural Networks. ACM Trans. Archit. Code Optim. 20, 1, Article 11 (feb 2023), 24 pages. https://doi.org/10.1145/3564606
- MnnFast: A Fast and Scalable System Architecture for Memory-Augmented Neural Networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 250–263.
- NPE: An FPGA-Based Overlay Processor for Natural Language Processing (FPGA ’21). Association for Computing Machinery, New York, NY, USA, 227. https://doi.org/10.1145/3431920.3439477
- In-Memory Computing based Accelerator for Transformer Networks for Long Sequences. In 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE). 1839–1844. https://doi.org/10.23919/DATE51398.2021.9474146
- FTRANS: Energy-Efficient Acceleration of Transformers Using FPGA. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (Boston, Massachusetts) (ISLPED ’20). Association for Computing Machinery, New York, NY, USA, 175–180. https://doi.org/10.1145/3370748.3406567
- Sanger: A Co-Design Framework for Enabling Sparse Attention Using Reconfigurable Architecture. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Greece) (MICRO ’21). Association for Computing Machinery, New York, NY, USA, 977–991. https://doi.org/10.1145/3466752.3480125
- Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer. In 2020 IEEE 33rd International System-on-Chip Conference (SOCC). 84–89. https://doi.org/10.1109/SOCC49529.2020.9524802
- A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE. arXiv:2401.02721 [cs.LG]
- Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning. In 2021 22nd International Symposium on Quality Electronic Design (ISQED). 142–148. https://doi.org/10.1109/ISQED51717.2021.9424344
- Premkishore Shivakumar and Norman Jouppi. 2001. CACTI 3.0: An Integrated Cache Timing, Power, and Area Model.
- X-Former: In-Memory Acceleration of Transformers. arXiv:2303.07470 [cs.LG]
- Hardware Acceleration of Transformer Networks using FPGAs. In 2022 Panhellenic Conference on Electronics and Telecommunications (PACET). 1–5. https://doi.org/10.1109/PACET56979.2022.9976354
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. arXiv:2012.09852 [cs.AR]
- LightSeq2: Accelerated Training for Transformer-Based Models on GPUs. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14. https://doi.org/10.1109/SC41404.2022.00043
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation. arXiv:2203.16487 [cs.CL]
- Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL]
- ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration. In 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). 1–9.
- The CUBLAS and CULA based GPU acceleration of adaptive finite element framework for bioluminescence tomography. Opt. Express 18, 19 (Sep 2010), 20201–20214. https://doi.org/10.1364/OE.18.020201
- Transformer-based models and hardware acceleration analysis in autonomous driving: A survey. arXiv:2304.10891 [cs.LG]
- Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 1 (2023), 136–149. https://doi.org/10.1109/TCAD.2022.3170848
Author: Christoforos Kachris