Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference (2312.15159v2)
Abstract: Recent advances in large language models (LLMs) with billions of parameters have created significant demand for their efficient deployment in inference workloads. Most existing approaches rely on temporal architectures that reuse hardware units across different network layers and operators; however, these methods often struggle to achieve low latency because of substantial memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach specializes distinct hardware units for specific operators or layers and lets them communicate directly through a dataflow architecture, minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through this analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementation of LLMs on FPGAs, we further provide a library of composable and reusable high-level synthesis (HLS) kernels, which will be made available as open source. To validate the effectiveness of both our analytical model and HLS library, we implement BERT and GPT2 on an AMD Alveo U280 FPGA. Experimental results demonstrate that our approach achieves up to a 13.4x speedup over previous FPGA-based accelerators for BERT. For GPT generative inference, we attain a 2.2x speedup over DFX, an FPGA overlay, in the prefill stage, and a 1.9x speedup along with a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
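To make the analytical reasoning concrete, the following is a minimal sketch of the roofline-style estimate such a performance model builds on: each dataflow stage is bounded by whichever of its compute time or off-chip memory time dominates. The FPGAConfig fields, the bandwidth allocation, and all numbers are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal roofline-style latency estimate for a single spatial (dataflow)
# pipeline stage of an LLM accelerator. Illustrative only: parameter names
# and numbers are assumptions, not the paper's analytical model.

from dataclasses import dataclass


@dataclass
class FPGAConfig:
    dsp_count: int                 # DSPs allocated to this stage
    macs_per_dsp_per_cycle: float  # e.g., 2 packed INT8 MACs per DSP per cycle
    freq_hz: float                 # achieved clock frequency
    mem_bandwidth_bps: float       # off-chip bandwidth (bytes/s) given to this stage


def stage_latency_s(cfg: FPGAConfig, macs: int, offchip_bytes: int) -> float:
    """Latency of one stage, bounded by compute or off-chip traffic."""
    compute_s = macs / (cfg.dsp_count * cfg.macs_per_dsp_per_cycle * cfg.freq_hz)
    memory_s = offchip_bytes / cfg.mem_bandwidth_bps
    return max(compute_s, memory_s)


# Example: a 768x768 linear projection for a single decode-stage token,
# with INT8 weights streamed from HBM and activations kept on chip.
cfg = FPGAConfig(
    dsp_count=1024,
    macs_per_dsp_per_cycle=2,
    freq_hz=250e6,
    mem_bandwidth_bps=460e9 / 8,   # assume 1/8 of the U280's ~460 GB/s HBM
)
macs = 768 * 768           # one token, hidden size 768
weight_bytes = 768 * 768   # INT8 weight matrix
print(f"estimated stage latency: {stage_latency_s(cfg, macs, weight_bytes) * 1e6:.1f} us")
```

Under these assumed numbers the stage is memory-bound (weight streaming dominates), which illustrates why single-token decode favors a bandwidth-balanced spatial design over a temporal architecture with heavy off-chip traffic.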
- TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation.
- DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
- FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 16, 2, Article 23 (mar 2023), 32 pages. https://doi.org/10.1145/3570928
- FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 11, 3 (2018), 1–23.
- On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258 (2021).
- A cloud-scale acceleration architecture. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–13. https://doi.org/10.1109/MICRO.2016.7783710
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees. arXiv preprint arXiv:2307.13304 (2023).
- Accelerating Large Language Model Decoding with Speculative Sampling. arXiv preprint arXiv:2302.01318 (2023).
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
- Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021).
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
- Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20. https://doi.org/10.1109/MM.2018.022071131
- LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239 (2022).
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv preprint arXiv:2205.14135 (2022).
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.).
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
- PaLM-E: An Embodied Multimodal Language Model. arXiv preprint arXiv:2303.03378 (2023).
- High-Performance Sparse Linear Algebra on HBM-Equipped FPGAs Using HLS: A Case Study on SpMV. Int’l Symp. on Field-Programmable Gate Arrays (FPGA) (2022).
- ELS-RD. 2022. kernl.ai. https://github.com/ELS-RD/kernl.
- hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices.
- GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. arXiv preprint arXiv:2210.17323 (2022).
- AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’21). Association for Computing Machinery, New York, NY, USA, 81–92. https://doi.org/10.1145/3431920.3439289
- DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). 616–630. https://doi.org/10.1109/MICRO56248.2022.00051
- PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow. IEEE Trans. Comput. 70, 12 (2021), 2015–2028. https://doi.org/10.1109/TC.2021.3123465
- HuggingFace. 2023. Text generation strategies. https://huggingface.co/docs/transformers/generation_strategies.
- A Fast and Flexible FPGA-Based Accelerator for Natural Language Processing Neural Networks. ACM Trans. Archit. Code Optim. 20, 1, Article 11 (feb 2023), 24 pages. https://doi.org/10.1145/3564606
- Intel. 2022. Intel Agilex 7 FPGA and SoC FPGA. https://www.intel.com/content/www/us/en/products/details/fpga/agilex/7.html.
- NPE: An FPGA-Based Overlay Processor for Natural Language Processing. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’21). Association for Computing Machinery, New York, NY, USA, 227. https://doi.org/10.1145/3431920.3439477
- PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs. In 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 12–22. https://doi.org/10.1109/FCCM57271.2023.00011
- I-BERT: Integer-only BERT Quantization. In Proceedings of the International Conference on Machine Learning (ICML).
- SqueezeLLM: Dense-and-Sparse Quantization. arXiv (2023).
- Full Stack Optimization of Transformer Inference: a Survey. arXiv preprint arXiv:2302.14017 (2023).
- Reducing Activation Recomputation in Large Transformer Models. arXiv preprint arXiv:2205.05198 (2022).
- Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
- Stratix 10 NX Architecture and Applications. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’21). Association for Computing Machinery, New York, NY, USA, 57–67. https://doi.org/10.1145/3431920.3439293
- xFormers: A modular and hackable Transformer modelling library. https://github.com/facebookresearch/xformers.
- FTRANS: Energy-Efficient Acceleration of Transformers Using FPGA. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (Boston, Massachusetts) (ISLPED ’20). Association for Computing Machinery, New York, NY, USA, 175–180. https://doi.org/10.1145/3370748.3406567
- Efficient Methods for Mapping Neural Machine Translator on FPGAs. IEEE Transactions on Parallel and Distributed Systems (TPDS) 32, 7 (2021), 1866–1877. https://doi.org/10.1109/TPDS.2020.3047371
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proc. VLDB Endow. (2020).
- Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097. https://doi.org/10.1126/science.abq1158 arXiv:https://www.science.org/doi/pdf/10.1126/science.abq1158
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, Boston, MA, 663–679. https://www.usenix.org/conference/osdi23/presentation/li-zhouhan
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978 (2023).
- RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019).
- Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2021), 513–516.
- Meta. 2021. Fully Sharded Data Parallel: faster AI training with fewer GPUs. https://engineering.fb.com/2021/07/15/open-source/fsdp/.
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. Proc. VLDB Endow. 16, 3 (nov 2022), 470–479. https://doi.org/10.14778/3570690.3570697
- PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles.
- Memory-Efficient Pipeline-Parallel DNN Training. In Proceedings of the 38th International Conference on Machine Learning.
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The Eleventh International Conference on Learning Representations.
- Nvidia. 2022. FasterTransformer. https://github.com/NVIDIA/FasterTransformer.
- OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
- The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031 (2016).
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
- A Length Adaptive Algorithm-Hardware Co-Design of Transformer on FPGA through Sparse Attention and Dynamic Pipelining. In Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco, California) (DAC ’22). Association for Computing Machinery, New York, NY, USA, 1135–1140. https://doi.org/10.1145/3489517.3530585
- Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning. In 2021 22nd International Symposium on Quality Electronic Design (ISQED). 142–148. https://doi.org/10.1109/ISQED51717.2021.9424344
- Memory-Efficient Dataflow Inference for Deep CNNs on FPGA. 2020 International Conference on Field-Programmable Technology (ICFPT) (2020), 48–55.
- Efficiently Scaling Transformer Inference. In Proceedings of Machine Learning and Systems.
- Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization. In 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD) (Munich, Germany). IEEE Press, 1–9. https://doi.org/10.1109/ICCAD51958.2021.9643586
- Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization. In Proceedings of the 2021 on Great Lakes Symposium on VLSI (Virtual Event, USA) (GLSVLSI ’21). Association for Computing Machinery, New York, NY, USA, 163–168. https://doi.org/10.1145/3453688.3461739
- Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
- Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. In The Thirty-Fourth AAAI Conference on Artificial Intelligence. AAAI Press, 8815–8821.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053 (2019).
- FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization: Late Breaking Results. In Proceedings of the 59th ACM/IEEE Design Automation Conference (San Francisco, California) (DAC ’22). Association for Computing Machinery, New York, NY, USA, 1394–1395. https://doi.org/10.1145/3489517.3530618
- LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288 (2023).
- FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’17). ACM, 65–74.
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 267–284. https://www.usenix.org/conference/osdi22/presentation/unger
- Attention is All you Need. In Advances in Neural Information Processing Systems.
- AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA ’21). Association for Computing Machinery, New York, NY, USA, 93–104. https://doi.org/10.1145/3431920.3439292
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, USA, 93–106. https://doi.org/10.1145/3567955.3567959
- Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022). https://openreview.net/forum?id=yzkSU5zdwD
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824–24837.
- HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771 (2019).
- HeteroFlow: An Accelerator Programming Model with Decoupled Data Placement for Software-Defined FPGAs. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning.
- Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning. In Proceedings of Machine Learning and Systems, D. Marculescu, Y. Chi, and C. Wu (Eds.), Vol. 4. 548–566. https://proceedings.mlsys.org/paper_files/paper/2022/file/f0f9e98bc2e2f0abc3e315eaa0d808fc-Paper.pdf
- AMD Xilinx. 2021. Alveo U280 Data Center Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u280.html#specifications.
- AMD Xilinx. 2022a. AI Engines and Their Applications. White Paper. AMD Xilinx.
- AMD Xilinx. 2022b. QSFP Module Connector. (2022). https://docs.xilinx.com/r/en-US/ug1411-vmk180-eval-bd/QSFP-Module-Connector
- AMD Xilinx. 2022c. VCK5000 Versal Development Card. https://www.xilinx.com/products/boards-and-kits/vck5000.html#specs.
- AMD Xilinx. 2022d. Vitis Accelerated Libraries. https://github.com/Xilinx/Vitis_Libraries.
- AMD Xilinx. 2022e. Vitis AI: Adaptable & Real-Time AI Inference Acceleration. https://github.com/Xilinx/Vitis-AI.
- AMD Xilinx. 2022f. Vitis HLS v2022.1. https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html.
- AMD Xilinx. 2023. Versal VHK158. https://www.xilinx.com/products/boards-and-kits/vhk158.html.
- On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 975, 10 pages.
- PipeMare: Asynchronous Pipeline Parallel DNN Training. In Proceedings of Machine Learning and Systems.
- AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. In 2023 IEEE/ACM International Conference On Computer Aided Design (ICCAD).
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. Advances in Neural Information Processing Systems 35 (2022), 27168–27183.
- ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA).
- RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv preprint arXiv:2304.01089 (2023).
- Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Monterey, California, USA) (FPGA ’15). Association for Computing Machinery, New York, NY, USA, 161–170. https://doi.org/10.1145/2684746.2689060
- DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8. https://doi.org/10.1145/3240765.3240801
- Algorithm-Hardware Co-Design of Attention Mechanism on FPGA Devices. ACM Trans. Embed. Comput. Syst. 20, 5s, Article 71 (sep 2021), 24 pages. https://doi.org/10.1145/3477002
- DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-Based DNN Accelerator. In Proceedings of the 39th International Conference on Computer-Aided Design (Virtual Event, USA) (ICCAD ’20). Association for Computing Machinery, New York, NY, USA, Article 61, 9 pages. https://doi.org/10.1145/3400302.3415609
- Binarized Neural Machine Translation. arXiv preprint arXiv:2302.04907 (2023).
- FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations. The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2021).
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. arXiv preprint arXiv:2310.19102 (2023).
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685 (2023).
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 559–578. https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
Authors: Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang