Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform (2405.19284v1)
Abstract: Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) and computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform, implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines that minimize costly main memory accesses and tolerate their latency. We focus on two foundational transformer topologies: encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x of the most optimized implementation over the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x in hardware platform utilization while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve a 16.1x speedup in the Non-Autoregressive (NAR) mode and up to a 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.
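To make the "distributed Softmax primitive" mentioned in the abstract concrete, the sketch below shows the numerically stable, row-wise softmax that such a primitive parallelizes across cores. This is a minimal, single-core C illustration under our own assumptions: the row partitioning across the cluster, the SSR/FREP operand streaming, and the DMA double-buffering described in the paper are intentionally omitted, and the function names are hypothetical rather than taken from the authors' codebase.

```c
/*
 * Illustrative sketch only: the numerically stable, row-wise softmax that a
 * distributed Softmax primitive would parallelize across compute cores.
 * A real implementation on the platform described in the paper would split
 * rows across cores, stream operands via SSR/FREP, and double-buffer tiles
 * with the DMA engine; none of that is shown here. Names are hypothetical.
 */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Softmax over one row of `len` floats, computed in place. */
static void softmax_row(float *row, size_t len) {
    /* Pass 1: find the row maximum for numerical stability. */
    float max = row[0];
    for (size_t i = 1; i < len; i++) {
        if (row[i] > max) max = row[i];
    }
    /* Pass 2: exponentiate shifted values and accumulate the denominator. */
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++) {
        row[i] = expf(row[i] - max);
        sum += row[i];
    }
    /* Pass 3: normalize by the accumulated sum. */
    const float inv = 1.0f / sum;
    for (size_t i = 0; i < len; i++) {
        row[i] *= inv;
    }
}

int main(void) {
    float logits[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    softmax_row(logits, 4);
    for (int i = 0; i < 4; i++) printf("%.4f ", logits[i]);
    printf("\n");
    return 0;
}
```

In a distributed setting, each core would run the same three passes over its slice of a row and exchange only the partial maxima and partial sums, which keeps inter-core traffic small relative to the attention matrix itself.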