Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform (2405.19284v1)

Published 29 May 2024 in cs.DC, cs.AI, and cs.AR

Abstract: Transformer-based foundation models have become crucial for various domains, most notably NLP or computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies, encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x utilizing the HW platform while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve 16.1x speedup in the Non-Autoregressive (NAR) mode and up to 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.


Summary

  • The paper introduces an open-source FM library that efficiently executes both encoder-only and decoder-only transformer models on a many-tiny-core RISC-V system.
  • The research leverages specialized RISC-V ISA extensions, achieving up to 35.6× speedup for autoregressive modes and enhanced efficiency across various data precisions.
  • Comprehensive benchmarking shows 2.04× higher FPU utilization than the best SoA dedicated accelerator and up to 294 GFLOPS/W, underscoring the platform's potential for cost-effective, scalable AI deployments.

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

The paper "Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform" presents a comprehensive investigation into the end-to-end inference of transformer models on an open-source, many-tiny-core RISC-V platform. This work is a collaboration between various researchers from ETH Zurich, University of Bologna, and Politecnico di Torino. The paper is novel in demonstrating how foundation models (FMs) can be efficiently executed on a RISC-V-based platform, addressing the current void of RISC-V usage in FM deployment.

Key Contributions

  1. Open-source FM Library: The authors have developed an open-source library that supports both encoder-only and decoder-only models, leveraging the hardware capabilities and Instruction Set Architecture (ISA) extensions of the RISC-V multi-core platform. These capabilities include advanced Direct Memory Access (DMA) engines and cluster-to-cluster data transfers, which enhance performance by reducing main memory accesses.
  2. Kernel Optimization with ISA Extensions: The research provides a detailed analysis of the performance gains obtained by employing specialized RISC-V ISA extensions, such as SIMD floating-point operand streaming and instruction repetition. These optimizations yield speedups of up to 35.6× for decoder models in autoregressive (AR) mode, 16.1× in non-autoregressive (NAR) mode, and 12.8× for encoder-only models such as Vision Transformers (ViTs), with over 79% FPU utilization reached on the encoder-only workloads; a conceptual sketch of what these extensions remove from the inner loop follows this list.
  3. Precision Scalability: The paper explores performance scalability across different data precisions—FP64, FP32, FP16, and FP8. The optimized library benchmarks show that using lower precision formats significantly improves efficiency, achieving up to 294 GFLOPS/W with FP8 precision.
  4. First Fully Open-source Deployment: This work pioneers the full open-source deployment of ViT and GPT models on an open-source RISC-V hardware architecture, showcasing flexibility and the potential for large-scale adoption.
  5. Comprehensive Benchmarking: Benchmarking results demonstrate that the proposed end-to-end inference engine outperforms state-of-the-art (SoA) platforms in terms of hardware utilization, reaching 2.04× higher FPU utilization than the best SoA dedicated accelerator and a minimum speedup of 1.81× over the best competitor.
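
To make the second contribution more concrete, the sketch below shows a plain C reference loop of the kind these kernels are built from, with comments marking the overheads that operand streaming and hardware instruction repetition are reported to eliminate. This is a conceptual illustration only, not code for the extended ISA itself.

```c
// Reference inner loop (plain C). The comments indicate which per-iteration
// overheads the paper's ISA extensions target; the extensions themselves are
// not expressible in portable C.
float dot(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        // Each iteration issues loads of a[i] and b[i], address increments,
        // a compare, and a branch around a single fused multiply-add.
        // Operand streaming maps the a/b accesses onto registers fed directly
        // by the memory system, and instruction repetition replaces the loop
        // bookkeeping, so ideally only the FMA is issued every cycle.
        acc += a[i] * b[i];
    }
    return acc;
}
```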

Detailed Analysis

The researchers targeted various transformer-based foundation models, including encoder-only (like ViTs) and decoder-only (like GPT) models, to validate their framework. The attention layers, whose cost scales quadratically with the input sequence length, were optimized using the FlashAttention-2 algorithm, which computes exact attention in tiles so that the full score matrix never has to be materialized, reducing latency and main memory accesses.
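
To illustrate the tiling idea behind FlashAttention-2 (not the paper's actual kernel), the sketch below computes exact attention for a single query row while scanning the keys and values in blocks, keeping only a running maximum and running denominator instead of the full score row.

```c
#include <math.h>
#include <stddef.h>

// Exact attention for one query row q[d] against K[n x d] and V[n x d]
// (row-major), scanning keys/values in blocks of `tile` rows. Only a running
// maximum and running denominator are kept, so the score row is never stored.
void attention_row(const float *q, const float *K, const float *V,
                   float *o, size_t n, size_t d, size_t tile) {
    float m = -INFINITY;                               // running max of scores
    float l = 0.0f;                                    // running softmax denominator
    const float inv_sqrt_d = 1.0f / sqrtf((float)d);
    for (size_t j = 0; j < d; ++j) o[j] = 0.0f;

    for (size_t t0 = 0; t0 < n; t0 += tile) {          // one K/V block at a time;
        size_t t1 = (t0 + tile < n) ? t0 + tile : n;   // a real kernel would DMA
        for (size_t i = t0; i < t1; ++i) {             // the block in here
            float s = 0.0f;                            // s = (q . K[i]) / sqrt(d)
            for (size_t j = 0; j < d; ++j) s += q[j] * K[i * d + j];
            s *= inv_sqrt_d;

            float m_new = (s > m) ? s : m;             // online softmax update:
            float scale = expf(m - m_new);             // rescale old accumulator
            float p = expf(s - m_new);                 // weight of the new key
            l = l * scale + p;
            for (size_t j = 0; j < d; ++j)
                o[j] = o[j] * scale + p * V[i * d + j];
            m = m_new;
        }
    }
    for (size_t j = 0; j < d; ++j) o[j] /= l;          // final normalization
}
```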

For encoder-only models, such as different variants of ViTs (Base, Large, Huge), the paper achieved significant speedups by spatially tiling the GEMM operations across clusters and employing temporal tiling when needed. Double buffering techniques were also used to hide memory transfer latencies effectively. The hierarchical interconnect of the platform facilitated efficient data transfers at different levels of the memory hierarchy, further improving performance by minimizing costly main memory accesses.
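
The double-buffering pattern described above can be sketched as follows, assuming a hypothetical asynchronous dma_start_load/dma_wait pair (stubbed here with a synchronous memcpy so the example is self-contained); the actual platform uses its own DMA engines and scratchpad layout.

```c
#include <stddef.h>
#include <string.h>

#define TILE_ROWS 8
#define MAX_K     256            // sketch assumes K <= MAX_K

// Hypothetical asynchronous DMA API, stubbed synchronously for this sketch.
static void dma_start_load(float *dst, const float *src, size_t bytes) {
    memcpy(dst, src, bytes);     // a real engine would return immediately
}
static void dma_wait(void) { /* a real engine would block until completion */ }

// C[M x N] += A[M x K] * B[K x N], streaming A in TILE_ROWS-row tiles through
// two scratchpad buffers: while the FPU works on one buffer, the DMA engine
// fills the other, hiding main-memory latency.
void gemm_double_buffered(const float *A, const float *B, float *C,
                          size_t M, size_t N, size_t K) {
    static float buf[2][TILE_ROWS * MAX_K];
    int cur = 0;

    size_t rows0 = (TILE_ROWS < M) ? TILE_ROWS : M;
    dma_start_load(buf[cur], A, rows0 * K * sizeof(float));   // prefetch tile 0

    for (size_t m0 = 0; m0 < M; m0 += TILE_ROWS) {
        size_t rows = (m0 + TILE_ROWS < M) ? TILE_ROWS : M - m0;
        dma_wait();                                // tile in buf[cur] is resident

        size_t next = m0 + TILE_ROWS;              // start fetching the next tile
        if (next < M) {                            // into the other buffer
            size_t nrows = (next + TILE_ROWS < M) ? TILE_ROWS : M - next;
            dma_start_load(buf[cur ^ 1], A + next * K, nrows * K * sizeof(float));
        }

        for (size_t i = 0; i < rows; ++i)          // compute on the resident tile
            for (size_t nn = 0; nn < N; ++nn) {    // while the next transfer is
                float acc = C[(m0 + i) * N + nn];  // in flight
                for (size_t k = 0; k < K; ++k)
                    acc += buf[cur][i * K + k] * B[k * N + nn];
                C[(m0 + i) * N + nn] = acc;
            }
        cur ^= 1;
    }
}
```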

For decoder-only models, the paper focused on architectural and kernel optimizations for both non-autoregressive and autoregressive modes. The results showed substantial performance improvements, particularly at FP8 precision, demonstrating the efficacy of mixed-precision execution. The use of cluster-to-cluster data transfers enabled efficient layer fusion in the MLP and MHA blocks, which is essential for reducing intermediate main-memory accesses and enhancing computational efficiency.
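
To clarify why the AR and NAR modes stress the hardware so differently, the sketch below shows a minimal per-head key/value cache as it might be used in AR decoding: each generated token appends one key/value row, and step t attends over the t+1 cached rows, whereas NAR execution materializes all rows up front and runs a single batched pass. This is an illustrative data structure, not the layout used by the paper's library.

```c
#include <stdlib.h>
#include <string.h>

// Minimal per-head key/value cache for autoregressive decoding.
typedef struct {
    float *k, *v;                // [capacity x d], row-major
    size_t d, len, capacity;
} kv_cache_t;

int kv_cache_init(kv_cache_t *c, size_t d, size_t capacity) {
    c->k = malloc(capacity * d * sizeof(float));
    c->v = malloc(capacity * d * sizeof(float));
    c->d = d; c->len = 0; c->capacity = capacity;
    return (c->k && c->v) ? 0 : -1;
}

// AR step: append the key/value projection of the newest token. Attention at
// this step then runs over only c->len rows, a thin, bandwidth-bound
// matrix-vector workload; NAR mode instead fills the whole cache at once and
// runs a single compute-bound matrix-matrix pass.
void kv_cache_append(kv_cache_t *c, const float *k_row, const float *v_row) {
    if (c->len == c->capacity) return;       // sketch: no growth handling
    memcpy(c->k + c->len * c->d, k_row, c->d * sizeof(float));
    memcpy(c->v + c->len * c->d, v_row, c->d * sizeof(float));
    c->len++;
}

void kv_cache_free(kv_cache_t *c) { free(c->k); free(c->v); }
```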

Implications and Future Directions

The implications of this research are profound, both practically and theoretically. By demonstrating the feasibility of running large-scale transformer models on an open-source RISC-V platform, the paper opens avenues for cost-effective and transparent AI deployments. This democratization of AI hardware aligns well with the increasing need for open-source ecosystems in machine learning research and applications.

The strong numerical results, such as achieving 294 GFLOPS/W with FP8 precision and more than doubling the FPU utilization of best-in-class SoA platforms, suggest that future developments could focus on further optimizing the RISC-V ISA for AI workloads. Additionally, expanding the scale of the architecture to multi-chiplet systems offers a promising direction for future research, potentially enabling the handling of even larger models and more complex AI tasks.

In conclusion, this research significantly pushes the boundaries of RISC-V platform capabilities in the context of foundation model inference, setting a new benchmark in open-source hardware and software co-design for AI applications.