TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs (2311.10189v2)

Published 16 Nov 2023 in cs.DC and cs.AR

Abstract: Despite the increasing adoption of Field-Programmable Gate Arrays (FPGAs) in compute clouds, there remains a significant gap in programming tools and abstractions which can leverage network-connected, cloud-scale, multi-die FPGAs to generate accelerators with high frequency and throughput. To this end, we propose TAPA-CS, a task-parallel dataflow programming framework which automatically partitions and compiles a large design across a cluster of FPGAs with no additional user effort while achieving high frequency and throughput. TAPA-CS has three main contributions. First, it is an open-source framework which allows users to leverage virtually "unlimited" accelerator fabric, high-bandwidth memory (HBM), and on-chip memory, by abstracting away the underlying hardware. This reduces the user's programming burden to a logical one, enabling software developers and researchers with limited FPGA domain knowledge to deploy larger designs than possible earlier. Second, given as input a large design, TAPA-CS automatically partitions the design to map to multiple FPGAs, while ensuring congestion control, resource balancing, and overlapping of communication and computation. Third, TAPA-CS couples coarse-grained floorplanning with automated interconnect pipelining at the inter- and intra-FPGA levels to ensure high frequency. We have tested TAPA-CS on our multi-FPGA testbed where the FPGAs communicate through a high-speed 100Gbps Ethernet infrastructure. We have evaluated the performance and scalability of our tool on designs, including systolic-array based convolutional neural networks (CNNs), graph processing workloads such as page rank, stencil applications like the Dilate kernel, and K-nearest neighbors (KNN). TAPA-CS has the potential to accelerate development of increasingly complex and large designs on the low power and reconfigurable FPGAs.
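
To make the programming model concrete: TAPA-CS builds on the task-parallel dataflow style of the open-source TAPA framework, in which a design is written as a set of C++ task functions connected by FIFO streams, and the tool handles partitioning, floorplanning, and interconnect pipelining behind the scenes. The sketch below is a minimal vector-add task graph written against the upstream TAPA API; the header `tapa.h`, the `tapa::stream`/`tapa::task` names, and the task function names are taken from the TAPA project rather than from this paper, so treat them as illustrative assumptions about how a TAPA-CS input design might look.

```cpp
#include <cstdint>

#include <tapa.h>  // upstream TAPA header (assumed; not shown in this paper)

// Stream data from external (HBM/DDR) memory into a FIFO.
void Mmap2Stream(tapa::mmap<const float> mem, uint64_t n,
                 tapa::ostream<float>& out) {
  for (uint64_t i = 0; i < n; ++i) out << mem[i];
}

// Element-wise add of two input streams into an output stream.
void Add(tapa::istream<float>& a, tapa::istream<float>& b,
         tapa::ostream<float>& c, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) c << (a.read() + b.read());
}

// Drain the result stream back to external memory.
void Stream2Mmap(tapa::istream<float>& in, tapa::mmap<float> mem, uint64_t n) {
  for (uint64_t i = 0; i < n; ++i) in >> mem[i];
}

// Top-level task: instantiates the FIFO channels and launches the child
// tasks, which run concurrently and communicate only through the streams.
void VecAdd(tapa::mmap<const float> a, tapa::mmap<const float> b,
            tapa::mmap<float> c, uint64_t n) {
  tapa::stream<float, 8> a_q("a");
  tapa::stream<float, 8> b_q("b");
  tapa::stream<float, 8> c_q("c");

  tapa::task()
      .invoke(Mmap2Stream, a, n, a_q)
      .invoke(Mmap2Stream, b, n, b_q)
      .invoke(Add, a_q, b_q, c_q, n)
      .invoke(Stream2Mmap, c_q, c, n);
}
```

Because tasks interact only through latency-insensitive FIFO edges, any edge of the task graph can in principle be cut at a partition boundary; per the abstract, TAPA-CS uses this property to split a design across multiple FPGAs, carrying cut edges over the testbed's 100 Gbps Ethernet links while balancing resources and pipelining the inter- and intra-FPGA interconnect.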

Authors (5)
  1. Neha Prakriya (6 papers)
  2. Yuze Chi (14 papers)
  3. Suhail Basalama (3 papers)
  4. Linghao Song (17 papers)
  5. Jason Cong (62 papers)
Citations (1)
