ACCL+: an FPGA-Based Collective Engine for Distributed Applications (2312.11742v1)
Abstract: FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.