
FPsPIN: An FPGA-based Open-Hardware Research Platform for Processing in the Network (2405.16378v1)

Published 25 May 2024 in cs.NI, cs.DC, and cs.PF

Abstract: In the era of post-Moore computing, network offload emerges as a solution to two challenges: the imperative for low-latency communication and the push towards hardware specialisation. Various methods have been employed to offload protocol- and data-processing onto network interface cards (NICs), from firmware modification to running full Linux on NICs for application execution. The sPIN project enables users to define handlers executed upon packet arrival. While simulations show sPIN's potential across diverse workloads, a full-system evaluation is lacking. This work presents FPsPIN, a full FPGA-based implementation of sPIN. FPsPIN is showcased through offloaded MPI datatype processing, achieving a 96% overlap ratio. FPsPIN provides an adaptable open-source research platform for researchers to conduct end-to-end experiments on smart NICs.

