
ARCANE: Adaptive Routing with Caching and Aware Network Exploration (2407.21625v4)

Published 31 Jul 2024 in cs.NI

Abstract: Next-generation datacenters require highly efficient network load balancing to manage the growing scale of AI training and general datacenter traffic. Existing solutions designed for Ethernet, such as Equal-Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization as datacenter topologies, and consequently the number of network failures, continue to grow. To address these limitations, we propose ARCANE, a lightweight, decentralized, per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. ARCANE adapts to network conditions by caching good-performing paths. In case of a network failure, ARCANE re-routes traffic away from the failure in less than 100 microseconds. ARCANE is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and introduces less than 25 bytes of per-connection state. We extensively evaluate ARCANE in large-scale simulations and on FPGA-based NICs.
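
The abstract does not spell out ARCANE's mechanism, so the following is only an illustrative sketch of the general idea of "caching good-performing paths" per connection, not the authors' algorithm. The class `PathCache`, its methods, the `capacity` parameter, and the feedback signal `ok` are all hypothetical names chosen for this example; the choice to consume a cached path on use and re-add it on positive feedback is likewise an assumption.

```python
import random
from collections import deque


class PathCache:
    """Hypothetical per-connection cache of path IDs that recently performed well."""

    def __init__(self, num_paths, capacity=4, rng=None):
        self.num_paths = num_paths
        # Bounded queue, so the per-connection state stays small (assumed size).
        self.good = deque(maxlen=capacity)
        self.rng = rng if rng is not None else random.Random()

    def pick_path(self):
        # Reuse a cached good path if one exists; otherwise explore at random.
        if self.good:
            return self.good.popleft()
        return self.rng.randrange(self.num_paths)

    def on_feedback(self, path, ok):
        # ok=True: the packet was acknowledged without a congestion signal.
        # ok=False: loss, timeout, or congestion signal on that path.
        if ok:
            if path not in self.good:
                self.good.append(path)
        else:
            # Evict the path so later packets steer away from it.
            while path in self.good:
                self.good.remove(path)
```

In this sketch a sender would call `pick_path()` for every outgoing packet and `on_feedback()` for every acknowledgement or loss event, so a failed path stops being reused as soon as negative feedback for it arrives.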

