
PET: Multi-agent Independent PPO-based Automatic ECN Tuning for High-Speed Data Center Networks (2405.11956v1)

Published 20 May 2024 in cs.NI

Abstract: Explicit Congestion Notification (ECN)-based congestion control schemes have been widely adopted in high-speed data center networks (DCNs), where the ECN marking threshold plays a decisive role in guaranteeing a lossless DCN. However, existing approaches either use static thresholds that cannot adapt to network dynamics, or fail to account for many-to-one traffic patterns and the differing requirements of different traffic types, resulting in relatively poor performance. To address these problems, this paper proposes a novel learning-based automatic ECN tuning scheme, named PET, based on the multi-agent Independent Proximal Policy Optimization (IPPO) algorithm. PET dynamically adjusts ECN thresholds by considering the pivotal congestion-contributing factors: queue length, output data rate, output rate of ECN-marked packets, current ECN threshold, the extent of incast, and the ratio of mice to elephant flows. PET adopts the Decentralized Training and Decentralized Execution (DTDE) paradigm and combines offline and online training to accommodate network dynamics. PET is also fair and readily deployable on commodity hardware. Comprehensive experimental results demonstrate that, compared with state-of-the-art static schemes and a learning-based automatic scheme, PET achieves better performance in terms of flow completion time, convergence rate, queue length variance, and system robustness.
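
To make the abstract's ingredients concrete, the following is a minimal sketch, not the authors' implementation. It packs the six congestion factors named above into a per-switch observation and evaluates the standard PPO clipped surrogate objective that each independent agent would maximize. The class name `SwitchObservation`, its field names, the function `ppo_clip_objective`, and the clip range of 0.2 are all assumptions made for illustration; the abstract does not specify units, normalization, or hyperparameters.

```python
from dataclasses import dataclass, astuple
from typing import List, Sequence


@dataclass
class SwitchObservation:
    """Hypothetical per-switch state built from the six factors in the abstract."""
    queue_length: float         # current queue occupancy
    output_rate: float          # output data rate of the port
    ecn_marked_rate: float      # output rate of ECN-marked packets
    ecn_threshold: float        # ECN marking threshold currently in force
    incast_degree: float        # extent of many-to-one (incast) traffic
    mice_elephant_ratio: float  # ratio of mice flows to elephant flows

    def to_vector(self) -> List[float]:
        # Flatten the fields, in declaration order, into a policy input vector.
        return list(astuple(self))


def ppo_clip_objective(ratios: Sequence[float],
                       advantages: Sequence[float],
                       eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate, averaged over a batch.

    ratios[i] is pi_new(a_i | s_i) / pi_old(a_i | s_i) and advantages[i] is the
    advantage estimate for the same sample. Under independent PPO, each agent
    would evaluate this on its own local samples only (assumed here).
    """
    if len(ratios) != len(advantages) or len(ratios) == 0:
        raise ValueError("ratios and advantages must be non-empty and equal in length")
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Illustrative values only; they are not taken from the paper.
obs = SwitchObservation(120.0, 9.5, 1.2, 200.0, 8.0, 0.9)
state = obs.to_vector()
objective = ppo_clip_objective([1.5, 0.5], [1.0, -1.0])
```

In a decentralized setup such as the one the abstract describes, each switch would hold its own policy and compute this objective from its own trajectory, with no shared critic or global state.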
