
Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

Published 5 Jan 2024 in cs.NI and cs.LG (arXiv:2401.02801v1)

Abstract: Packet buffers in datacenter switches are shared across all the switch ports to improve overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. The literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees than drop-tail algorithms. Unfortunately, switches cannot benefit from these algorithms because hardware lacks support for push-out operations. Our key observation is that drop-tail buffers can emulate push-out buffers if future packet arrivals are known ahead of time. This suggests that augmenting drop-tail algorithms with predictions about future arrivals has the potential to significantly improve performance. This paper is the first research attempt in this direction. We propose Credence, a drop-tail buffer sharing algorithm augmented with machine-learned predictions. Credence can unlock performance that has so far been attainable only by push-out algorithms. Its performance hinges on the accuracy of predictions. Specifically, with perfect predictions, Credence achieves the near-optimal performance of the best-known push-out algorithm, LQD (Longest Queue Drop), and it degrades gracefully to the performance of the simplest drop-tail algorithm, Complete Sharing, as the prediction error grows arbitrarily large. Our evaluations show that Credence improves throughput by $1.5\times$ compared to traditional approaches. In terms of flow completion times, we show that Credence improves upon state-of-the-art approaches by up to $95\%$ using off-the-shelf machine learning techniques that are also practical in today's hardware. We believe this work opens several opportunities for future work in both systems and theory, which we discuss at the end of this paper.
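
The key observation in the abstract, that a drop-tail buffer can emulate a push-out buffer when future outcomes are known, can be illustrated with a toy simulation. The sketch below is not the Credence algorithm; it assumes a simplified slotted model (each non-empty port queue transmits one packet per slot), LQD pushing out the tail of the longest queue, and a hypothetical "perfect oracle" built by recording which packets LQD drops. The names `simulate`, `predicted_drops`, and the example workload are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def simulate(arrivals, num_ports, buffer_size, policy, predicted_drops=frozenset()):
    """Toy shared-buffer switch. Returns (packets transmitted, set of dropped ids).

    arrivals: list of time slots, each a list of destination port indices.
    policy:   "lqd" (push-out) or "drop_tail" (optionally with predictions).
    """
    queues = [deque() for _ in range(num_ports)]
    sent, dropped, next_id = 0, set(), 0
    for slot in arrivals:
        for port in slot:
            pkt, next_id = next_id, next_id + 1
            occupancy = sum(len(q) for q in queues)
            if policy == "lqd":
                # Push-out: always admit, then evict the tail of the longest queue.
                queues[port].append(pkt)
                if occupancy + 1 > buffer_size:
                    longest = max(queues, key=len)
                    dropped.add(longest.pop())
            elif pkt in predicted_drops or occupancy >= buffer_size:
                # Drop-tail: reject on arrival if predicted doomed or buffer is full.
                dropped.add(pkt)
            else:
                queues[port].append(pkt)
        # Each non-empty queue transmits one packet per slot.
        for q in queues:
            if q:
                q.popleft()
                sent += 1
    # Packets still buffered after the last arrival are eventually transmitted.
    sent += sum(len(q) for q in queues)
    return sent, dropped

# Assumed workload: a burst of 6 packets to port 0, then steady traffic to both ports.
arrivals = [[0] * 6] + [[0, 1]] * 5

cs_sent, _ = simulate(arrivals, 2, 4, "drop_tail")              # Complete Sharing
lqd_sent, lqd_drops = simulate(arrivals, 2, 4, "lqd")           # push-out baseline
pred_sent, _ = simulate(arrivals, 2, 4, "drop_tail", lqd_drops) # perfect predictions
print(cs_sent, lqd_sent, pred_sent)
```

In this workload, plain drop-tail with no predictions behaves as Complete Sharing and lets the port-0 burst monopolize the buffer, while the prediction-augmented variant rejects on arrival exactly the packets LQD would later push out and matches LQD's throughput. The sketch has no safeguard against wrong "drop" predictions, so it says nothing about how Credence bounds its degradation under prediction error.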
