
Cloudy Forecast: How Predictable is Communication Latency in the Cloud? (2309.13169v1)

Published 22 Sep 2023 in cs.DC and cs.NI

Abstract: Many systems and services rely on timing assumptions for performance and availability to perform critical aspects of their operation, such as various timeouts for failure detectors or optimizations to concurrency control mechanisms. Many such assumptions rely on the ability of different components to communicate on time -- a delay in communication may trigger the failure detector or cause the system to enter a less-optimized execution mode. Unfortunately, these timing assumptions are often set with little regard to the actual communication guarantees of the underlying infrastructure -- in particular, the variability of communication delays between processes on different nodes/servers. This variability is especially pronounced for systems deployed in the public cloud: the cloud is a utility shared by many users and organizations, which makes it prone to higher performance variance due to the noisy-neighbor syndrome. In this work, we present Cloud Latency Tester (CLT), a simple tool that measures the variability of communication delays between nodes to help engineers set proper values for their timing assumptions. We also provide our observational analysis of running CLT on three major cloud providers and share the lessons we learned.
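
To make the idea of measuring delay variability concrete, here is a minimal Python sketch. It is not the CLT implementation, whose design the abstract does not describe. It assumes a TCP echo peer (a local stand-in started by the hypothetical helper `start_echo_server`), collects round-trip times with `measure_rtts`, and summarizes them with a nearest-rank `percentile`. The helper `suggest_timeout` and its `margin` factor are illustrative assumptions, not a recommendation from the paper.

```python
import math
import socket
import threading
import time


def start_echo_server(host="127.0.0.1"):
    """Start a one-connection TCP echo server on an ephemeral port.

    Hypothetical stand-in for a remote node; returns the (host, port) it listens on.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            while True:
                data = conn.recv(64)
                if not data:
                    break
                conn.sendall(data)
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()


def measure_rtts(addr, count=200, payload=b"x" * 32):
    """Send `count` small messages one at a time; return round-trip times in ms."""
    rtts = []
    with socket.create_connection(addr) as sock:
        # Disable Nagle's algorithm so small messages are not batched.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < len(payload):
                chunk = sock.recv(len(payload) - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                received += len(chunk)
            rtts.append((time.perf_counter() - start) * 1000.0)
    return rtts


def percentile(samples, q):
    """Nearest-rank percentile of `samples` for q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_timeout(samples, q=99.9, margin=2.0):
    """Illustrative heuristic (an assumption): a tail percentile times a safety margin."""
    return percentile(samples, q) * margin


def main():
    addr = start_echo_server()
    rtts = measure_rtts(addr)
    for q in (50, 99, 99.9):
        print(f"p{q}: {percentile(rtts, q):.3f} ms")
    print(f"suggested timeout: {suggest_timeout(rtts):.3f} ms")
```

Calling `main()` runs the measurement over the loopback interface, which will show far less variance than a real cloud network. Pointing `measure_rtts` at an echo peer on another node is the more representative setup.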
