
Cooperative Graceful Degradation In Containerized Clouds (2312.12809v3)

Published 20 Dec 2023 in cs.NI and cs.DC

Abstract: Cloud resilience is crucial for cloud operators and the myriad of applications that rely on the cloud. Today, we lack a mechanism that enables cloud operators to perform graceful degradation of applications while satisfying the application's availability requirements. In this paper, we put forward a vision for automated cloud resilience management with cooperative graceful degradation between applications and cloud operators. First, we investigate techniques for graceful degradation and identify an opportunity for cooperative graceful degradation in public clouds. Second, leveraging criticality tags on containers, we propose diagonal scaling -- turning off non-critical containers during capacity crunch scenarios -- to maximize the availability of critical services. Third, we design Phoenix, an automated cloud resilience management system that maximizes critical service availability of applications while also considering operator objectives, thereby improving the overall resilience of the infrastructure during failures. We experimentally show that the Phoenix controller running atop Kubernetes can improve critical service availability by up to $2\times$ during large-scale failures. Phoenix can handle failures in a cluster of 100,000 nodes within 10 seconds. We also develop AdaptLab, an open-source resilience benchmarking framework that can emulate realistic cloud environments with real-world application dependency graphs.
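The diagonal scaling idea in the abstract (turning off non-critical containers when capacity shrinks) can be illustrated with a small sketch. This is not Phoenix's algorithm: it is a simple greedy pass under assumed conventions, and it ignores application dependency graphs and operator objectives, both of which the abstract says Phoenix considers. The `Container` type, the integer `criticality` tag (1 = most critical), the single `cpu` resource, and the `diagonal_scale` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    name: str
    criticality: int  # hypothetical tag convention: 1 = most critical
    cpu: float        # hypothetical single-resource demand


def diagonal_scale(containers, capacity):
    """Greedy sketch: keep containers in order of criticality until the
    remaining capacity runs out; everything that does not fit is turned off.

    Returns a (kept, turned_off) pair of lists.
    """
    kept, turned_off = [], []
    remaining = capacity
    # Most critical first; among equals, smaller containers first.
    for c in sorted(containers, key=lambda c: (c.criticality, c.cpu)):
        if c.cpu <= remaining:
            kept.append(c)
            remaining -= c.cpu
        else:
            turned_off.append(c)
    return kept, turned_off


# Example: a failure leaves 5.0 CPU units for an app that wants 7.0.
app = [
    Container("checkout", 1, 2.0),
    Container("search", 1, 1.5),
    Container("recommendations", 2, 1.0),
    Container("ads", 3, 2.0),
    Container("analytics", 3, 0.5),
]
kept, turned_off = diagonal_scale(app, capacity=5.0)
print("kept:", [c.name for c in kept])
print("turned off:", [c.name for c in turned_off])
```

In this toy run the critical services stay up and only the lowest-criticality container that no longer fits is switched off. A real controller would also have to decide where the kept containers are placed and whether a kept container's dependencies are still running.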

