
Lifting the Fog of Uncertainties: Dynamic Resource Orchestration for the Containerized Cloud (2309.16962v1)

Published 29 Sep 2023 in cs.DC

Abstract: Advances in virtualization technologies have sparked a growing transition from virtual machine (VM)-based to container-based infrastructure for cloud computing. From the resource orchestration perspective, containers' lightweight and highly configurable nature not only creates opportunities for more optimized strategies, but also poses greater challenges due to additional uncertainties and a larger configuration parameter search space. To this end, we propose Drone, a resource orchestration framework that adaptively configures resource parameters to improve application performance and reduce operational cost in the presence of cloud uncertainties. Built on Contextual Bandit techniques, Drone balances performance against resource cost on public clouds, and optimizes performance on private clouds where a hard resource constraint is present. We show that our algorithms achieve sub-linear growth in cumulative regret, a theoretically sound convergence guarantee, and our extensive experiments show that Drone achieves up to a 45% performance improvement and a 20% resource footprint reduction across batch processing jobs and microservice workloads.
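
The phrase "sub-linear growth in cumulative regret" can be made concrete with the standard contextual-bandit definition. The notation below is assumed for illustration only, since the abstract gives no symbols: $c_t$ is the context observed at round $t$ (for example, current workload and cloud conditions), $x_t$ is the resource configuration chosen from a candidate set $\mathcal{X}$, $f$ is the unknown reward function, and $x_t^{*}$ is the best configuration for that context.

```latex
R_T = \sum_{t=1}^{T} \bigl( f(x_t^{*}, c_t) - f(x_t, c_t) \bigr),
\qquad
x_t^{*} = \arg\max_{x \in \mathcal{X}} f(x, c_t),
\qquad
\lim_{T \to \infty} \frac{R_T}{T} = 0 .
```

Under this definition, sub-linear growth means $R_T = o(T)$: the average per-round gap between the chosen configuration and the best one shrinks toward zero as more rounds are observed. This is the sense in which such a bound acts as a convergence guarantee.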

