
Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach (2403.07927v1)

Published 29 Feb 2024 in cs.NI and cs.LG

Abstract: Cloud service owners need to continuously monitor their services to ensure high availability and reliability. Gaps in monitoring can delay incident detection and cause significant negative customer impact. The current process of monitor creation is ad hoc and reactive. Developers create monitors using tribal knowledge and, primarily, a trial-and-error process. As a result, monitors often have incomplete coverage, which leads to production issues, or redundancy, which results in noise and wasted effort. In this work, we address this issue by proposing an intelligent monitoring framework that recommends monitors for cloud services based on their service properties. We start by mining the attributes of 30,000+ monitors from 791 production services at Microsoft and derive a structured ontology for monitors. We focus on two crucial dimensions: what to monitor (resources) and which metrics to monitor. We conduct an extensive empirical study and derive key insights on the major classes of monitors employed by cloud services at Microsoft, their associated dimensions, and the interrelationship between service properties and this ontology. Using these insights, we propose a deep-learning-based framework that recommends monitors based on service properties. Finally, we conduct a user study with engineers from Microsoft, which demonstrates the usefulness of the proposed framework. The proposed framework, together with the ontology-driven projections, succeeded in creating production-quality recommendations for the majority of resource classes. This was also validated by the users in the study, who rated the framework's usefulness at 4.27 out of 5.
