FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems (2402.17583v1)

Published 27 Feb 2024 in cs.SE, cs.CL, and cs.LG

Abstract: Postmortem analysis is essential to incident management in cloud systems, as it provides valuable insights for improving system reliability and robustness. At CloudA, fault pattern profiling is performed during the postmortem phase: each incident's fault is classified into a unique category, referred to as a fault pattern. By aggregating and analyzing these fault patterns, engineers can discern common faults, vulnerable components, and emerging fault trends. However, this process is currently conducted by manual labeling, which has inherent drawbacks. On the one hand, the sheer volume of incidents means that only the most severe ones are analyzed, which skews the overview of fault patterns. On the other hand, the complexity of the task demands extensive domain knowledge, which leads to errors and inconsistencies. To address these limitations, we propose an automated approach, named FaultProfIT, for Fault pattern Profiling of Incident Tickets. It leverages hierarchy-guided contrastive learning to train a hierarchy-aware incident encoder and predicts fault patterns from the enhanced incident representations. We evaluate FaultProfIT using production incidents from CloudA. The results demonstrate that FaultProfIT outperforms state-of-the-art methods. Our ablation study and analysis also verify the effectiveness of hierarchy-guided contrastive learning. Additionally, we have deployed FaultProfIT at CloudA for six months. To date, FaultProfIT has analyzed 10,000+ incidents from 30+ cloud services, successfully revealing several fault trends that have informed system improvements.
