AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review (2404.01363v1)

Published 1 Apr 2024 in cs.OS, cs.AI, and cs.SE

Abstract: The management of modern IT systems poses unique challenges, necessitating scalability, reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on manual tasks and rule-based approaches, prove inefficient for the substantial data volumes and alerts generated by IT systems. Artificial Intelligence for IT Operations (AIOps) has emerged as a solution, leveraging advanced analytics like machine learning and big data to enhance incident management. AIOps detects and predicts incidents, identifies root causes, and automates healing actions, improving quality and reducing operational costs. However, despite its potential, the AIOps domain is still in its early stages, decentralized across multiple sectors, and lacking standardized conventions. Research and industrial contributions are distributed without consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework. The research also categorizes contributions based on criteria such as incident management tasks, application areas, data sources, and technical approaches. The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.

References (300)
  1. 2023. Amazon Kinesis. https://aws.amazon.com/fr/kinesis/
  2. 2023. Apache Kafka. https://kafka.apache.org/
  3. 2023. Apache Nifi. https://nifi.apache.org/
  4. 2023. Apache Superset. https://superset.apache.org/
  5. 2023. Azure Event Hubs. https://learn.microsoft.com/fr-fr/azure/event-hubs/
  6. 2023. Beats from the ELK stack. https://www.elastic.co/fr/beats/
  7. 2023. Build Lakehouses with Delta Lake. https://delta.io/
  8. 2023. Clickhouse. https://clickhouse.com/
  9. 2023. Elasticsearch. https://www.elastic.co/
  10. 2023. Fluentd. https://www.fluentd.org/
  11. 2023. Google Cloud Pub/Sub. https://cloud.google.com/pubsub
  12. 2023. Grafana. https://grafana.com/
  13. 2023. IBM. https://www.ibm.com/cloud/architecture/architectures/sm-aiops/reference-architecture/
  14. 2023. InfluxDB. https://www.influxdata.com/
  15. 2023. Kibana from the ELK stack. https://www.elastic.co/fr/kibana/
  16. 2023. Lucene Apache. https://lucene.apache.org/
  17. 2023. Metabase. https://www.metabase.com/
  18. 2023. RabbitMQ. https://www.rabbitmq.com/
  19. 2023. RocketMQ. https://rocketmq.apache.org/
  20. 2023. Telegraf from InfluxData. https://www.influxdata.com/time-series-platform/telegraf/
  21. An evaluation of similarity coefficients for software fault localization. In 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC’06). IEEE, 39–46.
  22. Spectrum-based multiple fault localization. In 2009 IEEE/ACM International Conference on Automated Software Engineering. IEEE, 88–99.
  23. Bug triage in open source systems: a review. International Journal of Collaborative Enterprise 4, 4 (2014), 299–319.
  24. Efficient Bug Triaging Using Text Mining. J. Softw. 8, 9 (2013), 2185–2190.
  25. Similarity measures for OLAP sessions. Knowledge and information systems 39, 2 (2014), 463–489.
  26. Adaptive on-line software aging prediction based on machine learning. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 507–516.
  27. Is it a bug or an enhancement? A text-based approach to classify change requests. In Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds. 304–318.
  28. Coping with an open bug repository. In Proceedings of the 2005 OOPSLA workshop on Eclipse technology eXchange. 35–39.
  29. Clustering-based materialized view selection in data warehouses. In East European conference on advances in databases and information systems. Springer, 81–95.
  30. Software aging in the eucalyptus cloud computing infrastructure: characterization and rejuvenation. ACM Journal on Emerging Technologies in Computing Systems (JETC) 10, 1 (2014), 1–22.
  31. X-ray: Automating root-cause diagnosis of performance anomalies in production software. In Presented as part of the 10th {normal-{\{{USENIX}normal-}\}} Symposium on Operating Systems Design and Implementation ({normal-{\{{OSDI}normal-}\}} 12). 307–320.
  32. Martin Atzmueller. 2015. Subgroup discovery. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 5, 1 (2015), 35–49.
  33. Usad: Unsupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3395–3404.
  34. Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing 1, 1 (2004), 11–33.
  35. Towards highly reliable enterprise network services via inference of multi-level dependencies. ACM SIGCOMM Computer Communication Review 37, 4 (2007), 13–24.
  36. Automated duplicate bug report classification using subsequence matching. In 2012 IEEE 14th International Symposium on High-Assurance Systems Engineering. IEEE, 74–81.
  37. Anomaly detection of network-initiated LTE signaling traffic in wireless sensor and actuator networks based on a Hidden semi-Markov Model. Computers & Security 65 (2017), 108–120.
  38. On the time-based conclusion stability of cross-project defect prediction models. Empirical Software Engineering 25, 6 (2020), 5047–5083.
  39. Finding Similar Failures Using Callstack Similarity. In SysML.
  40. Mehdi Bateni and Ahmad Baraani. 2013. Time Window Management for Alert Correlation using Context Information and Classification. International Journal of Computer Network & Information Security 5, 11 (2013).
  41. Towards aiops in edge computing environments. In 2020 IEEE International Conference on Big Data (Big Data). IEEE, 3470–3475.
  42. Towards developing a novel framework for practical phm: A sequential decision problem solved by reinforcement learning and artificial neural networks. International Journal of Prognostics and Health Management 10, 4 (2019).
  43. On-premise Infrastructure for AIOps in a Software Editor SME: An experience report. In 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESECFSE.
  44. Tarek Berghout and Mohamed Benbouzid. 2022. A systematic guide for predicting remaining useful life with machine learning. Electronics 11, 7 (2022), 1125.
  45. Pamela Bhattacharya and Iulian Neamtiu. 2010. Fine-grained incremental learning and multi-feature tossing graphs to improve bug triaging. In 2010 IEEE International Conference on Software Maintenance. IEEE, 1–10.
  46. Artificial Intelligence for IT Operations (AIOPS) Workshop White Paper. arXiv preprint arXiv:2101.06054 (2021).
  47. Enriching word vectors with subword information. Transactions of the association for computational linguistics 5 (2017), 135–146.
  48. Data classification and MTBF prediction with a multivariate analysis approach. Reliability Engineering & System Safety 97, 1 (2012), 27–35.
  49. A unified framework for coupling measurement in object-oriented systems. IEEE Transactions on software Engineering 25, 1 (1999), 91–121.
  50. Quickly Finding Known Software Problems via Automated Symptom Matching. In Second International Conference on Autonomic Computing (ICAC 2005). IEEE Computer Society, 101–110.
  51. A maintenance planning framework using online and offline deep reinforcement learning. Neural Computing and Applications (2023), 1–12.
  52. D Cappuccio. 2013. Ensure cost balances out with risk in highavailability data centers. Gartner, July (2013).
  53. Saul Carliner. 2004. An overview of online learning. (2004).
  54. Proactive management of software aging. IBM Journal of Research and Development 45, 2 (2001), 311–332.
  55. Formal concept analysis enhances fault localization in software. In Formal Concept Analysis: 6th International Conference, ICFCA 2008, Montreal, Canada, February 25-28, 2008. Proceedings 6. Springer, 273–288.
  56. Failure prediction of data centers using time series and fault tree analysis. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems. IEEE, 794–799.
  57. Anomaly detection: A survey. ACM computing surveys (CSUR) 41, 3 (2009), 1–58.
  58. F-fade: Frequency factorization for anomaly detection in edge streams. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 589–597.
  59. Compressing SQL workloads. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data. 488–499.
  60. Continuous incident triage for large-scale online service systems. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 364–375.
  61. How incidental are the incidents? characterizing and prioritizing incidents for large-scale online service systems. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 373–384.
  62. Causeinfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. In IEEE INFOCOM 2014-IEEE Conference on Computer Communications. IEEE, 1887–1895.
  63. Detecting performance anti-patterns for applications developed using object-relational mapping. In Proceedings of the 36th International Conference on Software Engineering. 1001–1012.
  64. Outage prediction and diagnosis for cloud service systems. In The World Wide Web Conference. 2659–2665.
  65. Aiops innovations of incident management for cloud services. (2020).
  66. Towards intelligent incident management: why we need it and how we make it. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1487–1497.
  67. Shyam R Chidamber and Chris F Kemerer. 1994. A metrics suite for object oriented design. IEEE Transactions on software engineering 20, 6 (1994), 476–493.
  68. The mystery machine: End-to-end performance analysis of large-scale internet services. In 11th {normal-{\{{USENIX}normal-}\}} symposium on operating systems design and implementation ({normal-{\{{OSDI}normal-}\}} 14). 217–231.
  69. Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control.. In OSDI, Vol. 4. 16–16.
  70. Nick Craswell. 2009. Mean Reciprocal Rank. Springer US, 1703–1703.
  71. Logram: Efficient Log Parsing Using n𝑛nitalic_n n-Gram Dictionaries. IEEE Transactions on Software Engineering 48, 3 (2020), 879–892.
  72. An extensive comparison of bug prediction approaches. In 2010 7th IEEE working conference on mining software repositories (MSR 2010). IEEE, 31–41.
  73. AIOps: real-world challenges and research innovations. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 4–5.
  74. ReBucket: A method for clustering duplicate crash reports based on call stack similarity. In 34th International Conference on Software Engineering, ICSE. 1084–1093.
  75. A survey on data-driven predictive maintenance for the railway industry. Sensors 21, 17 (2021), 5739.
  76. Failuresim: a system for predicting hardware failures in cloud data centers using neural networks. In 2017 IEEE 10th International Conference on Cloud Computing (CLOUD). IEEE, 544–551.
  77. Tijl De Bie. 2011. Maximum entropy models and subjective interestingness. Data Mining and Knowledge Discovery 23, 3 (2011), 407–446.
  78. Comprehensive and Efficient Workload Compression. Proc. VLDB Endow. 14, 3 (2020), 418–430.
  79. Toward comprehensible software fault prediction models using bayesian network classifiers. IEEE Transactions on Software Engineering 39, 2 (2012), 237–257.
  80. Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4027–4035.
  81. Data mining and cross-checking of execution traces: a re-interpretation of jones, harrold and stasko test information. In Proceedings of the 20th IEEE/ACM International Conference on Automated software engineering. 396–399.
  82. Classifying field crash reports for fixing bugs: A case study of Mozilla Firefox. In IEEE 27th International Conference on Software Maintenance, ICSM. 333–342.
  83. Orfeon: An AIOps framework for the goal-driven operationalization of distributed analytical pipelines. Future Generation Computer Systems 140 (2023), 18–35.
  84. Healing online service systems via mining historical issue repositories. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. 318–321.
  85. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. 1285–1298.
  86. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empirical Software Engineering 17 (2012), 531–577.
  87. Karim O Elish and Mahmoud O Elish. 2008. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software 81, 5 (2008), 649–660.
  88. Stephen Elliot. 2014. DevOps and the cost of downtime: Fortune 1000 best practice metrics quantified. International Data Corporation (IDC) (2014).
  89. Metric selection and anomaly detection for cloud operations using log and metric correlation analysis. Journal of Systems and Software 137 (2018), 531–549.
  90. Olga Fink. 2020. Data-driven intelligent predictive maintenance of industrial assets. Women in Industrial and Systems Engineering: Key Advances and Perspectives on Emerging Topics (2020), 589–605.
  91. Anomaly detection: How to artificially increase your f1-score with a biased evaluation protocol. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part IV. Springer, 3–18.
  92. Failure prediction based on log files using random indexing and support vector machines. Journal of Systems and Software 86, 1 (2013), 2–11.
  93. Execution anomaly detection in distributed systems through unstructured log analysis. In 2009 ninth IEEE international conference on data mining. IEEE, 149–158.
  94. A survey of fault diagnosis and fault-tolerant techniques—Part I: Fault diagnosis with model-based and signal-based approaches. IEEE transactions on industrial electronics 62, 6 (2015), 3757–3767.
  95. A methodology for detection and estimation of software aging. In Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No. 98TB100257). IEEE, 283–292.
  96. Predicting fault incidence using software change history. IEEE Transactions on software engineering 26, 7 (2000), 653–661.
  97. The fundamentals of software aging. In 2008 IEEE International conference on software reliability engineering workshops (ISSRE Wksp). Ieee, 1–6.
  98. A Survey of Methods for Explaining Black Box Models. ACM Comput. Surv. 51, 5 (2019), 93:1–93:42.
  99. Logbert: Log anomaly detection via bert. In 2021 international joint conference on neural networks (IJCNN). IEEE, 1–8.
  100. Improving software maintenance with improved bug triaging. Journal of King Saud University-Computer and Information Sciences 34, 10 (2022), 8757–8764.
  101. Maurice H Halstead. 1977. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc.
  102. Nodoze: Combatting threat alert fatigue with automated provenance triage. In network and distributed systems security symposium.
  103. Duplicate bug report detection using dual-channel convolutional neural networks. In Proceedings of the 28th International Conference on Program Comprehension. 117–127.
  104. A survey on automated log analysis for reliability engineering. ACM computing surveys (CSUR) 54, 6 (2021), 1–37.
  105. Identifying impactful service system problems via log analysis. In Proceedings of the 2018 26th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 60–70.
  106. Experience report: System log analysis for anomaly detection. In 2016 IEEE 27th international symposium on software reliability engineering (ISSRE). IEEE, 207–218.
  107. Lyndon Hiew. 2006. Assisted detection of duplicate bug reports. Ph. D. Dissertation. University of British Columbia.
  108. Aaron Huff. 2015. Breaking down the cost. Commercial Carrier Journal (2015).
  109. Performance anomaly detection and bottleneck identification. ACM Computing Surveys (CSUR) 48, 1 (2015), 1–35.
  110. Tariqul Islam and Dakshnamoorthy Manivannan. 2017. Predicting application failure in cloud: A machine learning approach. In 2017 IEEE International Conference on Cognitive Computing (ICCC). IEEE, 24–31.
  111. Query2Vec: An Evaluation of NLP Techniques for Generalized Workload Analytics. arXiv preprint arXiv:1801.05613 (2018).
  112. Nicholas Jalbert and Westley Weimer. 2008. Automated duplicate detection for bug tracking systems. In 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN). IEEE, 52–61.
  113. Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–13.
  114. Improving bug triage with bug tossing graphs. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. 111–120.
  115. ExplainIt!–A declarative root-cause analysis engine for time series data. In Proceedings of the 2019 International Conference on Management of Data. 333–348.
  116. James A Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering. 273–282.
  117. Visualization of test information to assist fault localization. In Proceedings of the 24th international conference on Software engineering. 467–477.
  118. HSFal: Effective fault localization using hybrid spectrum of full slices and execution slices. Journal of Systems and Software 90 (2014), 3–17.
  119. Automated memory leak detection for production use. In 36th International Conference on Software Engineering, ICSE ’14, Hyderabad, India - May 31 - June 07, 2014, Pankaj Jalote, Lionel C. Briand, and André van der Hoek (Eds.). ACM, 825–836.
  120. Amin Karami and Manel Guerrero-Zapata. 2015. A fuzzy anomaly detection system based on hybrid PSO-Kmeans algorithm in content-centric networks. Neurocomputing 149 (2015), 1253–1269.
  121. Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset. In 2017 IEEE international conference on cybernetics and computational intelligence (CyberneticsCom). IEEE, 19–23.
  122. Revisiting Numerical Pattern Mining with Formal Concept Analysis. In IJCAI. IJCAI/AAAI, 1342–1347.
  123. Machine learning-based approach for hardware faults prediction. IEEE Transactions on Circuits and Systems I: Regular Papers 67, 11 (2020), 3880–3892.
  124. Taghi M Khoshgoftaar and David L Lanning. 1995. A neural network approach for early detection of program modules having high risk in the maintenance phase. Journal of Systems and Software 29, 1 (1995), 85–91.
  125. S3M: Siamese Stack (Trace) Similarity Measure. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR. 266–270.
  126. Root cause detection in a service-oriented architecture. ACM SIGMETRICS Performance Evaluation Review 41, 1 (2013), 93–104.
  127. Crash graphs: An aggregated view of multiple crashes to improve crash triage. In Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN. 486–493.
  128. Ettu: Analyzing query intents in corporate databases. In Proceedings of the 25th international conference companion on world wide web. 463–466.
  129. Diagnosing network-wide traffic anomalies. ACM SIGCOMM computer communication review 34, 4 (2004), 219–230.
  130. Mining anomalies using traffic feature distributions. ACM SIGCOMM computer communication review 35, 4 (2005), 217–228.
  131. Comparing mining algorithms for predicting the severity of a reported bug. In 2011 15th European Conference on Software Maintenance and Reengineering. IEEE, 249–258.
  132. Applying deep learning based automatic bug triager to industrial projects. In ESEC/FSE 2017. ACM, 926–931.
  133. Detecting memory leaks through introspective dynamic behavior modelling using machine learning. In Proceedings of the 36th International Conference on Software Engineering. 814–824.
  134. Wan-Jui Lee. 2017. Anomaly detection and severity prediction of air leakage in train braking pipes. International Journal of Prognostics and Health Management 8, 3 (2017).
  135. Johannes Lerch and Mira Mezini. 2013. Finding Duplicates of Your Yet Unwritten Bug Report. In 17th European Conference on Software Maintenance and Reengineering, CSMR. 69–78.
  136. AIOps for a cloud object storage service. In 2019 IEEE International Congress on Big Data (BigDataCongress). IEEE, 165–169.
  137. Software defect prediction via convolutional neural network. In 2017 IEEE international conference on software quality, reliability and security (QRS). IEEE, 318–328.
  138. Being accurate is not enough: New metrics for disk failure prediction. In 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS). IEEE, 71–80.
  139. Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition. arXiv preprint arXiv:2206.05871 (2022).
  140. Data-driven techniques in computing system management. ACM Computing Surveys (CSUR) 50, 3 (2017), 1–43.
  141. Deepfl: Integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis. 169–180.
  142. Data alignments in machinery remaining useful life prediction using deep adversarial neural networks. Knowledge-Based Systems 197 (2020), 105843.
  143. Predicting node failures in an ultra-large-scale cloud computing platform: an aiops solution. ACM Transactions on Software Engineering and Methodology (TOSEM) 29, 2 (2020), 1–24.
  144. Fault localization with code coverage representation learning. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 661–673.
  145. Ze Li and Yingnong Dang. 2019. Aiops: Challenges and experiences in azure. USENIX Association, Santa Clara (2019), 51–53.
  146. Zhiguo Li and Qing He. 2015. Prediction of railcar remaining useful life by multiple data source fusion. IEEE Transactions on Intelligent Transportation Systems 16, 4 (2015), 2226–2235.
  147. Generic and robust localization of multi-dimensional root causes. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 47–57.
  148. Actionable and interpretable fault localization for recurring failures in online service systems. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 996–1008.
  149. Multivariate time series anomaly detection and interpretation using hierarchical inter-metric and temporal embedding. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 3220–3230.
  150. Robust and rapid clustering of kpis for large-scale anomaly detection. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS). IEEE, 1–10.
  151. Identifying recurrent and unknown performance issues. In 2014 IEEE International Conference on Data Mining. IEEE, 320–329.
  152. Hardware remediation at scale. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 14–17.
  153. Predicting node failure in cloud service systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 480–490.
  154. iDice: problem identification for emerging issues. In Proceedings of the 38th International Conference on Software Engineering. 214–224.
  155. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion. 102–111.
  156. Collaborative alert ranking for anomaly detection. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1987–1995.
  157. Statistical debugging: A hypothesis testing-based approach. IEEE Transactions on software engineering 32, 10 (2006), 831–848.
  158. Opprentice: Towards practical and automatic anomaly detection through machine learning. In Proceedings of the 2015 internet measurement conference. 211–224.
  159. What bugs cause production cloud incidents?. In Proceedings of the Workshop on Hot Topics in Operating Systems. 155–162.
  160. Mining Invariants from Console Logs for System Problem Detection.. In USENIX annual technical conference. 1–14.
  161. Software analytics for incident management of online services: An experience report. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 475–485.
  162. Making disk failure predictions smarter!. In FAST. 151–167.
  163. Correlating events with time series for incident diagnosis. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1583–1592.
  164. An empirical study of the impact of data splitting decisions on the performance of AIOps solutions. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 4 (2021), 1–38.
  165. Towards a consistent interpretation of aiops models. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 1 (2021), 1–38.
  166. Meng Ma and Zhu Mao. 2020. Deep-convolution-based LSTM network for remaining useful life prediction. IEEE Transactions on Industrial Informatics 17, 3 (2020), 1658–1667.
  167. Diagnosing root causes of intermittent slow queries in cloud databases. Proceedings of the VLDB Endowment 13, 8 (2020), 1176–1189.
  168. Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques. Journal of Computer Languages 53 (2019), 38–52.
  169. SLDeep: Statement-level software defect prediction using deep-learning model on static code features. Expert Systems with Applications 147 (2020), 113156.
  170. A lightweight algorithm for message type extraction in system application logs. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2011), 1921–1936.
  171. Diagnosing memory leaks using graph mining on heap dumps. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. 115–124.
  172. Thomas J McCabe. 1976. A complexity measure. IEEE Transactions on software Engineering 4 (1976), 308–320.
  173. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs.. In IJCAI, Vol. 19. 4739–4745.
  174. Tim Menzies. 2019. The five laws of SE for AI. IEEE Software 37, 1 (2019), 81–85.
  175. Data mining static code attributes to learn defect predictors. IEEE transactions on software engineering 33, 1 (2006), 2–13.
  176. Duplicate bug report detection using an attention-based neural language model. IEEE Transactions on Reliability (2022).
  177. A search-based approach for accurate identification of log message formats. In Proceedings of the 26th Conference on Program Comprehension. 167–177.
  178. A large-scale study of flash memory failures in the field. ACM SIGMETRICS Performance Evaluation Review 43, 1 (2015), 177–190.
  179. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  180. Alert correlation algorithms: A survey and taxonomy. In Cyberspace Safety and Security: 5th International Symposium, CSS 2013, Zhangjiajie, China, November 13-15, 2013, Proceedings 5. Springer, 183–197.
  181. Automatically Identifying Known Software Problems. In Proceedings of the 23rd International Conference on Data Engineering Workshops. IEEE Computer Society, 433–441.
  182. Christoph Molnar. 2019. Interpretable Machine Learning.
  183. Reranking-based Crash Report Deduplication. In The 29th International Conference on Software Engineering and Knowledge Engineering, Xudong He (Ed.). KSI Research Inc. and Knowledge Systems Institute Graduate School, 507–510.
  184. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 30th international conference on Software engineering. 181–190.
  185. Mukosi Abraham Mukwevho and Turgay Celik. 2018. Toward a smart cloud: A review of fault-tolerance methods in cloud systems. IEEE Transactions on Services Computing 14, 2 (2018), 589–605.
  186. G Murphy and Davor Cubranic. 2004. Automatic bug triage using text categorization. In Proceedings of the Sixteenth International Conference on Software Engineering & Knowledge Engineering. Citeseer, 1–6.
  187. Mining metrics to predict component failures. In Proceedings of the 28th international conference on Software engineering. 452–461.
  188. Transfer defect learning. In 2013 35th international conference on software engineering (ICSE). IEEE, 382–391.
  189. Anomaly detection using program control flow graph mining from execution logs. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 215–224.
  190. On fault representativeness of software fault injection. IEEE Transactions on Software Engineering 39, 1 (2012), 80–96.
  191. Multi-source distributed system data for ai-powered analytics. In Service-Oriented and Cloud Computing: 8th IFIP WG 2.14 European Conference, ESOCC 2020, Heraklion, Crete, Greece, September 28–30, 2020, Proceedings 8. Springer, 161–176.
  192. Saul B Needleman and Christian D Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology 48, 3 (1970), 443–453.
  193. Fault localization using N-gram analysis. In Proceedings of the 3rd International Conference on Wireless Algorithms, Systems, and Applications. 548–559.
  194. FChain: Toward black-box online fault localization for cloud systems. In 2013 IEEE 33rd International Conference on Distributed Computing Systems. IEEE, 21–30.
  195. A Systematic Mapping Study in AIOps. arXiv preprint arXiv:2012.09108 (2020).
  196. A survey of AIOps methods for failure management. ACM Transactions on Intelligent Systems and Technology (TIST) 12, 6 (2021), 1–45.
  197. A hybrid ARIMA–SVM model for the study of the remaining useful life of aircraft engines. J. Comput. Appl. Math. 346 (2019), 184–191.
  198. Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering 31, 4 (2005), 340–355.
  199. Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
  200. Robust feature selection and robust PCA for internet traffic anomaly detection. In 2012 Proceedings IEEE INFOCOM. IEEE, 1755–1763.
  201. Anomaly detection using the correlational paraconsistent machine with digital signatures of network segment. Information Sciences 420 (2017), 313–328.
  202. DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services. In ACM SIGKDD. 3281–3289.
  203. Classifying bug reports to bugs and other requests using topic modeling. In 2013 20th Asia-Pacific Software Engineering Conference (APSEC), Vol. 2. IEEE, 13–18.
  204. Hora: Architecture-aware online failure prediction. Journal of Systems and Software 137 (2018), 669–685.
  205. Pankaj Prasad and Charley Rich. 2018. Market Guide for AIOps Platforms. Retrieved March 12 (2018), 2020.
  206. Xianping Qu and Jingjing Ha. 2017. Next generation of DevOps: AIOps in practice @ Baidu. SREcon17 (2017).
  207. On the "naturalness" of buggy code. In Proceedings of the 38th International Conference on Software Engineering. 428–439.
  208. Lena Reiter and FH Wedel. 2021. AIOps–A Systematic Literature Review. (2021).
  209. Subjectively Interesting Subgroups with Hierarchical Targets: Application to Java Memory Analysis. In International Conference on Data Mining Workshops, ICDMW. IEEE.
  210. "What makes my queries slow?": Subgroup Discovery for SQL Workload Analysis. In 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021, Melbourne, Australia, November 15-19, 2021. IEEE, 642–652.
  211. DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection. In 46th International Conference on Software Engineering, ICSE.
  212. Interpretable Summaries of Black Box Incident Triaging with Subgroup Discovery. In 8th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2021, Porto, Portugal, October 6-9, 2021. IEEE, 1–10.
  213. Time-series anomaly detection service at microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3009–3017.
  214. A Survey of Deep Active Learning. ACM Comput. Surv. 54, 9 (2022), 180:1–180:40.
  215. Manos Renieres and Steven P Reiss. 2003. Fault localization with nearest neighbor queries. In 18th IEEE International Conference on Automated Software Engineering, 2003. Proceedings. IEEE, 30–39.
  216. Aiops: A multivocal literature review. Artificial Intelligence for Cloud and Edge Computing (2022), 31–50.
  217. Sensitivity of PCA for traffic anomaly detection. In Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems. 109–120.
  218. Detection of duplicate defect reports using natural language processing. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 499–510.
  219. DURFEX: A Feature Extraction Technique for Efficient Detection of Duplicate Bug Reports. In 2017 IEEE International Conference on Software Quality, Reliability and Security, QRS 2017. 240–250.
  220. A survey of online failure prediction methods. ACM Computing Surveys (CSUR) 42, 3 (2010), 1–42.
  221. Areeg Samir and Claus Pahl. 2019. A controller architecture for anomaly detection, root cause analysis and self-adaptation for cluster architectures. In Intl Conf Adaptive and Self-Adaptive Systems and Applications.
  222. Mark Sanderson. 2010. Test Collection Based Evaluation of Information Retrieval Systems. Found. Trends Inf. Retr. 4, 4 (2010), 247–375.
  223. e-diagnosis: Unsupervised and real-time diagnosis of small-window long-tail latency in large-scale microservice platforms. In The World Wide Web Conference. 3215–3222.
  224. Efficient ticket routing by resolution sequence mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 605–613.
  225. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 1–12.
  226. Evolving from Traditional Systems to AIOps: Design, Implementation and Measurements. In 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA). IEEE, 276–280.
  227. Sufficient mutation operators for measuring test effectiveness. In Proceedings of the 30th international conference on Software engineering. 351–360.
  228. Anomaly detection in streams with extreme value theory. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 1067–1075.
  229. Jeongju Sohn and Shin Yoo. 2017. Fluccs: Using code and change metrics to improve fault localization. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. 273–283.
  230. Survey on models and techniques for root-cause analysis. arXiv preprint arXiv:1701.08546 (2017).
  231. Vladimir Sor and Satish Narayana Srirama. 2014. Memory leak detection in Java: Taxonomy and classification of approaches. J. Syst. Softw. 96 (2014), 139–151.
  232. Charles Spearman. 1961. The proof and measurement of association between two things. (1961).
  233. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2828–2837.
  234. CoFlux: robustly correlating KPIs by fluctuations for service troubleshooting. In Proceedings of the International Symposium on Quality of Service. 1–10.
  235. Software rejuvenation in cloud systems using neural networks. In 2014 International Conference on Parallel, Distributed and Grid Computing. IEEE, 230–233.
  236. Towards more accurate retrieval of duplicate bug reports. In 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011). IEEE, 253–262.
  237. System-level hardware failure prediction using deep learning. In Proceedings of the 56th Annual Design Automation Conference 2019. 1–6.
  238. Hotspot: Anomaly localization for additive kpis with multi-dimensional attributes. IEEE Access 6 (2018), 10909–10923.
  239. Ashish Sureka and Pankaj Jalote. 2010. Detecting duplicate bug report using character n-gram-based features. In 2010 Asia Pacific software engineering conference. IEEE, 366–374.
  240. A system for denial-of-service attack detection based on multivariate correlation analysis. IEEE transactions on parallel and distributed systems 25, 2 (2013), 447–456.
  241. LogSig: Generating system events from raw textual logs. In Proceedings of the 20th ACM international conference on Information and knowledge management. 785–794.
  242. Information retrieval based nearest neighbor classification for fine-grained bug severity prediction. In 2012 19th Working Conference on Reverse Engineering. IEEE, 215–224.
  243. Drone: Predicting priority of reported bugs by multi-factor analysis. In 2013 IEEE International Conference on Software Maintenance. IEEE, 200–209.
  244. Software defect prediction employing BiLSTM and BERT-based semantic feature. Soft Computing 26, 16 (2022), 7877–7891.
  245. Risto Vaarandi. 2003. A data clustering algorithm for mining patterns from event logs. In Proceedings of the 3rd IEEE Workshop on IP Operations & Management (IPOM 2003) (IEEE Cat. No. 03EX764). IEEE, 119–126.
  246. Kalyanaraman Vaidyanathan and Kishor S Trivedi. 1999. A measurement-based model for estimation of resource exhaustion in operational software systems. In Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No. PR00443). IEEE, 84–93.
  247. A Novel Technique for Long-Term Anomaly Detection in the Cloud. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 14).
  248. TraceSim: a method for calculating stack trace similarity. In Proceedings of the 4th ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, FSE 2020. 25–30.
  249. Matching networks for one shot learning. Advances in neural information processing systems 29 (2016).
  250. Kashi Venkatesh Vishwanath and Nachiappan Nagappan. 2010. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM symposium on Cloud computing. 193–204.
  251. Network anomaly detection: A survey and comparative analysis of stochastic and deterministic methods. In 52nd IEEE Conference on Decision and Control. IEEE, 182–187.
  252. Constructing the knowledge base for cognitive it service management. In 2017 IEEE International Conference on Services Computing (SCC). IEEE, 410–417.
  253. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. 297–308.
  254. FixerCache: Unsupervised caching active developers for diverse bug triage. In Proceedings of the 8th ACM/IEEE international symposium on empirical software engineering and measurement. 1–10.
  255. End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In 2017 IEEE international conference on intelligence and security informatics (ISI). IEEE, 43–48.
  256. Identifying Erroneous Software Changes through Self-Supervised Contrastive Learning on Time Series Data. In 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 366–377.
  257. Online anomaly detection for hard disk drives based on mahalanobis distance. IEEE Transactions on Reliability 62, 1 (2013), 136–145.
  258. Frank Wilcoxon. 1992. Individual comparisons by ranking methods. Springer.
  259. The DStar method for effective software fault localization. IEEE Transactions on Reliability 63, 1 (2013), 290–308.
  260. Towards better fault localization: A crosstab-based statistical approach. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 3 (2011), 378–396.
  261. A survey on software fault localization. IEEE Transactions on Software Engineering 42, 8 (2016), 707–740.
  262. Stefan Wrobel. 1997. An Algorithm for Multi-relational Discovery of Subgroups. In Principles of Data Mining and Knowledge Discovery, First European Symposium, PKDD ’97, Trondheim, Norway, June 24-27, 1997, Proceedings, Vol. 1263. Springer, 78–87.
  263. Detecting leaders from correlated time series. In Database Systems for Advanced Applications: 15th International Conference, DASFAA 2010, Tsukuba, Japan, April 1-4, 2010, Proceedings, Part I 15. Springer, 352–367.
  264. CrashLocator: locating crashing faults based on crash stacks. In International Symposium on Software Testing and Analysis, ISSTA. 204–214.
  265. Remaining useful life estimation of engineered systems using vanilla LSTM neural networks. Neurocomputing 275 (2018), 167–179.
  266. Bug triaging based on tossing sequence modeling. Journal of Computer Science and Technology 34 (2019), 942–956.
  267. Loggan: a log-level generative adversarial network for anomaly detection using permutation event modeling. Information Systems Frontiers 23 (2021), 285–298.
  268. Improving automated bug triaging with specialized topic model. IEEE Transactions on Software Engineering 43, 3 (2016), 272–297.
  269. Disk failure prediction in data centers via online learning. In Proceedings of the 47th International Conference on Parallel Processing. 1–10.
  270. Distributed segment-based anomaly detection with Kullback–Leibler divergence in wireless sensor networks. IEEE Transactions on Information Forensics and Security 12, 1 (2016), 101–110.
  271. Detecting duplicate bug reports with convolutional neural networks. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 416–425.
  272. Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 65, 11 (2016), 3502–3508.
  273. Guoqing Xu and Atanas Rountev. 2013. Precise memory leak detection for java software using container profiling. ACM Trans. Softw. Eng. Methodol. 22, 3 (2013), 17:1–17:28.
  274. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 world wide web conference. 187–196.
  275. Defect prediction with semantics and context features of codes based on graph representation learning. IEEE Transactions on Reliability 70, 2 (2020), 613–625.
  276. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. 117–132.
  277. Towards effective bug triage with software data reduction techniques. IEEE transactions on knowledge and data engineering 27, 1 (2014), 264–280.
  278. Automatic Bug Triage using Semi-Supervised Text Classification. In SEKE. 209–214.
  279. Jifeng Xuan and Martin Monperrus. 2014. Learning to combine multiple ranking metrics for fault localization. In 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 191–200.
  280. Towards semi-automatic bug triage and severity prediction based on topic model and multi-feature of bug reports. In 2014 IEEE 38th Annual Computer Software and Applications Conference. IEEE, 97–106.
  281. Fast and accurate anomaly detection in dynamic graphs with a two-pronged approach. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 647–657.
  282. Statistical fault localization using execution sequence. In 2012 International Conference on Machine Learning and Cybernetics, Vol. 3. IEEE, 899–905.
  283. Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2672–2681.
  284. Knowledge guided hierarchical multi-label classification over ticket data. IEEE Transactions on Network and Service Management 14, 2 (2017), 246–260.
  285. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 1409–1416.
  286. A survey on bug-report analysis. Sci. China Inf. Sci. 58, 2 (2015), 1–24.
  287. Automated IT system failure prediction: A deep learning approach. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 1291–1300.
  288. Syslog processing for switch failure diagnosis and prediction in datacenter networks. In 2017 IEEE/ACM 25th International Symposium on Quality of Service (IWQoS). IEEE, 1–10.
  289. Sai Zhang and Congle Zhang. 2014. Software bug localization with markov logic. In Companion Proceedings of the 36th International Conference on Software Engineering. 424–427.
  290. A literature review of research in bug resolution: Tasks, challenges and future directions. Comput. J. 59, 5 (2016), 741–773.
  291. Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 807–817.
  292. Understanding and handling alert storm for online service systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice. 162–171.
  293. Real-time incident prediction for online service systems. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 315–326.
  294. Identifying bad software changes via multimodal anomaly detection for online service systems. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 527–539.
  295. Automatically and adaptively identifying severe alerts for online service systems. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE, 2420–2429.
  296. An empirical investigation of practical log anomaly detection for online service systems. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1404–1415.
  297. Predicting disk failures with HMM-and HSMM-based approaches. In Advances in Data Mining. Applications and Theoretical Aspects: 10th Industrial Conference, ICDM 2010, Berlin, Germany, July 12-14, 2010. Proceedings 10. Springer, 390–404.
  298. Long short-term memory network for remaining useful life estimation. In 2017 IEEE international conference on prognostics and health management (ICPHM). IEEE, 88–95.
  299. Jian Zhou and Hongyu Zhang. 2012. Learning to rank duplicate bug reports. In Proceedings of the 21st ACM international conference on Information and knowledge management. 852–861.
  300. Resolution recommendation for event tickets in service management. IEEE Transactions on Network and Service Management 13, 4 (2016), 954–967.
