Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning

Published 27 Mar 2024 in cs.SE, cs.AI, and cs.LG | (2403.18998v4)

Abstract: Microservice-based systems (MSS) may fail with various fault types. While existing AIOps methods excel at detecting abnormal traces and locating the responsible service(s), human effort is still required to diagnose specific fault types and failure causes. This paper presents TraFaultDia, a novel AIOps framework that automatically classifies abnormal traces into fault categories for MSS. We treat the classification process as a series of multi-class classification tasks, where each task represents an attempt to classify abnormal traces into specific fault categories for an MSS. TraFaultDia leverages meta-learning to train on several abnormal trace classification tasks with a few labeled instances from one MSS, enabling quick adaptation to new, unseen abnormal trace classification tasks with a few labeled instances across MSS. TraFaultDia's use cases scale with how fault categories are built from anomalies within MSS. We evaluated TraFaultDia on two MSS, TrainTicket and OnlineBoutique, with open datasets in which each fault category is linked to faulty system components (service/pod) and a root cause. TraFaultDia automatically classifies abnormal traces into these fault categories, thus enabling the automatic identification of faulty system components and root causes without manual analysis. When trained within the same MSS with 10 labeled instances per category, TraFaultDia achieves 93.26% and 85.20% accuracy on 50 new classification tasks for TrainTicket and OnlineBoutique, respectively. In the cross-system context, when TraFaultDia is applied to an MSS different from the one it was trained on, it achieves an average accuracy of 92.19% and 84.77% on the same set of 50 new, unseen abnormal trace classification tasks of the respective systems, again with 10 labeled instances provided for each fault category per task in each system.
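
The task framing in the abstract (many small multi-class tasks, each with a few labeled instances per fault category) can be illustrated with a minimal sketch of how one such task might be sampled. This is not the authors' implementation: the function name `sample_task`, the toy trace identifiers, the fault category names, and the choice of 5 categories and 5 query traces per category are all assumptions made for illustration. Only the figure of 10 labeled instances per category comes from the abstract.

```python
import random
from collections import defaultdict


def sample_task(labeled_traces, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task from (trace, fault_category) pairs.

    Returns (support, query): lists of (trace, task_local_label) pairs.
    The support set holds the few labeled instances used for adaptation;
    the query set holds disjoint traces used to evaluate the adapted model.
    """
    by_category = defaultdict(list)
    for trace, category in labeled_traces:
        by_category[category].append(trace)

    # Keep only categories that have enough traces for both sets.
    eligible = sorted(
        c for c, traces in by_category.items() if len(traces) >= k_shot + n_query
    )
    categories = rng.sample(eligible, n_way)

    support, query = [], []
    for label, category in enumerate(categories):
        picked = rng.sample(by_category[category], k_shot + n_query)
        support += [(trace, label) for trace in picked[:k_shot]]
        query += [(trace, label) for trace in picked[k_shot:]]
    return support, query


# Hypothetical toy data: 8 fault categories with 30 abnormal traces each.
toy_traces = [
    (f"trace-F{c}-{i}", f"F{c}") for c in range(8) for i in range(30)
]

rng = random.Random(0)
support, query = sample_task(toy_traces, n_way=5, k_shot=10, n_query=5, rng=rng)
```

Under this framing, a meta-learner would be trained over many such sampled tasks and then adapted to a new task using only its support set, which is what allows evaluation on tasks drawn from a different system.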
