FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests (2403.01003v1)
Abstract: Flaky tests can pass or fail non-deterministically without any change to the software under test. Such tests are frequently encountered by developers and undermine the credibility of test suites. State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy. Moreover, most automated flaky test repair solutions are designed for specific types of flaky tests. This work proposes a novel categorization framework, called FlaKat, which uses machine-learning classifiers to quickly and accurately predict the category of a given flaky test, where the category reflects the test's root cause. Sampling techniques are applied to address the imbalance between flaky test categories in the International Dataset of Flaky Tests (IDoFT). A new evaluation metric, called Flakiness Detection Capacity (FDC), is proposed for measuring classifier accuracy from an information-theoretic perspective, and its effectiveness is demonstrated. The final FDC results also agree with the F1 score on which classifier yields the best flakiness classification.
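The abstract names three ingredients: multi-class prediction of flaky-test categories, sampling to counter category imbalance in IDoFT, and an information-theoretic evaluation metric (FDC). The sketch below illustrates that kind of pipeline only under stated assumptions: the placeholder feature vectors, the SMOTE oversampler, the random-forest classifier, and the mutual-information formulation of an FDC-like score (by analogy with Gu et al.'s intrusion detection capability) are all illustrative choices, not FlaKat's actual design or the paper's exact definition of FDC.

```python
# Hypothetical sketch: oversample minority flaky-test categories, train a
# classifier, then report macro F1 and an FDC-like information-theoretic score.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, mutual_info_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 500 tests with 64-dimensional vector representations and
# 4 imbalanced flakiness categories (stand-ins for IDoFT labels).
X = rng.normal(size=(500, 64))
y = rng.choice(4, size=500, p=[0.70, 0.15, 0.10, 0.05])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Address category imbalance on the training split only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
y_pred = clf.predict(X_test)

# Macro F1 across categories.
print("macro F1:", f1_score(y_test, y_pred, average="macro"))

# Assumed FDC-like score: mutual information between true and predicted
# categories, normalized by the entropy of the true categories.
counts = np.array(list(Counter(y_test).values()), dtype=float)
p = counts / counts.sum()
h_true = -(p * np.log(p)).sum()                      # entropy of true labels (nats)
fdc = mutual_info_score(y_test, y_pred) / h_true     # in [0, 1]
print("FDC-like score:", fdc)
```

On real data, the feature vectors would come from a learned code representation rather than random noise, and the normalization makes the score comparable across datasets with different category distributions; again, whether this matches the paper's FDC definition is an assumption.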