Do Automatic Test Generation Tools Generate Flaky Tests? (2310.05223v1)
Abstract: Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6 356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms implemented in EvoSuite are effective in alleviating this issue (71.7 % fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently: their non-deterministic behavior is more frequently caused by randomness than by networking or concurrency. With flakiness suppression enabled, the remaining flaky tests differ significantly from any flakiness previously reported: most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help maintainers improve test generation tools, provide recommendations for developers using these tools, and serve as a foundation for future research on test flakiness and test generation.
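The rerun-based detection described in the abstract (execute each test repeatedly and look for inconsistent outcomes) can be illustrated with a minimal Python sketch. This is not the authors' tooling. The names `run_once`, `classify`, `stable_test`, and `make_random_test` are hypothetical, a test is modeled as a plain callable that raises on failure, and only the default of 200 reruns is taken from the abstract.

```python
import random
from typing import Callable


def run_once(test: Callable[[], None]) -> str:
    """Run a test callable once and map the result to 'pass' or 'fail'."""
    try:
        test()
        return "pass"
    except Exception:
        # Any exception, including AssertionError, counts as a failure.
        return "fail"


def classify(test: Callable[[], None], reruns: int = 200) -> str:
    """Rerun a test and report 'flaky' if its outcomes are inconsistent.

    The default of 200 reruns mirrors the number stated in the abstract;
    everything else here is an illustrative assumption.
    """
    outcomes = {run_once(test) for _ in range(reruns)}
    if len(outcomes) > 1:
        return "flaky"
    # Only one distinct outcome was observed across all reruns.
    return outcomes.pop()


def stable_test() -> None:
    """A deterministic test that always passes."""
    assert sum([1, 2, 3]) == 6


def make_random_test(seed: int) -> Callable[[], None]:
    """Build a hypothetical test whose outcome depends on random values."""
    rng = random.Random(seed)

    def test() -> None:
        # Passes only when the drawn value is below 0.5.
        assert rng.random() < 0.5

    return test


if __name__ == "__main__":
    print(classify(stable_test))
    print(classify(make_random_test(0)))
```

A test that shows a single outcome across all reruns may still be flaky with a low failure rate, so a result of `pass` or `fail` from such a sketch is only evidence of stability, not proof.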
- 2023. Do Automatic Test Generation Tools Generate Flaky Tests? [Dataset]. https://doi.org/10.6084/m9.figshare.22344706