A Generic Approach to Fix Test Flakiness in Real-World Projects (2404.09398v1)
Abstract: Test flakiness, non-deterministic build behavior unrelated to code changes, is a major and continuing impediment to delivering reliable software. The very few techniques for the automated repair of test flakiness are specifically crafted to repair either Order-Dependent (OD) or Implementation-Dependent (ID) flakiness. They are also all symbolic approaches, i.e., they leverage program analysis to detect and repair known flakiness patterns and root causes, and therefore fail to generalize. To bridge the gap, we propose FlakyDoctor, a neuro-symbolic technique that combines the power of LLMs (generalizability) and program analysis (soundness) to fix different types of test flakiness. Our extensive evaluation using 873 confirmed flaky tests (332 OD and 541 ID) from 243 real-world projects demonstrates the ability of FlakyDoctor to repair flakiness, achieving success rates of 57% (OD) and 59% (ID). Compared to three alternative flakiness repair approaches, FlakyDoctor can repair 8% more ID tests than DexFix, 12% more OD flaky tests than ODRepair, and 17% more OD flaky tests than iFixFlakies. Regardless of the underlying LLM, the non-LLM components of FlakyDoctor contribute 12-31% of the overall performance, i.e., while part of FlakyDoctor's power comes from using LLMs, LLMs alone are not good enough to repair flaky tests in real-world projects. What makes the proposed technique superior to related research on test flakiness mitigation specifically, and on program repair in general, is that it repairs 79 previously unfixed flaky tests in real-world projects. We opened pull requests with the corresponding patches for all of these cases; 19 of them were accepted and merged at the time of submission.