LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on Language Models (2304.01397v5)
Abstract: Test suites tend to grow as software evolves, often making it infeasible to execute all test cases within the allocated testing budget, especially for large software systems. Test suite minimization (TSM) improves the efficiency of software testing by removing redundant test cases, thus reducing testing time and resources while maintaining the fault detection capability of the test suite. Most existing TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box), such as ATM and FAST-R, have been proposed. To improve scalability, we propose LTM (LLM-based Test suite Minimization), a novel, scalable, black-box, similarity-based TSM approach built on large language models (LLMs), which is the first application of LLMs in the context of TSM. To support similarity measurement over test code embeddings, we investigate five pre-trained LLMs (CodeBERT, GraphCodeBERT, UniXcoder, StarEncoder, and CodeLlama), on whose embeddings we compute two similarity measures: Cosine Similarity and Euclidean Distance. Our goal is to find similarity measures that are not only computationally more efficient but can also better guide a Genetic Algorithm (GA) in its search for optimal minimized test suites, thus reducing the overall search time. Experimental results show that the best configuration of LTM (UniXcoder/Cosine) outperforms ATM in three aspects: (a) achieving a slightly greater saving rate of testing time (41.72% versus 41.02%, on average); (b) attaining a significantly higher fault detection rate (0.84 versus 0.81, on average); and, most importantly, (c) minimizing test suites nearly five times faster on average, with higher gains for larger test suites and systems, thus achieving much higher scalability.
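The two similarity measures named in the abstract can be illustrated with a minimal sketch. It assumes each test case has already been mapped to a fixed-length embedding vector by one of the language models; the toy vectors, the function names, and the pairwise-matrix helper are hypothetical and are not taken from the LTM implementation.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Vector, v: Vector) -> float:
    """Straight-line distance between u and v (0.0 means identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(embeddings: List[Vector],
             measure: Callable[[Vector, Vector], float]) -> List[List[float]]:
    """Apply `measure` to every pair of test-case embeddings."""
    n = len(embeddings)
    return [[measure(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical 3-dimensional embeddings for three test cases.
# Real model embeddings have hundreds or thousands of dimensions.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as the first, different magnitude
    [0.0, 3.0, 4.0],   # orthogonal to the first two
]

cosine_matrix = pairwise(toy_embeddings, cosine_similarity)
distance_matrix = pairwise(toy_embeddings, euclidean_distance)
```

Note that the two measures point in opposite directions: a higher cosine value means more similar test cases, while a higher Euclidean distance means less similar ones. The first two toy vectors show a further difference, since their cosine similarity is maximal even though their Euclidean distance is nonzero.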