Tests4Py: A Benchmark for System Testing (2307.05147v2)
Published 11 Jul 2023 in cs.SE
Abstract: Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.