GitBug-Java: A Reproducible Benchmark of Recent Java Bugs (2402.02961v2)
Abstract: Bug-fix benchmarks are essential for evaluating methodologies in automatic program repair (APR) and fault localization (FL). However, existing benchmarks, exemplified by Defects4J, need to evolve to incorporate recent bug-fixes aligned with contemporary development practices. Moreover, reproducibility, a key scientific principle, has been lacking in bug-fix benchmarks. To address these gaps, we present GitBug-Java, a reproducible benchmark of recent Java bugs. GitBug-Java features 199 bugs extracted from the 2023 commit history of 55 notable open-source repositories. The methodology for building GitBug-Java ensures the preservation of bug-fixes in fully-reproducible environments. We publish GitBug-Java at https://github.com/gitbugactions/gitbug-java.
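To make the reproducibility claim concrete, below is a minimal sketch of how a consumer typically exercises one entry of a bug-fix benchmark such as GitBug-Java: check out the buggy revision, confirm the test suite fails, then check out the fixed revision and confirm it passes. This is illustrative only; the `BugEntry` fields, helper names, and test command are hypothetical, not the GitBug-Java API.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class BugEntry:
    """Hypothetical benchmark entry; a real benchmark stores similar per-bug metadata."""
    repo_url: str        # e.g. "https://github.com/some-org/some-project" (hypothetical)
    buggy_commit: str    # revision at which at least one test fails
    fixed_commit: str    # revision at which the human-written fix makes the suite pass
    test_cmd: list[str]  # e.g. ["mvn", "test"]

def tests_pass(workdir: str, cmd: list[str]) -> bool:
    """Run the project's test command and report whether it exits successfully."""
    return subprocess.run(cmd, cwd=workdir).returncode == 0

def reproduce(entry: BugEntry, workdir: str) -> None:
    """Verify the fail-then-pass contract that makes a bug-fix pair usable for APR/FL."""
    subprocess.run(["git", "clone", entry.repo_url, workdir], check=True)
    # Buggy revision: the benchmark guarantees at least one failing test here.
    subprocess.run(["git", "checkout", entry.buggy_commit], cwd=workdir, check=True)
    assert not tests_pass(workdir, entry.test_cmd), "expected failing tests at buggy commit"
    # Fixed revision: the same suite is expected to pass.
    subprocess.run(["git", "checkout", entry.fixed_commit], cwd=workdir, check=True)
    assert tests_pass(workdir, entry.test_cmd), "expected passing tests at fixed commit"
```

GitBug-Java goes further than this host-dependent sketch: per the GitBug-Actions methodology cited below, each bug-fix is preserved together with its CI execution environment, so the fail/pass behavior does not depend on the toolchain installed on the reproducing machine.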
- Minecraft: Automated Mining of Software Bug Fixes with Precise Code Context. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering.
- Samuel Benton, Ali Ghanbari, and Lingming Zhang. 2019. Defexts: A curated dataset of reproducible real-world bugs for modern JVM languages. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 47–50.
- Viktor Csuvik and László Vidács. 2022. FixJS: a dataset of bug-fixing JavaScript commits. In Proceedings of the 19th International Conference on Mining Software Repositories. 712–716.
- Thomas Durieux and Rui Abreu. 2019. Critical review of BugSwarm for fault localization and program repair. arXiv preprint arXiv:1905.09375 (2019).
- Péter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, Árpád Beszédes, Rudolf Ferenc, and Ali Mesbah. 2019. BugsJS: A benchmark of JavaScript bugs. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 90–101.
- Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. arXiv preprint arXiv:2305.10160 (2023).
- Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1430–1442. https://doi.org/10.1109/ICSE48619.2023.00125
- René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. 437–440.
- The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications. arXiv preprint arXiv:2310.13229 (2023).
- Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. 55–56.
- Ziyi Lin, Darko Marinov, Hao Zhong, Yuting Chen, and Jianjun Zhao. 2015. JaConTeBe: A benchmark suite of real-world Java concurrency bugs (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 178–189.
- Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 643–653.
- Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An extensible Java bug benchmark for automatic program repair studies. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 468–478.
- Amirabbas Majd, Mojtaba Vahidi-Asl, Alireza Khalilian, Ahmad Baraani-Dastjerdi, and Bahman Zamani. 2019. Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques. Journal of Computer Languages 53 (2019), 38–52.
- Julian Aron Prenner and Romain Robbes. 2023. RunBugRun: An Executable Dataset for Automated Program Repair. arXiv preprint arXiv:2304.01102 (2023).
- Nuno Saavedra, André Silva, and Martin Monperrus. 2024. GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions. In 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE.
- Ripon K. Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R. Prasad. 2018. Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories. 10–13.
- André Silva, Matias Martinez, Benjamin Danglot, Davide Ginelli, and Martin Monperrus. 2021. Flacoco: Fault localization for Java based on industry-grade coverage. arXiv preprint arXiv:2111.12513 (2021).
- Susan Elliott Sim, Steve Easterbrook, and Richard C. Holt. 2003. Using benchmarking to advance research: A challenge to software engineering. In 25th International Conference on Software Engineering, 2003. Proceedings. IEEE, 74–83.
- DebugBench: Evaluating Debugging Capability of Large Language Models. arXiv preprint arXiv:2401.04621 (2024).
- David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: Mining and continuously growing a dataset of reproducible failures and fixes. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 339–349.
- Validity concerns in software engineering research. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research. 411–414.
- A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair. arXiv preprint arXiv:2310.08879 (2023).
- Hao-Nan Zhu and Cindy Rubio-González. 2023. On the reproducibility of software defect datasets. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE.