GitBug-Java: A Reproducible Benchmark of Recent Java Bugs (2402.02961v2)

Published 5 Feb 2024 in cs.SE

Abstract: Bug-fix benchmarks are essential for evaluating methodologies in automatic program repair (APR) and fault localization (FL). However, existing benchmarks, exemplified by Defects4J, need to evolve to incorporate recent bug-fixes aligned with contemporary development practices. Moreover, reproducibility, a key scientific principle, has been lacking in bug-fix benchmarks. To address these gaps, we present GitBug-Java, a reproducible benchmark of recent Java bugs. GitBug-Java features 199 bugs extracted from the 2023 commit history of 55 notable open-source repositories. The methodology for building GitBug-Java ensures the preservation of bug-fixes in fully-reproducible environments. We publish GitBug-Java at https://github.com/gitbugactions/gitbug-java.

References (24)
  1. Minecraft: Automated Mining of Software Bug Fixes with Precise Code Context. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering.
  2. Defexts: A curated dataset of reproducible real-world bugs for modern JVM languages. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 47–50.
  3. Viktor Csuvik and László Vidács. 2022. FixJS: a dataset of bug-fixing JavaScript commits. In Proceedings of the 19th International Conference on Mining Software Repositories. 712–716.
  4. Thomas Durieux and Rui Abreu. 2019. Critical review of bugswarm for fault localization and program repair. arXiv preprint arXiv:1905.09375 (2019).
  5. BugsJS: A benchmark of JavaScript bugs. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 90–101.
  6. Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks. arXiv preprint arXiv:2305.10160 (2023).
  7. Impact of Code Language Models on Automated Program Repair. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 1430–1442. https://doi.org/10.1109/ICSE48619.2023.00125
  8. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis. 437–440.
  9. The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications. arXiv preprint arXiv:2310.13229 (2023).
  10. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56.
  11. JaConTeBe: A benchmark suite of real-world Java concurrency bugs (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 178–189.
  12. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering. 643–653.
  13. Bears: An extensible java bug benchmark for automatic program repair studies. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 468–478.
  14. Code4Bench: A multidimensional benchmark of Codeforces data for different program analysis techniques. Journal of Computer Languages 53 (2019), 38–52.
  15. Julian Aron Prenner and Romain Robbes. 2023. RunBugRun–An Executable Dataset for Automated Program Repair. arXiv preprint arXiv:2304.01102 (2023).
  16. GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions. In 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE.
  17. Bugs.jar: A large-scale, diverse dataset of real-world Java bugs. In Proceedings of the 15th International Conference on Mining Software Repositories. 10–13.
  18. Flacoco: Fault localization for java based on industry-grade coverage. arXiv preprint arXiv:2111.12513 (2021).
  19. Using benchmarking to advance research: A challenge to software engineering. In 25th International Conference on Software Engineering, 2003. Proceedings. IEEE, 74–83.
  20. DebugBench: Evaluating Debugging Capability of Large Language Models. arXiv preprint arXiv:2401.04621 (2024).
  21. Bugswarm: Mining and continuously growing a dataset of reproducible failures and fixes. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 339–349.
  22. Validity concerns in software engineering research. In Proceedings of the FSE/SDP workshop on Future of software engineering research. 411–414.
  23. A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair. arXiv:2310.08879 [cs.SE]
  24. Hao-Nan Zhu and Cindy Rubio-González. 2023. On the reproducibility of software defect datasets. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE.
