GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions (2310.15642v3)

Published 24 Oct 2023 in cs.SE

Abstract: Bug-fix benchmarks are fundamental in advancing various sub-fields of software engineering such as automatic program repair (APR) and fault localization (FL). A good benchmark must include recent examples that accurately reflect the technologies and development practices of today. To be executable in the long term, a benchmark must feature test suites that do not degrade over time due to, for example, dependencies that are no longer available. Existing benchmarks fail to meet both criteria. For instance, Defects4J, one of the foremost Java benchmarks, last received an update in 2020. Moreover, full reproducibility has been neglected by the majority of existing benchmarks. In this paper, we present GitBug-Actions: a novel tool for building bug-fix benchmarks with modern and fully reproducible bug-fixes. GitBug-Actions relies on the most popular CI platform, GitHub Actions, to detect bug-fixes and locally execute the CI pipeline in a controlled and reproducible environment. To the best of our knowledge, we are the first to rely on GitHub Actions to collect bug-fixes. To demonstrate our toolchain, we deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark containing executable, fully reproducible bug-fixes from different repositories. A video demonstrating GitBug-Actions is available at: https://youtu.be/aBWwa1sJYBs.
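To illustrate the core idea of mining bug-fixes from CI outcomes, the following is a minimal Python sketch, not the actual GitBug-Actions implementation: it queries the public GitHub REST API for a repository's Actions runs and flags a failed run followed by a successful run on the same branch as a candidate (buggy commit, fixing commit) pair. The owner/repo names in the usage stub are placeholders, and the real toolchain additionally re-executes the CI pipeline locally in a controlled environment, which this sketch omits.

```python
"""Illustrative sketch: find candidate bug-fix commits from GitHub Actions
run conclusions. This approximates the detection step only; it is NOT the
GitBug-Actions pipeline itself."""
import requests

RUNS_URL = "https://api.github.com/repos/{owner}/{repo}/actions/runs"


def candidate_bug_fixes(owner: str, repo: str, token: str | None = None):
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(RUNS_URL.format(owner=owner, repo=repo),
                        params={"per_page": 100}, headers=headers)
    resp.raise_for_status()
    runs = resp.json()["workflow_runs"]

    # Group runs per branch and order them chronologically.
    by_branch: dict[str, list[dict]] = {}
    for run in runs:
        by_branch.setdefault(run["head_branch"], []).append(run)

    # A failed run immediately followed by a successful run on the same
    # branch yields a candidate (buggy commit, fixing commit) pair.
    pairs = []
    for branch_runs in by_branch.values():
        branch_runs.sort(key=lambda r: r["created_at"])
        for prev, curr in zip(branch_runs, branch_runs[1:]):
            if prev["conclusion"] == "failure" and curr["conclusion"] == "success":
                pairs.append((prev["head_sha"], curr["head_sha"]))
    return pairs


if __name__ == "__main__":
    # "some-org/some-repo" is a placeholder; substitute a real repository.
    for buggy, fixed in candidate_bug_fixes("some-org", "some-repo"):
        print(f"buggy: {buggy[:10]}  fixed: {fixed[:10]}")
```

Such candidate pairs would still need to be validated by actually running the project's test suite in a reproducible (e.g., containerized) environment, which is the part of the toolchain this sketch does not cover.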
