Demystifying Faulty Code with LLM: Step-by-Step Reasoning for Explainable Fault Localization (2403.10507v1)

Published 15 Mar 2024 in cs.SE

Abstract: Fault localization is a critical process that involves identifying the specific program elements responsible for program failures. Manually pinpointing these elements, such as classes, methods, or statements, associated with a fault is laborious and time-consuming. To overcome this challenge, various fault localization tools have been developed. These tools typically generate a ranked list of suspicious program elements. However, this information alone is insufficient: a prior study emphasized that automated fault localization should also offer a rationale. In this study, we investigate step-by-step reasoning for explainable fault localization and explore the potential of large language models (LLMs) in assisting developers in reasoning about code. We propose FuseFL, which combines several sources of information to improve the LLM's results: spectrum-based fault localization results, test case execution outcomes, and a code description (i.e., an explanation of what the given code is intended to do). We conducted our investigation using faulty code from the Refactory dataset. First, we evaluate the performance of automated fault localization. Our results demonstrate a more than 30% increase in the number of successfully localized faults at Top-1 compared to the baseline. To evaluate the explanations generated by FuseFL, we create a dataset of human explanations that provide step-by-step reasoning as to why specific lines of code are considered faulty. This dataset consists of 324 faulty code files, along with explanations for 600 faulty lines. Furthermore, we conducted human studies to evaluate the explanations and found that FuseFL generated correct explanations for 22 out of 30 randomly sampled cases.
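To illustrate the kind of information FuseFL fuses, the sketch below is a minimal, hypothetical example: it computes Ochiai suspiciousness scores from test coverage (a standard spectrum-based fault localization formula), then assembles those scores together with test outcomes and a code description into a single LLM prompt. The helper names (`ochiai`, `build_prompt`) and the prompt wording are assumptions for illustration, not the paper's actual prompt template or implementation.

```python
import math


def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness for one line: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def build_prompt(code_lines, coverage, test_results, code_description):
    """Fuse SBFL scores, test outcomes, and a code description into one prompt.

    coverage: maps test name -> set of executed line numbers.
    test_results: maps test name -> "pass" or "fail".
    """
    total_failed = sum(1 for r in test_results.values() if r == "fail")
    scored = []
    for lineno, line in enumerate(code_lines, start=1):
        # ef/ep: number of failing/passing tests that executed this line.
        ef = sum(1 for t, lines in coverage.items()
                 if test_results[t] == "fail" and lineno in lines)
        ep = sum(1 for t, lines in coverage.items()
                 if test_results[t] == "pass" and lineno in lines)
        scored.append((lineno, line, ochiai(ef, ep, total_failed)))

    ranked = sorted(scored, key=lambda x: x[2], reverse=True)
    failing = ", ".join(t for t, r in test_results.items() if r == "fail")
    prompt = [
        "The following code is faulty.",
        f"Intended behaviour: {code_description}",
        f"Failing tests: {failing}",
        "Lines ranked by spectrum-based suspiciousness (Ochiai):",
    ]
    prompt += [f"  line {n} (score {s:.2f}): {src}" for n, src, s in ranked]
    prompt.append("Explain step by step which lines are faulty and why.")
    return "\n".join(prompt)


# Toy usage: a two-line function where line 2 has an off-by-one bug.
code = ["def count_up_to(n):", "    return list(range(1, n))"]
cov = {"test_small": {1, 2}, "test_zero": {1, 2}}
results = {"test_small": "fail", "test_zero": "pass"}
print(build_prompt(code, cov, results, "return the numbers from 1 to n inclusive"))
```

The resulting prompt would then be sent to an LLM, which is expected to reason step by step over the ranked lines and the intended behaviour before naming the faulty line, mirroring the explanation style the paper evaluates against its human-written dataset.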

Authors (5)
  1. Ratnadira Widyasari (18 papers)
  2. Jia Wei Ang (1 paper)
  3. Truong Giang Nguyen (5 papers)
  4. Neil Sharma (2 papers)
  5. David Lo (229 papers)
Citations (4)