An Empirical Study on Code Review Activity Prediction and Its Impact in Practice (2404.10703v2)
Abstract: During code reviews, an essential step in software quality assurance, reviewers face the difficult task of understanding and evaluating code changes to validate their quality and prevent the introduction of faults into the codebase. This is a tedious process in which the required effort depends heavily on the submitted code, as well as on the author's and the reviewer's experience, leading to median wait times for review feedback of 15-64 hours. Through an initial user study carried out with 29 experts, we found that re-ordering the files changed by a patch within the review environment has the potential to improve review quality: more comments are written (+23%), and participants' file-level hot-spot precision and recall increase to 53% (+13%) and 28% (+8%), respectively, compared to alphanumeric ordering. Hence, this paper aims to help code reviewers by predicting which files in a submitted patch (1) will be commented on, (2) will be revised, or (3) are hot-spots (commented on or revised). For these prediction tasks, we evaluate two types of text embeddings (i.e., Bag-of-Words and LLM encodings) and review process features (i.e., code size-based and history-based features). Our empirical study on three open-source and two industrial datasets shows that combining the code embeddings with the review process features leads to better results than the state-of-the-art approach. For all tasks, F1-scores (median of 40-62%) are significantly better than the state of the art (from +1 to +9%).
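The abstract describes combining a textual embedding of each changed file with review process features before classification. As a minimal sketch (not the authors' implementation), the toy diffs, the process-feature values, and the labels below are all hypothetical; it merely illustrates concatenating a Bag-of-Words encoding with size- and history-based features to train a hot-spot classifier:

```python
# Hedged sketch: Bag-of-Words text embedding + review process features.
# All data below is invented for illustration; the paper's actual features,
# datasets, and models may differ.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy file diffs (text) and hypothetical process features:
# [lines_changed, past_revision_count]
diffs = [
    "def parse(x): return int(x)",
    "fix null check in request handler",
    "update readme typo",
    "refactor parser loop bounds",
]
process = np.array([[120, 5], [40, 2], [2, 0], [85, 4]], dtype=float)
labels = [1, 1, 0, 1]  # 1 = file was a hot-spot (commented/revised), 0 = not

# Bag-of-Words encoding of the textual change
bow = CountVectorizer().fit_transform(diffs).toarray()

# Concatenate the text embedding with the process features
X = np.hstack([bow, process])

# Any standard classifier can consume the combined feature vector
clf = LogisticRegression(max_iter=1000).fit(X, labels)
preds = clf.predict(X).tolist()
```

An LLM-based variant would replace the `CountVectorizer` step with sentence embeddings of the diff text while keeping the same concatenation scheme.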
- Doriane Olewicki
- Sarra Habchi
- Bram Adams