
Code Review Automation: Strengths and Weaknesses of the State of the Art (2401.05136v1)

Published 10 Jan 2024 in cs.SE

Abstract: The automation of code review has been tackled by several researchers with the goal of reducing its cost. The adoption of deep learning in software engineering has pushed this automation to new boundaries, with techniques imitating developers in generative tasks, such as commenting on a code change as a reviewer would or addressing a reviewer's comment by modifying code. The performance of these techniques is usually assessed through quantitative metrics, e.g., the percentage of instances in the test set for which correct predictions are generated, leaving many open questions about the techniques' capabilities. For example, knowing that an approach can correctly address a reviewer's comment in 10% of cases is of little value without knowing what the reviewer asked for: What if, in all successful cases, the code change required to address the comment was just the removal of an empty line? In this paper we aim to characterize the cases in which three code review automation techniques tend to succeed or fail on the two tasks described above. The study has a strong qualitative focus, with ~105 man-hours of manual inspection invested in analyzing the correct and wrong predictions generated by the three techniques, for a total of 2,291 inspected predictions. The output of this analysis is two taxonomies reporting, for each of the two tasks, the types of code changes on which the experimented techniques tend to succeed or fail, pointing to areas for future work. Our manual analysis also identified several issues in the datasets used to train and test the experimented techniques. Finally, we assess the importance of research on techniques specialized for code review automation by comparing their performance with that of ChatGPT, a general-purpose large language model, finding that ChatGPT struggles to comment on code as a human reviewer would.
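As a concrete illustration of the quantitative evaluation the abstract questions, the minimal sketch below computes exact-match accuracy over a test set of code revisions generated in response to reviewer comments. The field names (`reviewer_comment`, `expected_code`, `predicted_code`) and the one-instance dataset are illustrative assumptions, not the paper's actual data format or pipeline; the point it demonstrates is that a trivially easy instance, where addressing the comment only requires deleting a blank line, counts exactly as much as a complex change, which is the gap the paper's qualitative analysis targets.

```python
# Minimal sketch of the "percentage of correct predictions" metric the
# abstract mentions. Field names and data are illustrative assumptions,
# not the evaluation pipeline used in the paper.

def exact_match_accuracy(test_set) -> float:
    """Share of test instances whose predicted revision equals the target."""
    hits = sum(ex["predicted_code"] == ex["expected_code"] for ex in test_set)
    return 100.0 * hits / len(test_set)

# A "correct" prediction can be trivial: here addressing the reviewer's
# comment only required deleting a blank line, yet it contributes to the
# metric exactly as much as a complex refactoring would.
test_set = [
    {
        "reviewer_comment": "Remove the blank line before the return.",
        "original_code":  "def f(x):\n\n    return x + 1\n",
        "expected_code":  "def f(x):\n    return x + 1\n",
        "predicted_code": "def f(x):\n    return x + 1\n",
    },
]

print(f"Exact match: {exact_match_accuracy(test_set):.1f}%")  # 100.0%
```

The metric alone reports 100% here; only inspecting what the successful instances actually contain, as the paper's manual analysis does, reveals whether the technique handled anything harder than whitespace edits.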

Authors (5)
  1. Rosalia Tufano (15 papers)
  2. Ozren Dabić (5 papers)
  3. Antonio Mastropaolo (25 papers)
  4. Matteo Ciniselli (11 papers)
  5. Gabriele Bavota (60 papers)
Citations (9)
