A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions (2405.16081v1)

Published 25 May 2024 in cs.SE and cs.HC

Abstract: The increasing use of LLM-powered code generation tools, such as GitHub Copilot, is transforming software engineering practices. This paper investigates how developers validate and repair code generated by Copilot and examines the impact of code provenance awareness during these processes. We conducted a lab study with 28 participants, who were tasked with validating and repairing Copilot-generated code in three software projects. Participants were randomly divided into two groups: one informed about the provenance of LLM-generated code and the other not. We collected data on IDE interactions, eye-tracking, cognitive workload assessments, and conducted semi-structured interviews. Our results indicate that, without explicit information, developers often fail to identify the LLM origin of the code. Developers generally employ similar validation and repair strategies for LLM-generated code, but exhibit behaviors such as frequent switching between code and comments, different attentional focus, and a tendency to delete and rewrite code. Being aware of the code's provenance led to improved performance, increased search efforts, more frequent Copilot usage, and higher cognitive workload. These findings enhance our understanding of how developers interact with LLM-generated code and carry implications for designing tools that facilitate effective human-LLM collaboration in software development.
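The abstract describes a between-subjects design in which gaze data, IDE actions, and workload ratings were compared across the provenance-aware and unaware groups. As a minimal illustrative sketch only, not the authors' actual analysis pipeline, the Python snippet below shows how per-fixation eye-tracking logs might be aggregated into mean fixation durations per area of interest (e.g., code vs. comments) for each group; the file name and column names (fixations.csv, participant, group, aoi, duration_ms) are hypothetical.

```python
import csv
from collections import defaultdict
from statistics import mean

def mean_fixation_by_group(path="fixations.csv"):
    """Aggregate per-fixation logs into mean fixation duration (ms)
    per (group, area-of-interest) pair. The input schema is hypothetical:
    columns participant, group ('aware'/'unaware'), aoi ('code'/'comment'),
    and duration_ms, one row per recorded fixation."""
    durations = defaultdict(list)  # (group, aoi) -> list of durations in ms
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            durations[(row["group"], row["aoi"])].append(float(row["duration_ms"]))
    return {key: mean(vals) for key, vals in durations.items()}

if __name__ == "__main__":
    for (group, aoi), avg in sorted(mean_fixation_by_group().items()):
        print(f"{group:8s} {aoi:8s} mean fixation = {avg:.1f} ms")
```

An aggregation like this would typically be followed by a statistical comparison between the two groups; the study's actual metrics and tests are detailed in the full paper.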

Authors (7)
  1. Ningzhi Tang (5 papers)
  2. Meng Chen (98 papers)
  3. Zheng Ning (13 papers)
  4. Aakash Bansal (22 papers)
  5. Yu Huang (176 papers)
  6. Collin McMillan (38 papers)
  7. Toby Jia-Jun Li (57 papers)