How Far Are We? The Triumphs and Trials of Generative AI in Learning Software Engineering (2312.11719v1)
Abstract: Conversational Generative AI (convo-genAI) is revolutionizing Software Engineering (SE) as engineers and academics embrace this technology in their work. However, there is a gap in understanding the current potential and pitfalls of this technology, specifically in supporting students in SE tasks. In this work, we evaluate through a between-subjects study (N=22) the effectiveness of ChatGPT, a convo-genAI platform, in assisting students in SE tasks. Our study did not find statistically significant differences in participants' productivity or self-efficacy when using ChatGPT as compared to traditional resources, but we found significantly increased frustration levels. Our study also revealed 5 distinct faults arising from violations of Human-AI interaction guidelines, which led to 7 different (negative) consequences for participants.
Authors: Rudrajit Choudhuri, Dylan Liu, Igor Steinmacher, Marco Gerosa, Anita Sarma