
"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students (2402.01687v2)

Published 22 Jan 2024 in cs.CY, cs.HC, and cs.LG
"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Abstract: This study evaluates the effectiveness of various LLMs in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most effective for different tasks. Our research systematically assesses some of the publicly available LLMs such as Google Bard, ChatGPT (3.5), GitHub Copilot Chat, and Microsoft Copilot across diverse tasks commonly encountered by undergraduate computer science students in India. These tasks include code explanation and documentation, solving class assignments, technical interview preparation, learning new concepts and frameworks, and email writing. Evaluation of these tasks was carried out by pre-final year and final year undergraduate computer science students and provides insights into the models' strengths and limitations. This study aims to guide students as well as instructors in selecting suitable LLMs for any specific task and offers valuable insights on how LLMs can be used constructively by students and instructors.

Evaluating LLMs for Undergraduate Computer Science Tasks

Introduction

The use of LLMs in educational contexts, particularly in undergraduate computer science programs, has gained substantial attention. This paper sets out to compare and evaluate the effectiveness of several publicly available LLMs—Google Bard, ChatGPT, GitHub Copilot Chat, and Microsoft Copilot—in facilitating tasks commonly performed by computer science students. These tasks span a range of activities, including code generation, project ideation, exam preparation, and email composition. Given the rapid expansion of LLMs and their application potential, this research offers valuable guidance for students and educators in identifying the most suitable LLM for a specific educational task.

Methodology

The methodology combines quantitative and qualitative analysis of the four LLMs across tasks commonly encountered by computer science students. Pre-final and final year undergraduate students evaluated the models on:

  • Code Explanation and Documentation
  • Class Assignments across Programming, Theoretical, and Humanities contexts
  • Technical Interview Preparation
  • Learning New Concepts and Frameworks
  • Writing Emails

The LLMs were assessed based on their ability to provide clear, accurate, and helpful responses across these tasks, with performance rated on a scale from 1 to 10.
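The rating scheme described above can be sketched as a small aggregation: each evaluator scores each (LLM, task) pair from 1 to 10, and per-task means identify the strongest model. The scores and task keys below are invented for illustration and are not the paper's actual data.

```python
# Hypothetical sketch of the rating aggregation: per-task mean scores on a
# 1-10 scale, then the top-scoring LLM per task. All numbers are illustrative.
from statistics import mean

# ratings[task][llm] -> list of 1-10 scores from individual evaluators (invented)
ratings = {
    "code_explanation": {"Microsoft Copilot": [9, 8, 9], "ChatGPT": [7, 8, 7]},
    "email_writing":    {"Microsoft Copilot": [7, 6, 7], "ChatGPT": [9, 9, 8]},
}

def best_llm_per_task(ratings):
    """Return {task: (llm, mean_score)} using the highest mean rating."""
    result = {}
    for task, by_llm in ratings.items():
        top = max(by_llm, key=lambda name: mean(by_llm[name]))
        result[task] = (top, round(mean(by_llm[top]), 2))
    return result

print(best_llm_per_task(ratings))
```

With such hypothetical scores, different models win different tasks, mirroring the paper's headline finding that no single LLM dominates.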

Key Findings

The study found that no single LLM outperformed the others across all assessed tasks.

  • For code explanation and documentation, Microsoft Copilot excelled, indicating its robustness in dealing with a wide range of programming languages and presenting comprehensive code insights.
  • In class assignments, GitHub Copilot Chat led in programming assignments, leveraging its programming-centric design, whereas Microsoft Copilot was the frontrunner in both theoretical and humanities assignments, showcasing its versatility.
  • For technical interview preparation, both GitHub Copilot Chat and ChatGPT demonstrated high performance, suggesting that these models are particularly adept at solving algorithmic problems.
  • In aiding the learning of new concepts and frameworks, Google Bard emerged as the most effective, offering clear and insightful explanations that facilitate deeper understanding.
  • When it came to writing emails, ChatGPT was found to be superior, indicating its strength in generating contextually relevant and well-structured content.

Implications

This research underscores the diverse capabilities of current LLMs, suggesting that students and educators could benefit from choosing specific LLMs tailored to the needs of their tasks. It also highlights the importance of understanding the limitations and strengths of each LLM, advocating for a more informed selection process to optimize their utility in educational settings.

The findings further hint at the potential of LLMs to redefine the educational landscape, offering personalized assistance in learning new concepts, preparing for interviews, and handling assignments. However, the paper also cautions against over-reliance on these models, given their varying reliability across different tasks.

Future Directions

The rapidly evolving field of LLMs promises the introduction of more advanced models. Future work could extend this research to include upcoming LLMs, offering a dynamic and updated guide for their application in education. It also opens the floor for developing domain-specific LLMs, fine-tuned to meet the nuanced requirements of educational contexts, particularly in computer science education.

Conclusion

This paper presents a comprehensive evaluation of four major LLMs on tasks common to the undergraduate computer science curriculum. The varied performance across tasks underscores the necessity of selecting an LLM based on the specific needs of the task at hand. As LLM development continues to advance, this research provides a foundational understanding for leveraging their potential in educational settings, guiding both students and educators in their selection process.

Authors (4)
  1. Vibhor Agarwal
  2. Madhav Krishan Garg
  3. Sahiti Dharmavaram
  4. Dhruv Kumar