Evaluating LLMs for Undergraduate Computer Science Tasks
Introduction
The use of large language models (LLMs) in education, particularly in undergraduate computer science programs, has gained substantial attention. This paper compares the effectiveness of four publicly available LLMs (Google Bard, ChatGPT, GitHub Copilot Chat, and Microsoft Copilot) on tasks commonly performed by computer science students. These tasks include code generation, project ideation, exam preparation, and email composition. Given the rapid expansion of LLMs and their applications, the research aims to help students and educators identify the most suitable LLM for a specific educational task.
Methodology
The methodology combines quantitative and qualitative analysis of the four LLMs across tasks that computer science students commonly encounter. Junior and senior computer science students evaluated the models on the following tasks:
- Code Explanation and Documentation
- Class Assignments across Programming, Theoretical, and Humanities contexts
- Technical Interview Preparation
- Learning New Concepts and Frameworks
- Writing Emails
The LLMs were assessed based on their ability to provide clear, accurate, and helpful responses across these tasks, with performance rated on a scale from 1 to 10.
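The rating scheme above can be illustrated with a minimal sketch that averages 1-to-10 scores per task and model and picks the top-rated model for each task. The paper does not describe its aggregation procedure, so taking the mean is an assumption, and the model names and scores below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean

# Hypothetical placeholder ratings on the 1-10 scale, NOT data from the paper.
# Keys are (task, model); values are the scores given by individual raters.
ratings = {
    ("Writing Emails", "Model A"): [7, 8, 9],
    ("Writing Emails", "Model B"): [6, 6, 9],
    ("Technical Interview Preparation", "Model A"): [5, 6, 7],
    ("Technical Interview Preparation", "Model B"): [8, 9, 7],
}


def mean_scores(ratings):
    """Average the raters' scores for each (task, model) pair."""
    for key, scores in ratings.items():
        # Reject scores outside the 1-10 scale described in the text.
        if any(not 1 <= s <= 10 for s in scores):
            raise ValueError(f"score out of range for {key}")
    return {key: mean(scores) for key, scores in ratings.items()}


def best_per_task(ratings):
    """Return {task: (model, mean_score)} for the highest-rated model per task."""
    best = {}
    for (task, model), avg in mean_scores(ratings).items():
        # Keep the model with the highest mean seen so far for this task.
        if task not in best or avg > best[task][1]:
            best[task] = (model, avg)
    return best


for task, (model, avg) in best_per_task(ratings).items():
    print(f"{task}: {model} (mean {avg:.1f})")
```

A per-task maximum, rather than a single overall average, matches the paper's framing that models are compared task by task.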
Key Findings
The paper found that no single LLM outperformed the others across all assessed tasks.
- For code explanation and documentation, Microsoft Copilot excelled, indicating its robustness in dealing with a wide range of programming languages and presenting comprehensive code insights.
- In class assignments, GitHub Copilot Chat led in programming assignments, leveraging its programming-centric design, whereas Microsoft Copilot was the frontrunner in both theoretical and humanities assignments, showcasing its versatility.
- For technical interview preparation, both GitHub Copilot Chat and ChatGPT demonstrated high performance, suggesting that these models are particularly adept at solving algorithmic problems.
- In aiding the learning of new concepts and frameworks, Google Bard emerged as the most effective, offering clear and insightful explanations that facilitate deeper understanding.
- When it came to writing emails, ChatGPT was found to be superior, indicating its strength in generating contextually relevant and well-structured content.
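The findings listed above can be condensed into a small lookup table. The sketch below only transcribes the top-rated model or models per task as reported in this summary; the task labels and the `recommend` helper are illustrative names, not something the paper provides.

```python
# Top-rated model(s) per task, transcribed from the key findings above.
# Task labels are illustrative; the paper reports ties as more than one model.
BEST_BY_TASK = {
    "code explanation and documentation": ["Microsoft Copilot"],
    "programming assignments": ["GitHub Copilot Chat"],
    "theoretical assignments": ["Microsoft Copilot"],
    "humanities assignments": ["Microsoft Copilot"],
    "technical interview preparation": ["GitHub Copilot Chat", "ChatGPT"],
    "learning new concepts and frameworks": ["Google Bard"],
    "writing emails": ["ChatGPT"],
}


def recommend(task):
    """Return the top-rated model(s) for a task (hypothetical helper)."""
    key = task.strip().lower()
    if key not in BEST_BY_TASK:
        raise KeyError(f"no finding recorded for task: {task!r}")
    return BEST_BY_TASK[key]


print(recommend("Writing Emails"))
print(recommend("Technical Interview Preparation"))
```

Because these rankings reflect the models as evaluated in the paper, a table like this would need to be revisited as the models change.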
Implications
This research underscores the diverse capabilities of current LLMs, suggesting that students and educators could benefit from choosing specific LLMs tailored to the needs of their tasks. It also highlights the importance of understanding the limitations and strengths of each LLM, advocating for a more informed selection process to optimize their utility in educational settings.
The findings further hint at the potential of LLMs to redefine the educational landscape, offering personalized assistance in learning new concepts, preparing for interviews, and handling assignments. However, the paper also cautions against over-reliance on these models, given their varying reliability across different tasks.
Future Directions
The LLM field is evolving rapidly, and more advanced models are likely to follow. Future work could extend this research to newer LLMs, keeping the guidance on their use in education up to date. It also opens the door to developing domain-specific LLMs fine-tuned to the requirements of educational contexts, particularly computer science education.
Conclusion
This paper evaluates the performance of four major LLMs on tasks common to the undergraduate computer science curriculum. The varied performance across tasks underscores the need to select an LLM based on the specific task at hand. As LLM development continues to advance, this research provides a foundation for using these models in educational settings and guides both students and educators in their selection.