Evaluation of the Programming Skills of Large Language Models (2405.14388v1)

Published 23 May 2024 in cs.SE, cs.CL, and cs.CR

Abstract: The advent of Large Language Models (LLMs) has revolutionized the efficiency and speed with which tasks are completed, marking a significant leap in productivity through technological innovation. As these chatbots tackle increasingly complex tasks, the challenge of assessing the quality of their outputs has become paramount. This paper critically examines the output quality of two leading LLMs, OpenAI's ChatGPT and Google's Gemini AI, by comparing the quality of programming code generated in both their free versions. Through the lens of a real-world example coupled with a systematic dataset, we investigate the code quality produced by these LLMs. Given their notable proficiency in code generation, this aspect of chatbot capability presents a particularly compelling area for analysis. Furthermore, the complexity of programming code often escalates to levels where its verification becomes a formidable task, underscoring the importance of our study. This research aims to shed light on the efficacy and reliability of LLMs in generating high-quality programming code, an endeavor that has significant implications for the field of software development and beyond.

Evaluating Programming Skills of LLMs

Introduction

With the rise of AI chatbots, the programming world is experiencing rapid shifts in how code is generated and written. This paper takes a close look at two of the leading AI chatbots, OpenAI's ChatGPT and Google's Gemini, particularly their ability to produce high-quality programming code. The authors compare the free versions of these AI models using standardized datasets and real-world scenarios to assess how well they perform in generating functional and efficient code.

Methodology

To evaluate these AI chatbots, the researchers follow a systematic approach:

Datasets

The paper employs two well-recognized datasets: HumanEval and ClassEval. These datasets include a variety of coding tasks designed to test the chatbots' ability to generate both basic and complex code structures; an illustrative task record is sketched after the list below.

  • HumanEval: Focuses on algorithmic solutions requiring functional code.
  • ClassEval: Targets object-oriented programming challenges, assessing the creation of class structures and more interconnected methods.
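
For readers unfamiliar with these benchmarks, the sketch below shows the rough shape of a single task. The field names follow the publicly released HumanEval JSONL schema (task_id, prompt, test, entry_point); the prompt and test shown here are simplified for illustration rather than copied verbatim from the dataset.

```python
# Illustrative HumanEval-style task record (field names follow the public
# HumanEval JSONL schema; the prompt and test are simplified, not verbatim).
task = {
    "task_id": "HumanEval/0",
    "prompt": (
        "def has_close_elements(numbers, threshold):\n"
        "    \"\"\"Return True if any two numbers in the list are closer\n"
        "    to each other than the given threshold.\"\"\"\n"
    ),
    "entry_point": "has_close_elements",
    # The hidden test defines check(candidate) and asserts expected behaviour.
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1.0, 2.0, 3.0], 0.5) is False\n"
        "    assert candidate([1.0, 2.8, 3.0], 0.3) is True\n"
    ),
}
```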

Data Collection Process

  1. Query: Each task from the datasets is presented as a detailed prompt.
  2. Execution of Queries: Prompts are submitted multiple times to ChatGPT and Gemini to account for variability.
  3. Result Aggregation: Responses are analyzed for consistency and performance, with results averaged to derive a representative metric (a sketch of this collection loop follows the list).
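
The authors do not publish their harness, so the following is only a minimal sketch of such a collection loop. The names query_chatbot and passes_tests are hypothetical stand-ins for querying ChatGPT/Gemini and for the compilation and semantic checks described under Evaluation, and the number of repetitions per prompt is likewise assumed.

```python
from statistics import mean

N_RUNS = 5  # assumed repetitions per prompt; the paper's exact count may differ

def evaluate_dataset(tasks, query_chatbot, passes_tests):
    """Submit each task prompt several times and average the outcomes."""
    per_task_rates = {}
    for task in tasks:
        outcomes = []
        for _ in range(N_RUNS):
            generated = query_chatbot(task["prompt"])       # steps 1-2: query, repeated execution
            outcomes.append(passes_tests(generated, task))  # compilation + semantic checks
        per_task_rates[task["task_id"]] = mean(outcomes)    # step 3: per-task aggregation
    # Representative metric: average pass rate over all tasks in the dataset.
    return mean(per_task_rates.values())
```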

Evaluation

The evaluation process involves several layers:

Compilation Test

First, the paper checks whether the AI-generated code can be compiled without errors. This is a crucial baseline, since successful compilation indicates basic syntactic correctness.

ChatGPT vs. Gemini
  • ChatGPT produced a higher rate of errors caused by missing library imports.
  • Google Gemini struggled notably with incomplete code, often cut short by output token limits.

Even though compilation errors are comparatively easy to fix, they highlight the need for automated tools that identify such issues promptly.
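
Since HumanEval and ClassEval target Python, a compilation-style check reduces to verifying that the generated source parses and that its top-level definitions load without error. The helper below is an illustrative approximation of such a check, not the authors' tooling; it would catch the truncated-code and missing-import failures described above.

```python
def compiles_cleanly(source: str) -> bool:
    """Return True if the generated source parses and its top-level code
    (typically just imports and definitions) runs without error."""
    try:
        code_obj = compile(source, "<generated>", "exec")  # syntax / completeness check
    except SyntaxError:
        return False  # e.g. code cut off by an output token limit
    try:
        exec(code_obj, {})  # surfaces ImportError / NameError at definition time
    except Exception:
        return False
    return True
```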

Semantic Test

The more significant test checks whether the generated code actually performs the task it is supposed to.
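
For HumanEval-style tasks, this amounts to executing the benchmark's hidden unit tests against the generated function. The sketch below assumes the task record shown in the Datasets section and is, again, only an approximation of the authors' harness.

```python
def passes_semantic_test(source: str, task: dict) -> bool:
    """Run the task's hidden check() tests against the generated code."""
    namespace = {}
    try:
        exec(source, namespace)        # define the candidate implementation
        exec(task["test"], namespace)  # define check(candidate)
        namespace["check"](namespace[task["entry_point"]])  # raises on failure
    except Exception:                  # assertion failure, crash, or missing symbol
        return False
    return True
```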

Results
  • ChatGPT: Achieved a pass rate of 68.9% on HumanEval and 30% on ClassEval.
  • Google Gemini: Achieved 54.878% on HumanEval and 17% on ClassEval.

Despite ChatGPT showing better performance overall, both models exhibited substantial issues with more complex tasks.

Practical Implementation Test

An experiment with developers using AI to create a Java program for managing a card collection provides further insights.

  • Productivity: Both AI chatbots helped speed up initial code generation significantly.
  • Quality: However, they often introduced code smells—suboptimal code patterns—and required multiple prompt refinements and manual fixes.

Implications and Future Work

Practical Implications

The findings provide valuable insights for developers and businesses considering the use of AI for coding tasks:

  • Productivity Boost: While AI can expedite code writing, it is not foolproof and needs thorough vetting and refinement.
  • Reliability: Given the presence of semantic errors and code smells, relying solely on AI-generated code for mission-critical systems is risky.

Future Developments

There are several promising areas for ongoing research:

  • Premium Versions: Assessing the paid tiers of these models could show whether accuracy and functionality improve over the free versions evaluated here.
  • Industry Applications: Studying how companies integrate these AI tools in real-world settings can offer deeper insights into their impact on productivity and code quality.

Conclusion

While AI chatbots like ChatGPT and Google Gemini can be powerful tools for programming, they are not yet a substitute for experienced developers. These AI models can certainly aid in speeding up certain aspects of development but require human oversight to ensure high-quality, maintainable, and secure code. Going forward, continuous improvements and real-world testing will be key to fully harnessing the potential of AI in software development.

Authors (3)
  1. Luc Bryan Heitz
  2. Joun Chamas
  3. Christopher Scherb