Evaluating Programming Skills of LLMs
Introduction
With the rise of AI chatbots, the programming world is seeing rapid shifts in how code is written. This paper takes a close look at two of the leading AI chatbots, OpenAI's ChatGPT and Google's Gemini, and in particular their ability to produce high-quality code. The authors compare the free versions of these models on standardized datasets and in real-world scenarios to assess how well they generate functional and efficient code.
Methodology
To evaluate these AI chatbots, the researchers follow a systematic approach:
Datasets
The paper employs two well-recognized datasets: HumanEval and ClassEval. These datasets include a variety of coding tasks designed to test the chatbots' ability to generate both basic and complex code structures.
- HumanEval: Focuses on algorithmic solutions requiring functional code; an illustrative task in this style is sketched after this list.
- ClassEval: Targets object-oriented programming challenges, assessing the creation of class structures and more interconnected methods.
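To make the task format concrete, here is a hypothetical example written in the HumanEval style (not quoted from the dataset): the model is given a function signature and docstring and must produce a working body, which is then scored against hidden unit tests.

```python
from typing import List

# Prompt as the chatbot would see it: a signature plus docstring. This is a
# hypothetical task written in the HumanEval style, not quoted from the dataset.
def rolling_max(numbers: List[int]) -> List[int]:
    """Return a list where element i is the maximum of numbers[0..i].

    >>> rolling_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    # A reference solution of the kind a generated completion is judged against.
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Hidden unit test of the sort the benchmark uses to score a completion.
assert rolling_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```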
Data Collection Process
- Query: Each task from the datasets is presented as a detailed prompt.
- Execution of Queries: Prompts are submitted multiple times to ChatGPT and Gemini to account for variability.
- Result Aggregation: Responses are analyzed for consistency and performance, with results averaged to derive a representative metric; a rough sketch of this averaging follows below.
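As a rough illustration of how repeated submissions can be folded into a single number, the sketch below averages pass/fail outcomes per task and then across tasks. The data and structure are hypothetical, not the authors' actual pipeline.

```python
from statistics import mean

# Hypothetical record of repeated runs: task id -> one boolean per submission
# of the same prompt (True means every test for that task passed).
runs = {
    "task_001": [True, True, False],
    "task_002": [False, False, False],
    "task_003": [True, True, True],
}

# Average within each task first, then across tasks, to get one representative metric.
per_task = {task: mean(outcomes) for task, outcomes in runs.items()}
overall = mean(per_task.values())

print(f"overall pass rate: {overall:.1%}")  # 55.6% for this toy data
```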
Evaluation
The evaluation process involves several layers:
Compilation Test
First, the paper checks whether the AI-generated code compiles without errors. This matters because successful compilation is a minimal baseline for correctness.
ChatGPT vs. Gemini
- ChatGPT experienced a higher rate of missing library import errors.
- Google Gemini notably struggled with incomplete code, often due to token limits.
Even though compilation errors are easier to fix, they highlight the need for automated tools to identify these issues promptly.
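One minimal form such an automated check could take, at least for Python output, is sketched below: compile the generated source to catch syntax errors or truncation, then import it in isolation so missing-library errors surface immediately. This is an assumption about how such tooling might look, not the paper's own harness.

```python
import importlib.util
import os
import tempfile

def quick_check(source: str) -> str:
    """Classify generated code as 'ok', a syntax error, or an import/runtime error."""
    try:
        compile(source, "<generated>", "exec")  # catches syntax errors and truncated code
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    # Write to a temporary module and import it, so missing imports
    # (ModuleNotFoundError, NameError at top level) show up without hand inspection.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        spec = importlib.util.spec_from_file_location("generated", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return "ok"
    except Exception as exc:
        return f"import/runtime error: {type(exc).__name__}: {exc}"
    finally:
        os.unlink(path)

print(quick_check("def f(x:\n"))                # syntax error (truncated code)
print(quick_check("import nonexistent_pkg\n"))  # import/runtime error (missing library)
```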
Semantic Test
The more significant test checks whether the code actually performs the task it is supposed to, that is, whether it passes functional tests that encode the intended behaviour.
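A minimal sketch of such a semantic check, using a hypothetical generated solution and hand-written test cases rather than the benchmark's real ones:

```python
def run_semantic_tests(candidate, test_cases):
    """Return the fraction of test cases the candidate function satisfies."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a semantic failure, not a compile failure
    return passed / len(test_cases)

# Hypothetical generated solution: it compiles fine but mishandles the empty list.
def generated_rolling_max(numbers):
    result, current = [], numbers[0]
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

tests = [
    (([1, 3, 2, 5, 4],), [1, 3, 3, 5, 5]),
    (([],), []),  # fails: IndexError on numbers[0]
]
print(run_semantic_tests(generated_rolling_max, tests))  # 0.5
```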
Results
- ChatGPT: Achieved a pass rate of 68.9% for HumanEval and 30% for ClassEval.
- Google Gemini: Achieved 54.878% for HumanEval and 17% for ClassEval.
Despite ChatGPT showing better performance overall, both models exhibited substantial issues with more complex tasks.
Practical Implementation Test
An experiment with developers using AI to create a Java program for managing a card collection provides further insights.
- Productivity: Both AI chatbots helped speed up initial code generation significantly.
- Quality: However, they often introduced code smells (suboptimal code patterns) and required multiple prompt refinements and manual fixes; a small illustration of a code smell follows below.
Implications and Future Work
Practical Implications
The findings provide valuable insights for developers and businesses considering the use of AI for coding tasks:
- Productivity Boost: While AI can expedite code writing, it is not foolproof and needs thorough vetting and refinement.
- Reliability: Given the presence of semantic errors and code smells, relying solely on AI-generated code for mission-critical systems is risky.
Future Developments
There are several promising areas for ongoing research:
- Premium Versions: Assessing the paid versions of these AI models could show whether they deliver significant improvements in accuracy and functionality.
- Industry Applications: Studying how companies integrate these AI tools in real-world settings can offer deeper insights into their impact on productivity and code quality.
Conclusion
While AI chatbots like ChatGPT and Google Gemini can be powerful tools for programming, they are not yet a substitute for experienced developers. These AI models can certainly aid in speeding up certain aspects of development but require human oversight to ensure high-quality, maintainable, and secure code. Going forward, continuous improvements and real-world testing will be key to fully harnessing the potential of AI in software development.