- The paper demonstrates that ChatGPT-3.5 achieves 92% accuracy on easy problems but drops to 79% and 51% on medium and hard tasks, respectively.
- The analysis reveals that prompt engineering, including failed-test-case integration and chain-of-thought prompting, boosts solution accuracy significantly, with improvements of up to 60% on medium problems.
- The study also highlights language-specific performance: results are strong in Python and competent in Java and C++, while the model fails entirely on less represented languages such as Elixir and Racket.
Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis
This paper investigates the performance of ChatGPT's GPT-3.5-turbo model in generating code solutions for algorithmic problems of varying difficulty levels on the LeetCode platform. The paper aims to comprehensively evaluate three main aspects: the model's capability to solve easy, medium, and hard coding problems; the potential of prompt engineering to enhance solution accuracy; and the proficiency of the model across different programming languages.
Model Performance Across Difficulty Levels
The researchers evaluated GPT-3.5-turbo on a large dataset of 1,475 LeetCode problems, categorized by difficulty level. The model performed strongly on easy problems, solving 92% of them successfully. However, performance dropped significantly on medium and hard problems, with success rates of 79% and 51%, respectively. This decline supports the hypothesis that the model struggles more as complexity increases, reflecting limitations in its reasoning and problem-solving abilities at higher difficulty levels.
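An evaluation of this kind reduces to bucketing per-problem pass/fail outcomes by difficulty. The sketch below shows one minimal way to compute those per-difficulty accuracy figures; the record format and data are illustrative assumptions, not the paper's actual harness.

```python
from collections import defaultdict

# Hypothetical evaluation records: (difficulty, solved) pairs, as might be
# collected by running generated solutions against an online judge.
results = [
    ("easy", True), ("easy", True), ("easy", False),
    ("medium", True), ("medium", False),
    ("hard", False), ("hard", True),
]

def accuracy_by_difficulty(records):
    """Aggregate pass rates per difficulty bucket."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for difficulty, solved in records:
        totals[difficulty] += 1
        if solved:
            passed[difficulty] += 1
    return {d: passed[d] / totals[d] for d in totals}

print(accuracy_by_difficulty(results))
```

With the real 1,475-problem dataset, the same aggregation would yield the 92%/79%/51% figures reported above.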
Impact of Prompt Engineering
The paper explores the efficacy of various prompt engineering strategies for improving the model's performance. Among these, incorporating failed test cases into subsequent prompts emerged as the most effective, particularly for medium-difficulty problems, where it improved accuracy by up to 60%. Chain-of-thought (CoT) prompting also helped, especially on easier problems, with a noted 29% improvement, suggesting that systematically breaking down tasks can guide the model toward more accurate solutions. Moreover, switching to GPT-4 significantly increased performance across all difficulty levels, reinforcing the importance of model advancements alongside prompt modifications.
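The failed-test-case strategy is essentially a generate-test-repair loop: run the candidate solution, and if any tests fail, fold the failing cases back into the next prompt. The sketch below illustrates that loop; `generate_solution` is a hypothetical stub standing in for the actual model call, so the control flow is runnable without an API.

```python
def generate_solution(prompt):
    # Hypothetical stand-in for an LLM call: returns a buggy solution
    # first, then a corrected one once failure feedback is in the prompt.
    if "Failed test" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"

def run_tests(code, tests):
    """Execute candidate code and return the test cases it fails."""
    namespace = {}
    exec(code, namespace)
    fn = namespace["add"]
    return [(args, expected) for args, expected in tests
            if fn(*args) != expected]

def solve_with_feedback(problem, tests, max_rounds=3):
    """Re-prompt with failing test cases until all tests pass."""
    prompt = problem
    for _ in range(max_rounds):
        code = generate_solution(prompt)
        failures = run_tests(code, tests)
        if not failures:
            return code
        # Fold the failing cases back into the next prompt.
        feedback = "; ".join(f"Failed test: add{args} expected {exp}"
                             for args, exp in failures)
        prompt = f"{problem}\n{feedback}"
    return None

tests = [((1, 2), 3), ((0, 5), 5)]
solution = solve_with_feedback("Write add(a, b) returning a + b.", tests)
```

In a real setting, `run_tests` would execute against the platform's judge and `generate_solution` would call the chat model with the accumulated conversation.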
Multilingual Performance Evaluation
The paper includes an analysis of ChatGPT's ability to generate code in languages other than Python, namely Java, C++, Elixir, Erlang, and Racket. The model showed competent results in Java and C++, solving approximately 70% and 50% of the problems that were solvable in Python, though with variation that highlights language-specific challenges. By contrast, for the less common languages Elixir and Racket, the model failed to solve any problems, likely due to insufficient representation in the training data.
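The cross-language figures above are relative: each language's pass rate is measured against the set of problems the model solved in Python. A minimal sketch of that comparison, with made-up problem IDs:

```python
# Solved problem IDs per language (illustrative data, not from the paper).
solved = {
    "python": {1, 2, 3, 4, 5},
    "java": {1, 2, 4},
    "elixir": set(),
}

def relative_solve_rate(lang, baseline="python"):
    """Fraction of baseline-solvable problems also solved in `lang`."""
    base = solved[baseline]
    return len(solved[lang] & base) / len(base)

print(relative_solve_rate("java"))    # share of Python-solvable problems
print(relative_solve_rate("elixir"))  # zero for unsolved languages
```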
Further Observations
Additional results focus on the types of algorithmic problems where ChatGPT performs well or poorly. The model excels at hash table, search, and divide-and-conquer tasks but struggles with database and dynamic programming problems, consistent with the difficulties posed by complex logic and precise syntax. A further analysis correlating solution length with correctness shows that shorter solutions tend to be correct more often, especially on hard problems.
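The length-versus-correctness analysis amounts to comparing lines of code between accepted and rejected solutions. A hedged sketch of that comparison, using illustrative snippets rather than the paper's data:

```python
def mean_loc(solutions):
    """Average lines of code across a list of solution strings."""
    lengths = [len(code.splitlines()) for code in solutions]
    return sum(lengths) / len(lengths)

# Illustrative accepted (short) and rejected (longer) solutions.
accepted = [
    "def f(x):\n    return x",
    "def g(x):\n    return x * 2",
]
rejected = [
    "def h(x):\n    y = 0\n    for i in range(x):\n        y += i\n    return y",
]

print(mean_loc(accepted), mean_loc(rejected))
```

Applied to a real corpus of judged submissions, a gap in mean length between the two groups would reflect the conciseness trend the paper reports.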
Conclusions and Future Directions
This research underscores the nuanced capabilities and limitations of GPT-3.5-turbo in automated code generation. While the model is adept at simpler problem statements, its difficulty with complex problems is apparent, warranting further research into strengthening code reasoning, including through more capable models such as GPT-4. Future work could expand support for diverse programming paradigms and improve robustness through refined prompts and richer training data, to better accommodate the complexities seen across coding languages and problem types.