- The paper demonstrates that ChatGPT-3.5 achieves 92% accuracy on easy problems but drops to 79% and 51% on medium and hard tasks, respectively.
- The analysis reveals that prompt engineering, including failed-test-case integration and chain-of-thought prompting, boosts solution accuracy significantly, with improvements of up to 60% on medium problems.
- The study also highlights language-specific performance: results are strong in Python and competent in Java and C++, while the model fails entirely on less represented languages such as Elixir and Racket.
Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis
This paper investigates the performance of ChatGPT's GPT-3.5-turbo model in generating code solutions for algorithmic problems of varying difficulty levels on the LeetCode platform. The paper aims to comprehensively evaluate three main aspects: the model's capability to solve easy, medium, and hard coding problems; the potential of prompt engineering to enhance solution accuracy; and the proficiency of the model across different programming languages.
Model Performance Across Difficulty Levels
The researchers evaluated GPT-3.5-turbo on a large dataset of 1,475 LeetCode problems, categorized by difficulty level. The model performed strongly on easy problems, solving 92% of them successfully. However, performance dropped significantly on medium and hard problems, with success rates of 79% and 51%, respectively. This decline supports the hypothesis that the model struggles more as complexity increases, reflecting limitations in its reasoning and problem-solving abilities at higher difficulty levels.
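An evaluation of this kind reduces to bucketing per-problem pass/fail outcomes by difficulty. The sketch below shows one minimal way to compute those per-difficulty accuracy figures; the record format and data are illustrative assumptions, not the paper's actual harness.

```python
from collections import defaultdict

# Hypothetical evaluation records: (difficulty, solved) pairs, as might be
# collected by running generated solutions against an online judge.
results = [
    ("easy", True), ("easy", True), ("easy", False),
    ("medium", True), ("medium", False),
    ("hard", False), ("hard", True),
]

def accuracy_by_difficulty(records):
    """Aggregate pass rates per difficulty bucket."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for difficulty, solved in records:
        totals[difficulty] += 1
        if solved:
            passed[difficulty] += 1
    return {d: passed[d] / totals[d] for d in totals}

print(accuracy_by_difficulty(results))
```

With the real 1,475-problem dataset, the same aggregation would yield the 92%/79%/51% figures reported above.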
Impact of Prompt Engineering
The paper explores the efficacy of various prompt engineering strategies for improving the model's performance. Among these, incorporating failed test cases into subsequent prompts emerged as the most effective, particularly for medium-difficulty problems, where it improved accuracy by up to 60%. Chain-of-thought (CoT) prompting also helped, especially on easier problems, with a noted 29% improvement, suggesting that systematically breaking down tasks can guide the model toward more accurate solutions. Moreover, switching to GPT-4 significantly increased performance across all difficulty levels, reinforcing the importance of model advancements alongside prompt modifications.
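The failed-test-case strategy is essentially a generate-test-repair loop: run the candidate solution, and if any tests fail, fold the failing cases back into the next prompt. The sketch below illustrates that loop; `generate_solution` is a hypothetical stub standing in for the actual model call, so the control flow is runnable without an API.

```python
def generate_solution(prompt):
    # Hypothetical stand-in for an LLM call: returns a buggy solution
    # first, then a corrected one once failure feedback is in the prompt.
    if "Failed test" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"

def run_tests(code, tests):
    """Execute candidate code and return the test cases it fails."""
    namespace = {}
    exec(code, namespace)
    fn = namespace["add"]
    return [(args, expected) for args, expected in tests
            if fn(*args) != expected]

def solve_with_feedback(problem, tests, max_rounds=3):
    """Re-prompt with failing test cases until all tests pass."""
    prompt = problem
    for _ in range(max_rounds):
        code = generate_solution(prompt)
        failures = run_tests(code, tests)
        if not failures:
            return code
        # Fold the failing cases back into the next prompt.
        feedback = "; ".join(f"Failed test: add{args} expected {exp}"
                             for args, exp in failures)
        prompt = f"{problem}\n{feedback}"
    return None

tests = [((1, 2), 3), ((0, 5), 5)]
solution = solve_with_feedback("Write add(a, b) returning a + b.", tests)
```

In a real setting, `run_tests` would execute against the platform's judge and `generate_solution` would call the chat model with the accumulated conversation.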
Multilingual Performance Evaluation
The paper includes an analysis of ChatGPT's ability to generate code in languages other than Python, namely Java, C++, Elixir, Erlang, and Racket. The model showed competent results in Java and C++, solving approximately 70% and 50% of the problems that were solvable in Python, though with variation that highlights language-specific challenges. By contrast, for the less common languages Elixir and Racket, the model failed to solve any problems, likely due to insufficient representation in the training data.
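The cross-language figures above are relative: each language's pass rate is measured against the set of problems the model solved in Python. A minimal sketch of that comparison, with made-up problem IDs:

```python
# Solved problem IDs per language (illustrative data, not from the paper).
solved = {
    "python": {1, 2, 3, 4, 5},
    "java": {1, 2, 4},
    "elixir": set(),
}

def relative_solve_rate(lang, baseline="python"):
    """Fraction of baseline-solvable problems also solved in `lang`."""
    base = solved[baseline]
    return len(solved[lang] & base) / len(base)

print(relative_solve_rate("java"))    # share of Python-solvable problems
print(relative_solve_rate("elixir"))  # zero for unsolved languages
```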
Further Observations
Additional results focus on the types of algorithmic problems where ChatGPT performs well or poorly. The model excels at hash table, search, and divide-and-conquer tasks but struggles with database and dynamic programming problems, consistent with the difficulties posed by complex logic and precise syntax. A further analysis correlating solution length with correctness shows that shorter solutions tend to be correct more often, especially on hard problems.
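The length-versus-correctness analysis amounts to comparing lines of code between accepted and rejected solutions. A hedged sketch of that comparison, using illustrative snippets rather than the paper's data:

```python
def mean_loc(solutions):
    """Average lines of code across a list of solution strings."""
    lengths = [len(code.splitlines()) for code in solutions]
    return sum(lengths) / len(lengths)

# Illustrative accepted (short) and rejected (longer) solutions.
accepted = [
    "def f(x):\n    return x",
    "def g(x):\n    return x * 2",
]
rejected = [
    "def h(x):\n    y = 0\n    for i in range(x):\n        y += i\n    return y",
]

print(mean_loc(accepted), mean_loc(rejected))
```

Applied to a real corpus of judged submissions, a gap in mean length between the two groups would reflect the conciseness trend the paper reports.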
Conclusions and Future Directions
This research underscores the nuanced capabilities and limitations of GPT-3.5-turbo in automated code generation. While the model is adept at simpler problem statements, its difficulty with complex problems is apparent, warranting further research into strengthening code reasoning, including through more capable models such as GPT-4. Future work could expand support for diverse programming paradigms and improve robustness through refined prompts and richer training data, to better accommodate the complexities seen across coding languages and problem types.