Where Do Large Language Models Fail When Generating Code? (2406.08731v2)

Published 13 Jun 2024 in cs.SE

Abstract: LLMs have shown great potential in code generation. However, current LLMs still cannot reliably generate correct code. Moreover, it is unclear what kinds of code generation errors LLMs make. To address this, we conducted an empirical study analyzing incorrect code snippets generated by six popular LLMs on the HumanEval dataset. We analyzed these errors along two dimensions of error characteristics -- semantic characteristics and syntactic characteristics -- and derived a comprehensive code generation error taxonomy for LLMs through open coding and thematic analysis. We then labeled all 557 incorrect code snippets based on this taxonomy. Our results showed that the six LLMs exhibited similar distributions of syntactic characteristics but different distributions of semantic characteristics. Furthermore, we analyzed the correlation between different error characteristics and factors such as task complexity, code length, and test-pass rate. Finally, we highlight the challenges that LLMs may encounter when generating code and propose implications for future research on reliable code generation with LLMs.
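
The "test-pass rate" factor the abstract correlates with error characteristics can be made concrete: execute each generated snippet and count how many of the task's unit tests it passes. Below is a minimal Python sketch of that metric, not the authors' actual evaluation harness; the `Snippet` structure, the helper names, and the example task are illustrative assumptions. A rate of 0 typically signals a syntactic or crashing error, while a rate strictly between 0 and 1 points at the kind of semantic error the taxonomy classifies.

```python
# Hypothetical sketch of a per-snippet test-pass rate, in the spirit of the
# paper's analysis. Not the authors' tooling; all names are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Snippet:
    task_id: str                          # e.g. a HumanEval task identifier
    code: str                             # LLM-generated solution source
    tests: List[Callable[[Dict], None]]   # each test asserts on the exec'd namespace


def _passes(test: Callable[[Dict], None], namespace: Dict) -> bool:
    """Run one unit test against the executed snippet; True if no exception."""
    try:
        test(namespace)
        return True
    except Exception:
        return False


def test_pass_rate(snippet: Snippet) -> float:
    """Execute the generated code, then run each unit test against it.

    Returns the fraction of tests that pass. A snippet that fails to execute
    at all (a syntax or import error) scores 0.0.
    """
    namespace: Dict = {}
    try:
        exec(snippet.code, namespace)     # may raise SyntaxError, NameError, ...
    except Exception:
        return 0.0
    if not snippet.tests:
        return 0.0
    return sum(_passes(t, namespace) for t in snippet.tests) / len(snippet.tests)


def _assert(cond: bool) -> None:
    assert cond


# Example: a semantically wrong snippet (boundary bug: `<=` should be `<`)
# passes one test but not the other, so its pass rate is 0.5.
snippet = Snippet(
    task_id="task-00",
    code="def below_threshold(xs, t):\n    return all(x <= t for x in xs)",
    tests=[
        lambda ns: _assert(ns["below_threshold"]([1, 2], 5) is True),
        lambda ns: _assert(ns["below_threshold"]([1, 5], 5) is False),  # fails
    ],
)

print(test_pass_rate(snippet))  # 0.5
```

Partitioning failures this way, into snippets that never execute versus snippets that pass only some tests, mirrors the abstract's split between syntactic and semantic error characteristics.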

Authors (7)
  1. Zhijie Wang (36 papers)
  2. Zijie Zhou (21 papers)
  3. Da Song (10 papers)
  4. Yuheng Huang (26 papers)
  5. Shengmai Chen (2 papers)
  6. Lei Ma (195 papers)
  7. Tianyi Zhang (262 papers)
Citations (5)