Exploring and Evaluating Hallucinations in LLM-Powered Code Generation (2404.00971v2)

Published 1 Apr 2024 in cs.SE and cs.AI

Abstract: The rise of LLMs has significantly advanced many applications on software engineering tasks, particularly in code generation. Despite their promising performance, LLMs are prone to generating hallucinations, meaning they might produce outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications. Existing work mainly focuses on investigating hallucinations in the domain of natural language generation (NLG), leaving a gap in understanding the types and extent of hallucinations in the context of code generation. To bridge the gap, we conducted a thematic analysis of LLM-generated code to summarize and categorize the hallucinations present in it. Our study established a comprehensive taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories of hallucinations depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, we systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness. Based on the results, we proposed HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations. Hallucination recognition and mitigation experiments with HalluCode and HumanEval show that existing LLMs face great challenges in recognizing hallucinations, particularly in identifying their types, and are hardly able to mitigate them. We believe our findings will shed light on future research about hallucination evaluation, detection, and mitigation, ultimately paving the way for building more effective and reliable code LLMs in the future.

Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

The paper "Exploring and Evaluating Hallucinations in LLM-Powered Code Generation" explores the intricacies of hallucinations in code generation using LLMs. Hallucinations refer to instances where models generate outputs that diverge from user intent, display inconsistencies, or contradict known information. While much research has been conducted on hallucinations in natural language generation, this paper focuses on understanding and categorizing such occurrences in code generation, a less explored area.

The authors undertake a thematic analysis to establish a comprehensive taxonomy of hallucinations in LLM-generated code. Their investigation identifies five primary categories: Intent Conflicting, Context Inconsistency, Context Repetition, Dead Code, and Knowledge Conflicting, with the three context-related categories grouped under the broader theme of Context Deviation. Each category is further divided into subtypes based on the hallucinatory behaviors observed in code. For instance, Intent Conflicting is split into overall semantic conflicts and local semantic conflicts, depending on the extent and impact of the hallucination.
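To make the taxonomy concrete, the following hypothetical snippets sketch how several of these categories might surface in generated Python code; the requirements, function names, and the deliberately defective lines are invented for illustration and are not drawn from the paper's dataset.

```python
# Hypothetical snippets illustrating several hallucination categories.
# The requirements and the deliberately defective lines are invented for illustration.

# Intent Conflicting: the requirement asks for a sum, but the code computes a product.
# Requirement: "Return the sum of all numbers in the list."
def sum_numbers(nums):
    result = 1
    for n in nums:
        result *= n  # deliberately conflicts with the stated intent (product vs. sum)
    return result

# Context Inconsistency: later code contradicts an earlier definition in the same context.
def mean(nums):
    total = sum(nums)
    count = len(nums)
    return total / counts  # `counts` was never defined; inconsistent with `count` above

# Dead Code: generated statements that can never execute or affect the result.
def absolute(x):
    if x < 0:
        return -x
    return x
    x = x + 1  # unreachable line produced after the function has already returned

# Knowledge Conflicting: the code assumes an API that does not exist.
def head(nums):
    return nums.first()  # Python lists have no `first()` method; conflicts with API knowledge
```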

The research reveals substantial diversity in hallucination types across different LLMs. Notably, models like CodeGen, CodeRL, and ChatGPT are analyzed for their hallucination behaviors, revealing distinctive patterns and distributions. For example, the paper finds that CodeRL often produces outputs with significant deviations in intent, possibly due to its training with reinforcement learning that emphasizes functional integrity. ChatGPT, known for its advanced prompt understanding, exhibits fewer intent conflicts but is more prone to context deviations and knowledge conflicts.

A critical part of the paper is the development of "HalluCode," a benchmark to evaluate the effectiveness of LLMs in recognizing and mitigating hallucinations. HalluCode comprises Python code tasks with annotated hallucinations, covering the different types identified in the paper. It's intended to facilitate the evaluation of various LLMs' performance in understanding and correcting hallucinations within generated code.
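The benchmark's exact record format is not spelled out here, but a HalluCode-style task can be pictured as a natural-language requirement paired with generated code and hallucination annotations; the field names and identifier scheme below are assumptions made for illustration.

```python
# A hypothetical HalluCode-style task record. Field names are assumptions
# for illustration; the actual benchmark format may differ.
task = {
    "task_id": "hallucode/0",                    # hypothetical identifier
    "requirement": "Return the n-th Fibonacci number, with fib(0) == 0.",
    "generated_code": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return b\n"                         # returns fib(n+1): an intent conflict
    ),
    "has_hallucination": True,
    "hallucination_type": "Intent Conflicting",  # one of the taxonomy's five categories
}
```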

The paper also investigates the relationship between hallucinations and the functional correctness of the generated code. It finds that although not every functional error stems from a hallucination, hallucinations frequently signal underlying code quality issues. Specifically, certain hallucination types, such as Intent Conflicting and Context Inconsistency, correlate strongly with incorrect outputs, underscoring the need for robust hallucination detection mechanisms in code LLMs.
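As a rough sketch of this kind of analysis, the snippet below tallies how often each annotated hallucination type co-occurs with a failing functional test; the record format and the toy data are hypothetical, not figures from the paper.

```python
from collections import Counter, defaultdict

# Toy annotated samples: each pairs hallucination labels with whether the
# generated code passed its functional tests. Purely illustrative data.
samples = [
    {"types": ["Intent Conflicting"], "passed": False},
    {"types": ["Dead Code"], "passed": True},
    {"types": [], "passed": True},
    {"types": ["Context Inconsistency", "Dead Code"], "passed": False},
]

totals = Counter()
failures = defaultdict(int)
for sample in samples:
    for t in sample["types"]:
        totals[t] += 1
        if not sample["passed"]:
            failures[t] += 1

# For each hallucination type, the share of its occurrences found in failing code.
for t, n in totals.items():
    print(f"{t}: {failures[t] / n:.0%} of occurrences appear in failing programs")
```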

From an evaluative standpoint, the authors conduct experiments using HalluCode on various models, including ChatGPT, Code Llama, and DeepSeek-Coder. These experiments demonstrate that recognizing and mitigating hallucinations poses significant challenges, even for sophisticated models. Accuracy rates for hallucination recognition hover around 89% for ChatGPT, indicating room for improvement. Notably, the task of mitigating hallucinations proves more complex, with models struggling to correct identified issues consistently.
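A recognition experiment of this kind can be pictured as prompting the model under test to judge each sample and scoring its verdicts against the gold annotations. The sketch below assumes HalluCode-style records like the one shown earlier; `query_model` is a stand-in to be replaced with a real model call and is not an API from the paper or any particular library.

```python
def query_model(prompt: str) -> str:
    """Stand-in for the code LLM under evaluation; replace with a real API call."""
    return "Yes, this looks like an Intent Conflicting hallucination."  # canned reply

def evaluate_recognition(tasks):
    """Score hallucination-existence and hallucination-type recognition accuracy."""
    exist_correct = 0
    type_correct = 0
    for task in tasks:
        prompt = (
            "Requirement:\n" + task["requirement"] + "\n\n"
            "Generated code:\n" + task["generated_code"] + "\n\n"
            "Does the code contain a hallucination? If so, name its type."
        )
        verdict = query_model(prompt)
        predicted_exists = verdict.lower().startswith("yes")
        exist_correct += predicted_exists == task["has_hallucination"]
        if task["has_hallucination"] and task["hallucination_type"].lower() in verdict.lower():
            type_correct += 1
    n = len(tasks)
    return exist_correct / n, type_correct / n

# Example usage: existence_acc, type_acc = evaluate_recognition([task])
```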

The implications of this research are multifaceted. Firstly, it underscores the necessity for better hallucination evaluation metrics within code generation, beyond traditional functional correctness tests. It also highlights the potential for developing advanced techniques to detect and mitigate hallucinations, enhancing the reliability and accuracy of code output by LLMs. Moreover, this paper lays a foundation for future exploration into hallucinations across various code generation tasks, extending beyond the NL2Code problem space.

Overall, this paper provides a detailed examination of hallucinations in code generation, offering valuable insights into their identification, classification, and impact. It sets the stage for future research endeavors aimed at addressing these challenges and refining code LLMs for more reliable and effective application in software development.

Authors (9)
  1. Fang Liu
  2. Yang Liu
  3. Lin Shi
  4. Houkun Huang
  5. Ruifeng Wang
  6. Zhen Yang
  7. Li Zhang
  8. Zhongqi Li
  9. Yuchi Ma