Exploring and Evaluating Hallucinations in LLM-Powered Code Generation (2404.00971v2)

Published 1 Apr 2024 in cs.SE and cs.AI

Abstract: The rise of LLMs has significantly advanced many applications on software engineering tasks, particularly in code generation. Despite their promising performance, LLMs are prone to generating hallucinations, meaning they might produce outputs that deviate from users' intent, exhibit internal inconsistencies, or misalign with factual knowledge, making the deployment of LLMs potentially risky in a wide range of applications. Existing work mainly focuses on investigating hallucinations in the domain of natural language generation (NLG), leaving a gap in understanding the types and extent of hallucinations in the context of code generation. To bridge the gap, we conducted a thematic analysis of LLM-generated code to summarize and categorize the hallucinations present in it. Our study established a comprehensive taxonomy of hallucinations in LLM-generated code, encompassing 5 primary categories of hallucinations depending on the conflicting objectives and varying degrees of deviation observed in code generation. Furthermore, we systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and their correlation with code correctness. Based on the results, we proposed HalluCode, a benchmark for evaluating the performance of code LLMs in recognizing hallucinations. Hallucination recognition and mitigation experiments with HalluCode and HumanEval show that existing LLMs face great challenges in recognizing hallucinations, particularly in identifying their types, and are hardly able to mitigate them. We believe our findings will shed light on future research about hallucination evaluation, detection, and mitigation, ultimately paving the way for building more effective and reliable code LLMs in the future.

Exploring and Evaluating Hallucinations in LLM-Powered Code Generation

The paper "Exploring and Evaluating Hallucinations in LLM-Powered Code Generation" explores the intricacies of hallucinations in code generation using LLMs. Hallucinations refer to instances where models generate outputs that diverge from user intent, display inconsistencies, or contradict known information. While much research has been conducted on hallucinations in natural language generation, this paper focuses on understanding and categorizing such occurrences in code generation, a less explored area.

The authors undertake a thematic analysis to establish a comprehensive taxonomy of hallucinations in LLM-generated code. Their investigation identifies five primary categories: Intent Conflicting, Context Inconsistency, Context Repetition, Dead Code, and Knowledge Conflicting, with the three context-related categories grouped under the broader theme of Context Deviation. Each category is further divided into subtypes based on the hallucinatory behaviors observed in code. For instance, Intent Conflicting is split into overall semantic conflicts and local semantic conflicts, depending on the extent and impact of the hallucination.
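To make the taxonomy concrete, the following hypothetical snippets sketch how several of these categories might surface in generated Python code; the requirements, function names, and the deliberately defective lines are invented for illustration and are not drawn from the paper's dataset.

```python
# Hypothetical snippets illustrating several hallucination categories.
# The requirements and the deliberately defective lines are invented for illustration.

# Intent Conflicting: the requirement asks for a sum, but the code computes a product.
# Requirement: "Return the sum of all numbers in the list."
def sum_numbers(nums):
    result = 1
    for n in nums:
        result *= n  # deliberately conflicts with the stated intent (product vs. sum)
    return result

# Context Inconsistency: later code contradicts an earlier definition in the same context.
def mean(nums):
    total = sum(nums)
    count = len(nums)
    return total / counts  # `counts` was never defined; inconsistent with `count` above

# Dead Code: generated statements that can never execute or affect the result.
def absolute(x):
    if x < 0:
        return -x
    return x
    x = x + 1  # unreachable line produced after the function has already returned

# Knowledge Conflicting: the code assumes an API that does not exist.
def head(nums):
    return nums.first()  # Python lists have no `first()` method; conflicts with API knowledge
```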

The research reveals substantial diversity in hallucination types across different LLMs. Notably, models like CodeGen, CodeRL, and ChatGPT are analyzed for their hallucination behaviors, revealing distinctive patterns and distributions. For example, the paper finds that CodeRL often produces outputs with significant deviations in intent, possibly due to its training with reinforcement learning that emphasizes functional integrity. ChatGPT, known for its advanced prompt understanding, exhibits fewer intent conflicts but is more prone to context deviations and knowledge conflicts.

A critical part of the paper is the development of "HalluCode," a benchmark to evaluate the effectiveness of LLMs in recognizing and mitigating hallucinations. HalluCode comprises Python code tasks with annotated hallucinations, covering the different types identified in the paper. It's intended to facilitate the evaluation of various LLMs' performance in understanding and correcting hallucinations within generated code.
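The benchmark's exact record format is not spelled out here, but a HalluCode-style task can be pictured as a natural-language requirement paired with generated code and hallucination annotations; the field names and identifier scheme below are assumptions made for illustration.

```python
# A hypothetical HalluCode-style task record. Field names are assumptions
# for illustration; the actual benchmark format may differ.
task = {
    "task_id": "hallucode/0",                    # hypothetical identifier
    "requirement": "Return the n-th Fibonacci number, with fib(0) == 0.",
    "generated_code": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return b\n"                         # returns fib(n+1): an intent conflict
    ),
    "has_hallucination": True,
    "hallucination_type": "Intent Conflicting",  # one of the taxonomy's five categories
}
```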

The paper also investigates the relationship between hallucinations and the functional correctness of the generated code. It finds that although not every functional error stems from a hallucination, hallucinations frequently signal underlying code quality issues. Specifically, certain hallucination types, such as Intent Conflicting and Context Inconsistency, correlate strongly with incorrect outputs, underscoring the need for robust hallucination detection mechanisms in code LLMs.
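As a rough sketch of this kind of analysis, the snippet below tallies how often each annotated hallucination type co-occurs with a failing functional test; the record format and the toy data are hypothetical, not figures from the paper.

```python
from collections import Counter, defaultdict

# Toy annotated samples: each pairs hallucination labels with whether the
# generated code passed its functional tests. Purely illustrative data.
samples = [
    {"types": ["Intent Conflicting"], "passed": False},
    {"types": ["Dead Code"], "passed": True},
    {"types": [], "passed": True},
    {"types": ["Context Inconsistency", "Dead Code"], "passed": False},
]

totals = Counter()
failures = defaultdict(int)
for sample in samples:
    for t in sample["types"]:
        totals[t] += 1
        if not sample["passed"]:
            failures[t] += 1

# For each hallucination type, the share of its occurrences found in failing code.
for t, n in totals.items():
    print(f"{t}: {failures[t] / n:.0%} of occurrences appear in failing programs")
```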

From an evaluative standpoint, the authors conduct experiments using HalluCode on various models, including ChatGPT, Code Llama, and DeepSeek-Coder. These experiments demonstrate that recognizing and mitigating hallucinations poses significant challenges, even for sophisticated models. Accuracy rates for hallucination recognition hover around 89% for ChatGPT, indicating room for improvement. Notably, the task of mitigating hallucinations proves more complex, with models struggling to correct identified issues consistently.
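A recognition experiment of this kind can be pictured as prompting the model under test to judge each sample and scoring its verdicts against the gold annotations. The sketch below assumes HalluCode-style records like the one shown earlier; `query_model` is a stand-in to be replaced with a real model call and is not an API from the paper or any particular library.

```python
def query_model(prompt: str) -> str:
    """Stand-in for the code LLM under evaluation; replace with a real API call."""
    return "Yes, this looks like an Intent Conflicting hallucination."  # canned reply

def evaluate_recognition(tasks):
    """Score hallucination-existence and hallucination-type recognition accuracy."""
    exist_correct = 0
    type_correct = 0
    for task in tasks:
        prompt = (
            "Requirement:\n" + task["requirement"] + "\n\n"
            "Generated code:\n" + task["generated_code"] + "\n\n"
            "Does the code contain a hallucination? If so, name its type."
        )
        verdict = query_model(prompt)
        predicted_exists = verdict.lower().startswith("yes")
        exist_correct += predicted_exists == task["has_hallucination"]
        if task["has_hallucination"] and task["hallucination_type"].lower() in verdict.lower():
            type_correct += 1
    n = len(tasks)
    return exist_correct / n, type_correct / n

# Example usage: existence_acc, type_acc = evaluate_recognition([task])
```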

The implications of this research are multifaceted. Firstly, it underscores the necessity for better hallucination evaluation metrics within code generation, beyond traditional functional correctness tests. It also highlights the potential for developing advanced techniques to detect and mitigate hallucinations, enhancing the reliability and accuracy of code output by LLMs. Moreover, this paper lays a foundation for future exploration into hallucinations across various code generation tasks, extending beyond the NL2Code problem space.

Overall, this paper provides a detailed examination of hallucinations in code generation, offering valuable insights into their identification, classification, and impact. It sets the stage for future research endeavors aimed at addressing these challenges and refining code LLMs for more reliable and effective application in software development.

Authors (9)
  1. Fang Liu
  2. Yang Liu
  3. Lin Shi
  4. Houkun Huang
  5. Ruifeng Wang
  6. Zhen Yang
  7. Li Zhang
  8. Zhongqi Li
  9. Yuchi Ma