
Showing LLM-Generated Code Selectively Based on Confidence of LLMs (2410.03234v1)

Published 4 Oct 2024 in cs.SE and cs.CL

Abstract: LLMs have shown impressive abilities in code generation, but they may generate erroneous programs. Reading a program takes ten times longer than writing it. Showing these erroneous programs to developers will waste developers' energies and introduce security risks to software. To address the above limitations, we propose HonestCoder, a novel LLM-based code generation approach. HonestCoder selectively shows the generated programs to developers based on LLMs' confidence. The confidence provides valuable insights into the correctness of generated programs. To achieve this goal, we propose a novel approach to estimate LLMs' confidence in code generation. It estimates confidence by measuring the multi-modal similarity between LLMs-generated programs. We collect and release a multilingual benchmark named TruthCodeBench, which consists of 2,265 samples and covers two popular programming languages (i.e., Python and Java). We apply HonestCoder to four popular LLMs (e.g., DeepSeek-Coder and Code Llama) and evaluate it on TruthCodeBench. Based on the experiments, we obtain the following insights. (1) HonestCoder can effectively estimate LLMs' confidence and accurately determine the correctness of generated programs. For example, HonestCoder outperforms the state-of-the-art baseline by 27.79% in AUROC and 63.74% in AUCPR. (2) HonestCoder can decrease the number of erroneous programs shown to developers. Compared to eight baselines, it can show more correct programs and fewer erroneous programs to developers. (3) Compared to showing code indiscriminately, HonestCoder only adds slight time overhead (approximately 0.4 seconds per requirement). (4) We discuss future directions to facilitate the application of LLMs in software development. We hope this work can motivate broad discussions about measuring the reliability of LLMs' outputs in performing code-related tasks.

Insights into Selectively Displaying LLM-Generated Code Based on Confidence

The research paper entitled Showing LLM-Generated Code Selectively Based on Confidence of LLMs addresses the critical problem of reducing the exposure of erroneous LLM-generated code to developers. LLMs, despite their prowess in code generation, can produce faulty outputs, which can waste developers' time and introduce potential security vulnerabilities. The paper proposes a novel method to mitigate these issues by selectively showing generated code based on the model's confidence.

Methodology and Confidence Estimation

The authors introduce a framework called HonestCoder, designed to enhance the reliability of LLM outputs by estimating the confidence of the LLMs in their generated programs. This approach involves a confidence estimation technique that measures multi-modal similarities between multiple independently generated code samples. These modalities include lexical, syntactic, semantic, and data-flow analysis. The estimator checks for consistency across these samples, drawing on the observation that models, like humans, tend to be more consistent when confident.
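
To illustrate the consistency idea (this is a minimal sketch, not the paper's HonestCoder implementation), the code below scores confidence for one requirement as the average pairwise similarity of several sampled programs, using two stand-in modalities: lexical token overlap and AST node-type overlap. The sampling step, the equal modality weights, and the example programs are assumptions made for illustration.

```python
# Sketch of consistency-based confidence estimation (illustrative only).
# Assumes several independently sampled programs for one requirement are
# already available as strings; confidence is the mean pairwise similarity
# across two simple modalities: lexical (token overlap) and syntactic
# (AST node-type sequences).
import ast
import itertools
from difflib import SequenceMatcher

def lexical_similarity(a: str, b: str) -> float:
    """Token-level similarity via matching-subsequence ratio."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def syntactic_similarity(a: str, b: str) -> float:
    """Compare sequences of AST node types; 0.0 if either sample fails to parse."""
    try:
        nodes_a = [type(n).__name__ for n in ast.walk(ast.parse(a))]
        nodes_b = [type(n).__name__ for n in ast.walk(ast.parse(b))]
    except SyntaxError:
        return 0.0
    return SequenceMatcher(None, nodes_a, nodes_b).ratio()

def confidence(samples: list[str]) -> float:
    """Average multi-modal similarity over all pairs of generated programs."""
    pairs = list(itertools.combinations(samples, 2))
    if not pairs:
        return 0.0
    scores = [
        0.5 * lexical_similarity(a, b) + 0.5 * syntactic_similarity(a, b)
        for a, b in pairs
    ]
    return sum(scores) / len(scores)

# Hypothetical samples for the same requirement: two agree, one diverges.
samples = [
    "def add(a, b):\n    return a + b",
    "def add(x, y):\n    return x + y",
    "def add(a, b):\n    return a - b",
]
print(f"estimated confidence: {confidence(samples):.3f}")
```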

Benchmarking and Evaluation

To validate their approach, the authors created a multilingual benchmark, TruthCodeBench, containing 2,265 samples across Python and Java. Through comprehensive experiments on four popular LLMs, such as DeepSeek-Coder and Code Llama, the authors report significant improvements over existing methods. Key metrics include an improvement of 27.79% in AUROC and 63.74% in AUCPR compared to state-of-the-art baselines. These results underscore the effectiveness of HonestCoder in discerning the correctness of generated code and markedly reducing the amount of erroneous code shown to developers.
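
For context, the two headline metrics can be computed from per-sample confidence scores and correctness labels as in the brief sketch below; the labels and confidence values here are made up for illustration and are not drawn from TruthCodeBench.

```python
# Illustrative computation of AUROC and AUCPR for a confidence estimator.
from sklearn.metrics import roc_auc_score, average_precision_score

# 1 = the generated program is correct, 0 = erroneous (hypothetical labels).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
# Confidence assigned to each program by the estimator (hypothetical values).
confidences = [0.91, 0.35, 0.78, 0.66, 0.42, 0.15, 0.88, 0.55]

print(f"AUROC: {roc_auc_score(labels, confidences):.3f}")
print(f"AUCPR: {average_precision_score(labels, confidences):.3f}")
```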

Practical Implications

In practice, HonestCoder enhances software development workflows by filtering out low-confidence outputs, preserving developers' focus on productive coding tasks and mitigating the security risks of acting on erroneous suggestions. The approach adds minimal latency (approximately 0.4 seconds per requirement), demonstrating its feasibility for real-world use without substantial overhead.
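
A minimal sketch of the selective-display step follows, assuming a confidence estimator like the one sketched earlier and a hypothetical threshold; the paper's actual decision rule and threshold value are not reproduced here.

```python
# Hypothetical selective display: surface a generated program only when the
# estimated confidence clears a threshold; otherwise withhold it so the
# developer is not shown likely-erroneous code.
from typing import Optional

CONFIDENCE_THRESHOLD = 0.7  # assumed cut-off, not a value from the paper

def maybe_show(program: str, confidence_score: float) -> Optional[str]:
    """Return the program for display only if confidence is high enough."""
    if confidence_score >= CONFIDENCE_THRESHOLD:
        return program
    return None  # low confidence: do not display the generated code
```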

Future Directions

The authors point out several avenues for future work, centered on extending the understanding and application of LLM confidence. One direction involves developing more accurate confidence estimation methods, potentially drawing on internal model states or advanced tuning techniques. Another is integrating humans into the loop for requirements where model confidence is low, and extending the approach beyond code generation to other tasks such as vulnerability detection.

Conclusion

This paper establishes a foundational methodology for utilizing LLM confidence to selectively present generated code, offering a nuanced advancement over indiscriminate code suggestion. By filtering on confidence, the approach not only boosts developer productivity but also increases the safety and reliability of LLM-assisted programming. Given the substantial results presented, HonestCoder marks a meaningful step forward in AI-assisted software engineering, offering both a practical tool for developers and a rich field for further academic exploration.

Authors (5)
  1. Jia Li (380 papers)
  2. Yuqi Zhu (25 papers)
  3. Yongmin Li (32 papers)
  4. Ge Li (213 papers)
  5. Zhi Jin (160 papers)