The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations? (2402.19475v1)

Published 29 Feb 2024 in cs.SE, cs.AI, and cs.LG

Abstract: While LLMs are increasingly proficient at code generation, they still frequently generate incorrect programs. Many of these programs are obviously wrong, but others are more subtle and pass weaker correctness checks such as being able to compile. In this work, we focus on these counterfeit samples: programs sampled from an LLM that 1) have a high enough log-probability to be generated at a moderate temperature and 2) pass weak correctness checks. Overall, we discover that most models have a very shallow understanding of counterfeits, exhibited through three clear failure modes. First, models mistakenly classify them as correct. Second, models are worse at reasoning about the execution behaviour of counterfeits and often predict their execution results as if they were correct. Third, when asked to fix counterfeits, the likelihood of a model successfully repairing a counterfeit is often even lower than that of sampling a correct program from scratch. Counterfeits also have very unexpected properties: first, counterfeit programs for problems that are easier for a model to solve are not necessarily easier to detect, and are only slightly easier to execute and repair. Second, counterfeits from a given model are just as confusing to the model itself as they are to other models. Finally, both strong and weak models are able to generate counterfeit samples that equally challenge all models. In light of our findings, we recommend that care and caution be taken when relying on models to understand their own samples, especially when no external feedback is incorporated.
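The abstract's definition of a counterfeit sample (high enough log-probability at a moderate temperature, passes a weak correctness check such as compiling, yet is still incorrect) can be made concrete with a small filter. The sketch below is illustrative only and is not the paper's actual evaluation harness: `generate_candidates` is a hypothetical stand-in for whatever LLM sampling call is available, and the weak check here is simply that the candidate compiles as Python while failing an assumed test-suite string of assert statements.

```python
# Hypothetical sketch of filtering counterfeit samples from model generations,
# following the paper's two-part definition:
#   (1) sampled at a moderate temperature, and
#   (2) passing a weak correctness check (here: the program compiles)
# while still failing the full test suite.
# `generate_candidates` is a placeholder for an LLM sampling call, not a real API.

from typing import Callable, List


def compiles(program: str) -> bool:
    """Weak correctness check: the candidate parses/compiles as Python."""
    try:
        compile(program, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def passes_tests(program: str, tests: str) -> bool:
    """Strong correctness check: run a test-suite string against the candidate."""
    namespace: dict = {}
    try:
        exec(program, namespace)   # define the candidate solution
        exec(tests, namespace)     # assert statements raise on failure
        return True
    except Exception:
        return False


def find_counterfeits(
    generate_candidates: Callable[[str, float, int], List[str]],
    problem: str,
    tests: str,
    temperature: float = 0.8,
    num_samples: int = 20,
) -> List[str]:
    """Return samples that look plausible (compile) but are actually wrong."""
    candidates = generate_candidates(problem, temperature, num_samples)
    return [p for p in candidates if compiles(p) and not passes_tests(p, tests)]
```

Under these assumptions, the returned programs are exactly the ones the paper studies: they survive the weak check but would be caught by the hidden tests.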
