
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (2401.03065v1)

Published 5 Jan 2024 in cs.SE, cs.AI, and cs.LG

Abstract: We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark, which can be used to create future variations of the benchmark. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval do not show the same improvements on our benchmark. Third, we show that simple chain-of-thought (CoT) and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with CoT, achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction, highlighting the gap between open- and closed-source models. As no model is close to acing CRUXEval, we provide examples of consistent GPT-4 failures on simple programs as a lens into its code reasoning capabilities and areas for improvement.
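
A concrete instance makes the two tasks precise. The sketch below is illustrative, not drawn from the benchmark itself: CRUXEval frames each problem as an assertion over a short Python function, and the model is asked to fill in the masked side of the assertion. The function f here is hypothetical, written only to mimic the benchmark's 3-13 line style.

    def f(s):
        # A short string-manipulating function of the kind the
        # benchmark targets (hypothetical, for illustration only).
        return s.replace("a", "b")[::-1]

    # Output prediction: given f and the input, complete the assertion.
    #     assert f("banana") == ??
    # Input prediction: given f and the output, supply a satisfying input.
    #     assert f(??) == "bnbnbb"
    # Answers are checked by execution, so any semantically correct
    # completion passes, e.g.:
    assert f("banana") == "bnbnbb"

Because scoring is execution-based, input prediction can admit many correct answers (any input mapping to the given output), whereas output prediction has a unique correct value for a deterministic function.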

Authors (6)
  1. Alex Gu (20 papers)
  2. Baptiste Rozière (99 papers)
  3. Hugh Leather (23 papers)
  4. Armando Solar-Lezama (65 papers)
  5. Gabriel Synnaeve (97 papers)
  6. Sida I. Wang (20 papers)
Citations (43)