Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code (2312.14856v2)

Published 22 Dec 2023 in cs.SE and cs.AI

Abstract: We present a method for systematically evaluating the correctness and robustness of instruction-tuned LLMs for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighbourhood of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including anomalies where the LLM correctly solves almost all questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting robustness issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
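To make the methodology concrete, the sketch below illustrates the core idea of a parameterised question template, its neighbourhood of instantiations, and a test oracle, in Python. This is an illustrative approximation only, not code from the Turbulence benchmark: the QuestionTemplate class, the divisor-counting problem, and the ask_llm callback are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class QuestionTemplate:
    """A programming problem parameterised so it can be asked in many forms."""
    prompt_template: str                     # natural-language prompt with a {n} hole
    parameters: Iterable[int]                # instantiations forming the neighbourhood
    oracle: Callable[[Callable, int], bool]  # judges one candidate solution


def divisor_oracle(candidate: Callable[[], int], n: int) -> bool:
    """Reference check: does the candidate return the number of positive divisors of n?"""
    expected = sum(1 for d in range(1, n + 1) if n % d == 0)
    try:
        return candidate() == expected
    except Exception:
        return False


# One hypothetical template; each parameter value yields a very similar question.
template = QuestionTemplate(
    prompt_template="Write a Python function f() that returns the number of positive divisors of {n}.",
    parameters=range(2, 52),
    oracle=divisor_oracle,
)


def evaluate(template: QuestionTemplate, ask_llm: Callable[[str], Callable]) -> Dict[int, bool]:
    """Ask the LLM one question per parameter and record which neighbours it solves.

    A neighbourhood that is almost, but not entirely, solved corresponds to the
    paper's notion of an anomaly: the model fails to generalise across very
    similar questions.
    """
    results = {}
    for n in template.parameters:
        # `ask_llm` is a hypothetical callback that sends the prompt to a model
        # and returns the generated code as a callable.
        solution = ask_llm(template.prompt_template.format(n=n))
        results[n] = template.oracle(solution, n)
    return results
```

A real harness would additionally have to extract code from the model's free-form response and execute it safely before the oracle can judge it; automating that end-to-end loop is part of what the benchmark provides.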

Authors (3)
  1. Shahin Honarvar (2 papers)
  2. Mark van der Wilk (61 papers)
  3. Alastair Donaldson (4 papers)
Citations (4)