Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code (2312.14856v2)
Abstract: We present a method for systematically evaluating the correctness and robustness of instruction-tuned LLMs for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language question templates, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated test oracle that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a neighbourhood of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including anomalies where the LLM correctly solves almost all questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting robustness issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
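To make the question-template / test-oracle / neighbourhood idea concrete, here is a minimal Python sketch of what a parameterised template and its oracle could look like. All names (`QuestionTemplate`, `sum_of_multiples_oracle`, the example problem and parameter values) are illustrative assumptions for this note, not the actual Turbulence implementation.

```python
# Hypothetical sketch of a Turbulence-style question template and test oracle.
# Names and structure are illustrative assumptions, not the benchmark's real code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuestionTemplate:
    """A parameterised programming problem with an associated test oracle."""
    prompt_template: str                 # question text with a {k} placeholder
    parameter_values: List[int]          # instantiations forming the "neighbourhood"
    oracle: Callable[[Callable[[int], int], int], bool]  # judges a returned function for one k

    def neighbourhood(self) -> List[str]:
        """All concrete questions generated from this template."""
        return [self.prompt_template.format(k=k) for k in self.parameter_values]


def sum_of_multiples_oracle(candidate: Callable[[int], int], k: int) -> bool:
    """Compare the candidate against a reference solution on a range of inputs."""
    reference = lambda n: sum(k * i for i in range(1, n + 1))
    return all(candidate(n) == reference(n) for n in range(20))


template = QuestionTemplate(
    prompt_template=(
        "Write a Python function f(n) returning the sum of the first n multiples of {k}."
    ),
    parameter_values=[2, 3, 7, 11],
    oracle=sum_of_multiples_oracle,
)

if __name__ == "__main__":
    for question, k in zip(template.neighbourhood(), template.parameter_values):
        # In the real workflow the question would be sent to an LLM and the returned
        # code executed; here a hand-written correct answer stands in to exercise the oracle.
        llm_answer = lambda n, k=k: k * n * (n + 1) // 2
        print(f"k={k}: oracle verdict = {template.oracle(llm_answer, k)}")
```

Running the oracle over every instantiation in the neighbourhood is what lets anomalies surface: an LLM that answers correctly for k=2, 3 and 7 but not for k=11 would be flagged as failing to generalise across the neighbourhood.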
Authors: Shahin Honarvar, Mark van der Wilk, Alastair Donaldson