Comparing large language models and human programmers for generating programming code (2403.00894v2)
Abstract: We systematically evaluated the performance of seven LLMs in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other LLMs, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development.
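Evaluations like the one summarized above typically score each generated solution for functional correctness against held-out test cases and measure its runtime. The sketch below shows one minimal way such a check could look; the task (a two-sum problem), the candidate code, and the `evaluate` helper are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: scoring a generated solution for functional correctness
# and runtime, in the spirit of code-generation benchmarks.
# The candidate code and test cases below are hypothetical examples.
import time

candidate_code = """
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
"""

test_cases = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
]

def evaluate(code, cases):
    namespace = {}
    exec(code, namespace)  # run the generated code in an isolated namespace
    solve = namespace["two_sum"]
    start = time.perf_counter()
    passed = all(solve(*args) == expected for args, expected in cases)
    elapsed = time.perf_counter() - start
    return passed, elapsed

passed, elapsed = evaluate(candidate_code, test_cases)
print(passed)   # whether every test case matched
print(elapsed)  # wall-clock time, a crude proxy for computational efficiency
```

In practice, benchmark harnesses sandbox the `exec` call and average runtime over many inputs; this sketch omits those safeguards for brevity.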
- Wenpin Hou
- Zhicheng Ji