Comparing large language models and human programmers for generating programming code (2403.00894v2)

Published 1 Mar 2024 in cs.SE, cs.AI, cs.CL, and cs.PL

Abstract: We systematically evaluated the performance of seven LLMs in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other LLMs, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development.
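
The abstract describes an evaluation setup in which each model is prompted on a coding problem under several prompt strategies and its generated code is checked against the problem's tests. The paper's actual pipeline is not reproduced here; the following is a minimal Python sketch of that kind of harness. The prompt templates, the `query_llm` stub, and the convention that solutions expose a `solve` function are assumptions made for illustration only, not the authors' protocol.

```python
# Minimal sketch (illustrative, not the authors' code) of comparing prompt
# strategies for LLM code generation and scoring outputs against unit tests.

PROMPT_STRATEGIES = {
    # Direct request for a solution.
    "direct": "Solve this problem in Python:\n{problem}",
    # Chain-of-thought style prompt (cf. Wei et al., 2022).
    "chain_of_thought": (
        "Solve this problem in Python. Think step by step, then give the "
        "final code as a function named solve:\n{problem}"
    ),
}

def query_llm(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation (assumption)."""
    raise NotImplementedError("plug in the model API of choice here")

def run_tests(code: str, tests: list[tuple[tuple, object]]) -> bool:
    """Execute candidate code and check solve() against (args, expected) pairs.

    Uses exec() for brevity; real harnesses should sandbox untrusted code.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)            # define the generated solution
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False                     # any crash or wrong answer counts as a failure

def evaluate(problem: str, tests: list[tuple[tuple, object]]) -> dict[str, bool]:
    """Return pass/fail for each prompt strategy on a single problem."""
    results = {}
    for name, template in PROMPT_STRATEGIES.items():
        code = query_llm(template.format(problem=problem))
        results[name] = run_tests(code, tests)
    return results
```

In practice, `query_llm` would be replaced by a real model API call, and the loop would be run over a benchmark of problems (as the study does with LeetCode and GeeksforGeeks contests), aggregating pass rates per model and per prompt strategy.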

Authors (2)
  1. Wenpin Hou (1 paper)
  2. Zhicheng Ji (2 papers)