Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' (2410.21647v3)

Published 29 Oct 2024 in cs.SE and cs.CL

Abstract: LLMs have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. Thus, a natural question is whether LLMs achieve code completion performance comparable to that of human developers. Unfortunately, one cannot answer this question using existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use poor code correctness metrics, providing misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, with more than 58% of them requiring file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none of the models achieve more than 30% pass@1 on REPOCOD, indicating the necessity of building stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD

Overview of "Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'"

The research paper "Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'" examines how well LLMs can perform code generation for software development tasks at the level expected of human programmers. Although current LLMs achieve high accuracy on existing code generation benchmarks such as HumanEval and MBPP, shortcomings in those benchmarks and their evaluation techniques mask the models' limited performance on real-world programming tasks.

Shortcomings of Current Benchmarks

The authors argue that existing benchmarks fail to capture the intricacies of actual software development. They typically involve manually crafted or overly simplistic scenarios that do not reflect real-world projects, which demand multi-file, repository-level context and highly complex functionality. A further problem is the reliance on evaluation metrics based on similarity or exact matching, which leads to misleading conclusions about code correctness, as the small example below illustrates.
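As a small, self-contained illustration (not taken from the paper), the snippet below shows why text-matching metrics can mislead: a candidate that differs textually from the reference solution may still be functionally correct, which only test execution reveals.

# Toy reference solution and a textually different but equivalent candidate.
reference = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))"
candidate = (
    "def clamp(x, lo, hi):\n"
    "    if x < lo:\n"
    "        return lo\n"
    "    return hi if x > hi else x"
)

# Exact match rejects the functionally correct candidate.
print(reference == candidate)  # False

# Test-based evaluation accepts it.
namespace = {}
exec(candidate, namespace)
clamp = namespace["clamp"]
assert clamp(5, 0, 10) == 5
assert clamp(-1, 0, 10) == 0
assert clamp(99, 0, 10) == 10
print("all tests passed")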

Introducing REPOCOD, a Robust Benchmark

To counter these limitations, the authors present REPOCOD, a benchmark of 980 problems sourced from 11 widely used software repositories. REPOCOD stands out for its complexity and realism: more than 58% of its problems require file-level or repository-level context, its average canonical solution length is 331.6 tokens, and its average cyclomatic complexity is 9.00. This complexity is paired with rigorous correctness evaluation through an average of 313.5 developer-written test cases per task, so that the correctness and reliability of LLM-generated code are assessed by execution rather than by textual similarity.
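For readers unfamiliar with cyclomatic complexity (McCabe's count of linearly independent paths through a function), the minimal sketch below measures it for a toy function using the third-party radon package; the paper does not specify which tool it used, so radon and the example function are assumptions for illustration only.

# Minimal sketch, assuming `radon` is installed (pip install radon).
from radon.complexity import cc_visit

source = '''
def resolve(symbol, scopes):
    for scope in scopes:
        if symbol in scope:
            return scope[symbol]
        elif symbol.upper() in scope:
            return scope[symbol.upper()]
    raise KeyError(symbol)
'''

for block in cc_visit(source):
    # Each block reports its cyclomatic complexity, i.e. the number of
    # linearly independent paths through the function.
    print(f"{block.name}: complexity {block.complexity}")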

Evaluations and Findings

Evaluating ten contemporary LLMs on REPOCOD, the paper finds that none exceeds a pass@1 rate of 30%, highlighting how far current models remain from handling the software development tasks found in real-world projects. The authors note that although LLMs can generate usable functions when given sufficient context, their effectiveness drops sharply as problem complexity grows, especially on tasks that require project-level understanding.
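For context, pass@1 here follows the standard pass@k metric popularized by HumanEval; the sketch below implements the usual unbiased estimator, where n samples are drawn per task and c of them pass all developer-written tests. The variable names and the sampling setup are illustrative, not the paper's exact evaluation code.

# Unbiased pass@k estimate (numerically stable product form).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 generations for one task, 2 pass the project's test suite.
print(pass_at_k(n=10, c=2, k=1))  # 0.2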

Implications for LLM Development

This paper underscores the need to advance LLM architectures and training methodologies so that they can handle project-level code dependencies. Stronger retrieval, such as the dense vector retrieval strategies examined in the paper, may help bridge the performance gap, and further work on expanding and streamlining context processing in LLMs is warranted.
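As a hedged illustration of dense retrieval over a repository (the embedding model, the sentence-transformers API, and the toy functions are assumptions for this sketch, not the paper's exact setup), one can embed candidate functions and rank them against the target function's signature and docstring before building the code-generation prompt.

# Illustrative sketch only: rank repository functions by embedding
# similarity to the target function and keep the top hits as prompt context.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed general-purpose encoder

# Toy stand-ins for other functions found in the same repository.
repo_functions = [
    "def load_config(path): ...",
    "def parse_args(argv): ...",
    "def connect_db(url, timeout=30): ...",
]

# The target function's signature and docstring act as the query.
query = 'def get_connection(settings):\n    """Open a database connection from settings."""'

query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(repo_functions, convert_to_tensor=True)

# Cosine-similarity search; the top-k functions become extra context.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(repo_functions[hit["corpus_id"]], round(hit["score"], 3))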

Future Directions

REPOCOD provides a valuable tool for future research on expanding the capabilities of LLMs beyond their current limits. It offers fertile ground for developing models that can meaningfully interact with complex, interdependent software environments. Improvements in retrieval techniques and better handling of real-world code dependencies could yield significant gains in LLM performance.

Conclusion

The paper contributes an essential benchmark that both challenges existing models and propels the evolution of LLM capabilities within software development. It provides a comprehensive basis for future research and model development, pushing forward the boundaries of what is achievable with automated programming tools. While current LLMs are far from replacing human programmers, revisiting their architecture and training in light of the insights drawn from REPOCOD may yield more capable code-generation agents in the future.

Authors (4)
  1. Shanchao Liang (6 papers)
  2. Yiran Hu (16 papers)
  3. Nan Jiang (210 papers)
  4. Lin Tan (25 papers)