CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation (2311.08588v3)

Published 14 Nov 2023 in cs.CL, cs.AI, and cs.SE

Abstract: LLMs have demonstrated remarkable performance on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities of LLMs suffer from severe limitations. First, most benchmarks are insufficient as they focus on a narrow range of popular programming languages and specific tasks, whereas real-world software development scenarios show a critical need to implement systems with multilingual and multitask programming environments to satisfy diverse requirements. Second, most benchmarks fail to consider the actual executability and the consistency of execution results of the generated code. To bridge these gaps between existing benchmarks and expectations from practical applications, we introduce CodeScope, an execution-based, multilingual, multitask, multidimensional evaluation benchmark for comprehensively measuring LLM capabilities on coding tasks. CodeScope covers 43 programming languages and eight coding tasks. It evaluates the coding performance of LLMs from three dimensions (perspectives): length, difficulty, and efficiency. To facilitate execution-based evaluations of code generation, we develop MultiCodeEngine, an automated code execution engine that supports 14 programming languages. Finally, we systematically evaluate and analyze eight mainstream LLMs and demonstrate the superior breadth and challenges of CodeScope for evaluating LLMs on code understanding and generation tasks compared to other benchmarks. The CodeScope benchmark and code are publicly available at https://github.com/WeixiangYAN/CodeScope.

Evaluation of Code Understanding and Generation by LLMs: Insights from the CodeScope Benchmark

LLMs have increasingly demonstrated their utility in automating aspects of software development, such as code generation and understanding. However, existing benchmarks are limited in scope concerning programming languages, tasks, and practical evaluation based on executable code. The paper "CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation" addresses these limitations by introducing a comprehensive benchmarking suite designed to rigorously evaluate LLMs on syntactic and semantic code challenges.

Key Features of CodeScope

CodeScope spans 43 programming languages and eight tasks, divided between code understanding and code generation. Rather than scoring generated code on surface similarity alone, the benchmark executes it, verifying that it runs correctly in practical scenarios. Results are further analyzed along three dimensions: length, difficulty, and efficiency.
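
To make the dimension-wise reporting concrete, the following sketch shows one way results could be sliced by difficulty; the field names and records are hypothetical illustrations rather than the benchmark's actual schema, and the same grouping pattern would apply to length and efficiency.

```python
from collections import defaultdict

# Hypothetical per-problem records: a difficulty label plus whether the
# model's generated code passed execution-based testing.
results = [
    {"difficulty": "easy", "passed": True},
    {"difficulty": "easy", "passed": True},
    {"difficulty": "hard", "passed": False},
    {"difficulty": "hard", "passed": True},
]

def pass_rate_by(records, key):
    """Group execution outcomes by one dimension (e.g. difficulty) and
    report the fraction of problems whose generated code passed."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record["passed"])
    return {label: sum(passed) / len(passed) for label, passed in buckets.items()}

print(pass_rate_by(results, "difficulty"))  # {'easy': 1.0, 'hard': 0.5}
```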

  1. Multilingual and Multitask Suite: CodeScope covers languages ranging from Python to Delphi, spanning a variety of programming paradigms. This multilingual breadth challenges LLMs to generalize across structural and syntactic differences, while tasks such as code summarization, translation, and repair require understanding that goes beyond simple syntactic mapping.
  2. Execution-based Evaluation: Earlier benchmarks relied heavily on n-gram metrics such as BLEU, which assess only surface-level similarity. CodeScope instead applies execution-based metrics supported by MultiCodeEngine, judging generated code by its functional correctness and execution efficiency (a simplified sketch of this kind of check appears after this list).
  3. Baseline and Dimension Analysis: The paper benchmarks mainstream LLMs such as GPT-4, LLaMA, and StarCoder across the tasks under different contextual and task settings. Models like WizardCoder excel on complex code structures, showing strength in handling intricate logical constructs, whereas models such as GPT-3.5 perform well on easier problems but struggle with harder, real-world-inspired scenarios.
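
The contrast in item 2 between surface-level metrics and execution can be made concrete. The sketch below is a minimal, hypothetical illustration of execution-based checking rather than the paper's MultiCodeEngine: a generated Python program is run in a subprocess against input/output test cases and passes only if every case produces the expected output within a time limit.

```python
import os
import subprocess
import tempfile

def passes_test_cases(candidate_code, test_cases, timeout=5.0):
    """Return True if the candidate program maps every stdin input to the
    expected stdout output. A minimal stand-in for an execution engine; a
    real harness adds sandboxing, resource limits, and multi-language support."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(candidate_code)
        path = handle.name
    try:
        for stdin_text, expected_stdout in test_cases:
            try:
                result = subprocess.run(
                    ["python", path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected_stdout.strip():
                return False
        return True
    finally:
        os.remove(path)

# Example: a model-generated solution that doubles an integer read from stdin.
generated = "n = int(input())\nprint(n * 2)\n"
print(passes_test_cases(generated, [("3", "6"), ("10", "20")]))  # True
```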

Implications and Future Directions

The implications of CodeScope span both practical and theoretical realms. Practically, the benchmark results clarify the current competencies and blind spots of LLMs when processing code across languages and paradigms, informing future architectural improvements and training methodologies. Theoretically, CodeScope lays a foundation for evaluation strategies centered on execution, promoting multidimensional competency assessments that better reflect real-world software engineering challenges.

The paper concludes by suggesting potential future developments in AI and LLMs concerning code understanding and generation. These include extending the datasets to cover additional programming languages and paradigms, enhancing execution environments to simulate more complex application scenarios, and refining evaluation metrics to account for nuanced programming attributes such as optimization and maintainability.

Conclusion

Incorporating comprehensive benchmarks like CodeScope enriches the evaluation landscape for LLMs, driving advancements in their application to coding tasks. This benchmark is poised to inspire continued enhancements in the design and training of LLMs, steering them closer to fulfilling the multifaceted demands of real-world software development domains.

Authors (11)
  1. Weixiang Yan
  2. Haitian Liu
  3. Yunkun Wang
  4. Yunzhe Li
  5. Qian Chen
  6. Wen Wang
  7. Tingyu Lin
  8. Weishan Zhao
  9. Li Zhu
  10. Shuiguang Deng
  11. Hari Sundaram