OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models (2401.06628v2)

Published 12 Jan 2024 in cs.CL

Abstract: Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading LLMs, including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) Despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) The poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.

Overview of the Object-Oriented Programming Benchmark

LLMs have achieved considerable success across many domains, notably in automated code generation. Their evaluation, however, has predominantly centered on functional programming tasks, leaving a significant gap in the assessment of their object-oriented programming (OOP) capabilities. Recognizing this, the paper introduces an OOP benchmark for Python that gauges LLM performance on class-based design, encapsulation, and other core components of OOP.

Evolving Assessment Metrics

Traditional metrics such as pass@k have limitations for OOP evaluation because they may overlook whether a model actually implements the intended OOP constructs. To address this, the paper proposes pass@o, a metric that compares the key OOP concepts expressed in the model's output against those required by the benchmark task, enabling a more rigorous and targeted assessment.
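
As a rough illustration of how a pass@o-style score could be computed, the sketch below combines ordinary test execution with a check that the generated Python code actually uses the expected OOP constructs (detected via the standard ast module). The required_concepts annotation and passes_tests callback are hypothetical conveniences for this sketch, not the paper's exact formulation of pass@o.

import ast

def uses_required_concepts(code: str, required_concepts: set[str]) -> bool:
    """Return True if the generated code expresses the expected OOP constructs.

    required_concepts is a hypothetical per-task annotation, e.g.
    {"class", "method", "inheritance"}; the real benchmark may encode
    this information differently.
    """
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.ClassDef):
            found.add("class")
            if node.bases:  # the class subclasses another class
                found.add("inheritance")
            if any(isinstance(item, ast.FunctionDef) for item in node.body):
                found.add("method")
    return required_concepts <= found

def pass_at_o(samples: list[str], required_concepts: set[str], passes_tests) -> float:
    """Fraction of sampled programs that both pass the task's tests and
    use the required OOP concepts -- a sketch, not the official metric."""
    ok = [s for s in samples
          if passes_tests(s) and uses_required_concepts(s, required_concepts)]
    return len(ok) / max(len(samples), 1)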

Insights from LLM Evaluations

An extensive evaluation of 23 leading LLMs on the new benchmark with the pass@o metric yielded three primary findings. First, models that excel at functional programming often lag on more complex OOP tasks. Second, code-specialized models such as WizardCoder did not outperform more general models like ChatGPT on OOP capabilities. Finally, the uniformly modest performance across all models points to clear room for improvement in LLMs' understanding and execution of OOP principles.

Characteristics of the OOP Benchmark

The benchmark comprises 431 Python programs exercising a broad range of OOP principles, meticulously selected and adapted to provide a challenging yet fair evaluation context for LLMs. Tasks span several difficulty levels: simple tasks test knowledge of classes and public methods, while higher tiers progressively introduce advanced concepts such as inheritance and polymorphism.
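
To make the difficulty tiers concrete, a task at the harder end might resemble the inheritance-and-polymorphism sketch below. This is purely illustrative and is not a program drawn from the benchmark itself.

# Illustrative inheritance/polymorphism-style task, not taken from the benchmark.

class Shape:
    """Base class: subclasses must override area()."""
    def area(self) -> float:
        raise NotImplementedError

class Rectangle(Shape):
    def __init__(self, width: float, height: float):
        self.width = width
        self.height = height

    def area(self) -> float:  # overrides the base method
        return self.width * self.height

class Circle(Shape):
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return 3.141592653589793 * self.radius ** 2

def total_area(shapes: list[Shape]) -> float:
    """Polymorphic dispatch: each shape supplies its own area()."""
    return sum(s.area() for s in shapes)

assert abs(total_area([Rectangle(2, 3), Circle(1)]) - 9.141592653589793) < 1e-9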

Implications and Future Directions

The research exposes a gap in the capabilities of current LLMs: relative strength in functional programming alongside underdeveloped handling of object-oriented constructs. This points to a clear need to strengthen OOP coverage in LLM training. The paper also opens the door to further work on assessment benchmarks for other programming paradigms and for additional programming languages.

References (42)
  1. Santacoder: don’t reach for the stars! arXiv preprint.
  2. The falcon series of open language models. arXiv preprint.
  3. Program synthesis with large language models. arXiv preprint.
  4. Qwen technical report. arXiv preprint.
  5. Multipl-e: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering.
  6. Evaluating large language models trained on code. arXiv preprint.
  7. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint.
  8. Codescore: Evaluating code generation by learning code execution. arXiv preprint.
  9. Codeapex: A bilingual programming evaluation benchmark for large language models. arXiv preprint.
  10. Measuring coding challenge competence with apps. arXiv preprint.
  11. Spoc: Search-based pseudocode to code. In NeurIPS.
  12. Efficient memory management for large language model serving with pagedattention. In SOSP.
  13. Starcoder: may the source be with you! arXiv preprint.
  14. Competition-level code generation with alphacode. Science.
  15. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In ACL.
  16. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint.
  17. Error analysis prompting enables human-like translation evaluation in large language models: A case study on chatgpt. arXiv preprint.
  18. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint.
  19. Adaptive machine translation with large language models. In EAMT.
  20. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint.
  21. OpenAI. 2023. Gpt-4 technical report. arXiv preprint.
  22. Training language models to follow instructions with human feedback. In NeurIPS.
  23. Bleu: a method for automatic evaluation of machine translation. In ACL.
  24. Towards making the most of chatgpt for machine translation. arXiv preprint.
  25. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint.
  26. Code llama: Open foundation models for code. arXiv preprint.
  27. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint.
  28. Mark Stefik and Daniel G Bobrow. 1985. Object-oriented programming: Themes and variations. AI magazine.
  29. Bjarne Stroustrup. 1988. What is object-oriented programming? IEEE software.
  30. InternLM Team. 2023a. Internlm: A multilingual language model with progressively enhanced capabilities.
  31. MosaicML NLP Team. 2023b. Introducing mpt-7b: A new standard for open-source, commercially usable llms. Accessed: 2023-05-05.
  32. Prompt-to-os (p2os): Revolutionizing operating systems and human-computer interaction with integrated ai generative models. arXiv preprint.
  33. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint.
  34. Attention is all you need. In NeurIPS.
  35. Execution-based evaluation for open-domain code generation. arXiv preprint.
  36. Peter Wegner. 1990. Concepts and paradigms of object-oriented programming. ACM Sigplan Oops Messenger.
  37. Emergent abilities of large language models. arXiv preprint.
  38. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint.
  39. Cert: Continual pre-training on sketches for library-oriented code generation. In IJCAI.
  40. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint.
  41. Can chatgpt replace stackoverflow? a study on robustness and reliability of large language model code generation. arXiv preprint.
  42. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. arXiv preprint.
Authors (6)
  1. Shuai Wang (466 papers)
  2. Liang Ding (158 papers)
  3. Li Shen (362 papers)
  4. Yong Luo (117 papers)
  5. Bo Du (263 papers)
  6. Dacheng Tao (826 papers)