
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models (2404.07940v3)

Published 11 Mar 2024 in cs.SE and cs.LG

Abstract: LLMs for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which span beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, the first large-scale freeform question-answering (QA) benchmark for code to our knowledge, comprising 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness where domain experts carefully concretize the criterion for each question. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source at https://infi-coder.github.io/infibench and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.

Systematic Evaluation of Code LLMs: Insights from the InfiBench Benchmark

The continued evolution of LLMs for programming has driven significant advances in software development, improving their capacity to comprehend and generate code. Existing benchmarks such as HumanEval and MBPP focus on code generation for specific programming tasks, but they do not adequately cover the broader question-answering (QA) abilities that real-world coding scenarios demand. To fill this gap, the authors propose InfiBench, a freeform QA benchmark tailored to evaluating code LLMs.

The cornerstone of InfiBench's evaluation framework is an assessment scope broader than that of traditional benchmarks. It comprises 234 curated Stack Overflow questions covering 15 programming languages and diverse domains such as front-end, back-end, data science and machine learning, mobile and desktop, and IT operations. The questions are selected to reflect actual developer inquiries, providing a more realistic gauge of a model's capabilities.
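
The paper defines its own case format for these questions; purely as an illustration (the field names and values below are assumptions, not the benchmark's actual schema), a single InfiBench-style entry can be pictured as a small structured record:

```python
# Hypothetical sketch of what one benchmark entry might carry; the field
# names and values are illustrative assumptions, not InfiBench's real schema.
example_question = {
    "id": "so-python-0001",
    "language": "Python",
    "domain": "data science and machine learning",
    "prompt": "Why does my pandas groupby drop the grouping column?",
    "evaluation": {
        "metric": "keyword matching",          # one of the four model-free metrics
        "keywords": {"as_index=False": 1.0},   # expert-specified scoring criteria
    },
}
```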

Benchmark Construction Process

InfiBench employs a meticulous selection methodology to ensure diversity and quality. From a dump of Stack Overflow entries, only questions with at least three positively voted answers and an accepted solution were retained. This initial pool of over a million questions was then narrowed, based on factors such as view frequency and relevance, to the final curated set of 234 questions, as sketched below.
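
The construction pipeline itself is not reproduced in this summary; the following is a minimal sketch of the kind of filtering described above, assuming a tabular Stack Overflow dump with hypothetical field names:

```python
# Hypothetical sketch of the filtering criteria described above. The field
# names (view_count, accepted_answer_id, answer_scores) are assumptions,
# not the paper's actual data schema.
from dataclasses import dataclass, field


@dataclass
class Question:
    title: str
    view_count: int
    accepted_answer_id: int | None
    answer_scores: list[int] = field(default_factory=list)


def passes_quality_filter(q: Question, min_upvoted_answers: int = 3) -> bool:
    """Keep questions with an accepted answer and at least three positively voted answers."""
    upvoted = sum(1 for s in q.answer_scores if s > 0)
    return q.accepted_answer_id is not None and upvoted >= min_upvoted_answers


def rank_for_curation(questions: list[Question], top_k: int) -> list[Question]:
    """Order the surviving candidates by view count as a rough proxy for
    relevance, leaving the final selection to human curators."""
    candidates = [q for q in questions if passes_quality_filter(q)]
    return sorted(candidates, key=lambda q: q.view_count, reverse=True)[:top_k]
```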

To evaluate LLM responses, InfiBench employs four model-free evaluation metrics: keyword matching, blank filling, unit testing, and dialogue similarity. Together these metrics let the benchmark assess models across a spectrum of tasks, from straightforward code interpretation to complex QA interactions.
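
The exact scoring rules are concretized per question by domain experts; the sketch below only illustrates what a keyword-matching metric could look like, with the weighting scheme and regular-expression keywords being assumptions made for illustration:

```python
import re

# Hypothetical keyword-matching scorer: each question carries a set of
# expert-specified keywords (here, weighted regular expressions), and the
# response score is the weighted fraction of keywords it contains.
# The weighting scheme is an assumption, not the paper's exact rubric.

def keyword_match_score(response: str, keywords: dict[str, float]) -> float:
    """Return a score in [0, 1]: the weighted fraction of required keywords
    (treated as case-insensitive regular expressions) found in the response."""
    total = sum(keywords.values())
    if total == 0:
        return 0.0
    hit = sum(weight for pattern, weight in keywords.items()
              if re.search(pattern, response, flags=re.IGNORECASE))
    return hit / total


# Example: a question whose reference answer should mention functools.lru_cache
# and the idea of memoization.
score = keyword_match_score(
    "You can memoize the function with functools.lru_cache.",
    {r"functools\.lru_cache": 1.0, r"memoiz": 0.5},
)
print(f"{score:.2f}")  # 1.00
```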

Evaluation Findings and Highlights

InfiBench's comprehensive evaluation of more than 100 code LLMs yields several insightful findings:

  • Performance Disparities: GPT-4 leads with a score of 70.64%, yet even this advanced model is far from flawless on the diverse and challenging QA tasks that InfiBench poses.
  • The Efficacy of Instruction Fine-tuning: The analysis underscores the gains from instruction fine-tuning, with models like deepseek-coder-33b-instruct substantially closing the gap between base LLMs and models fine-tuned for specific tasks.
  • Scaling Laws and Model Size: The data suggest that beyond roughly 50 billion parameters, per-parameter improvements in performance become less pronounced. This challenges the reading of scaling laws as "bigger is always better" and indicates that, past a certain size, data quality and fine-tuning play a more pivotal role.
  • Future Predictions: Extrapolating the observed scaling trend suggests that reaching GPT-4-level performance with open-source models may require models exceeding 70B parameters that are specifically fine-tuned for coding tasks (a toy version of this kind of extrapolation is sketched after this list).
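
Neither the fitted curve nor the underlying data points are reproduced in this summary; the sketch below only illustrates the log-linear extrapolation such a prediction implies, with the coefficients and data points entirely made up:

```python
import numpy as np

# Illustrative only: fit a score ~ a * log10(params) + b trend on made-up
# (parameter count, benchmark score) pairs and ask how large a model would
# need to be to reach a GPT-4-level score of 70.64%. The data points and the
# log-linear form are assumptions, not the paper's actual fit.
params_b = np.array([7, 13, 34, 70])          # model sizes in billions of parameters
scores = np.array([35.0, 42.0, 50.0, 56.0])   # fictional benchmark scores (%)

a, b = np.polyfit(np.log10(params_b), scores, deg=1)
target = 70.64
required_b = 10 ** ((target - b) / a)
print(f"Extrapolated size to reach {target}%: ~{required_b:.0f}B parameters")
```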

Implications and Future Directions

InfiBench sets a precedent for future QA benchmarks by grounding evaluation in practical coding scenarios drawn from real-world developer interactions. This approach encourages training practices that prioritize data quality and diversity over parameter scale alone. As models grow and the community continues to contribute open-source variants, InfiBench's framework serves as an instrument for holistic evaluation that can drive improvements in both proprietary and open-source models.

Because InfiBench is fully open source, the benchmark can evolve continuously with community input, supporting a sustainable ecosystem for code LLM evaluation. Researchers are encouraged to use InfiBench when developing more robust, flexible, and capable code LLMs that can handle the nuanced, human-centric task of providing precise and contextually relevant programming assistance.

Overall, InfiBench reframes how the real-world usability of code LLMs is evaluated by calibrating its metrics to the complex, evolving needs of software developers, laying a foundation for subsequent evaluation and improvement efforts in this field.

References (28)
  1. Multi-lingual evaluation of code generation models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Bo7eeXm6An8.
  2. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  3. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  6. DeepSeek AI. DeepSeek Coder: Let the code write itself. https://deepseekcoder.github.io/, 2023.
  7. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  8. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533, 2023.
  9. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  10. Github. Github copilot - your ai pair programmer. https://github.com/features/copilot, 2023.
  11. Measuring coding challenge competence with apps. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  12. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620, 2023.
  13. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp. 18319–18345. PMLR, 2023.
  14. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023.
  15. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
  16. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004.
  17. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.
  18. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023b.
  19. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
  20. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.
  21. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
  22. OpenAI. GPT-4 technical report. OpenAI, 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.
  23. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  24. StackExchange. All sites — stackexchange. 2024. URL https://stackexchange.com/sites?view=list#users.
  25. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  26. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  27. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  28. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. In KDD, 2023.
Authors (10)
  1. Linyi Li (41 papers)
  2. Shijie Geng (33 papers)
  3. Zhenwen Li (6 papers)
  4. Yibo He (1 paper)
  5. Hao Yu (195 papers)
  6. Ziyue Hua (15 papers)
  7. Guanghan Ning (14 papers)
  8. Siwei Wang (72 papers)
  9. Tao Xie (117 papers)
  10. Hongxia Yang (130 papers)