PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs (2401.03855v4)

Published 8 Jan 2024 in cs.CL and cs.AI

Abstract: Driven by the surge in code generation using LLMs, numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks, potentially inflating model performance estimations. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs.
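The abstract does not spell out the scoring protocol, but benchmarks like HumanEval and MBPP are conventionally scored with the pass@k metric, where a model generates n samples per prompt and c of them pass the prompt's unit tests. A minimal sketch of the standard unbiased estimator (from the original HumanEval paper) is shown below; the numbers in the usage example are illustrative, not results from PythonSaga.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for code-generation benchmarks.

    n: total samples generated per prompt
    c: samples that pass all unit tests for that prompt
    k: evaluation budget being reported (e.g., pass@1, pass@10)
    """
    if n - c < k:
        # Fewer failing samples than k, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative usage: 200 samples per prompt, 37 passing, report pass@10
print(round(pass_at_k(n=200, c=37, k=10), 4))
```

A benchmark's pass@k is then the mean of this quantity over all prompts, which is how the poor performance of existing Code-LLMs on a harder, concept-balanced prompt set would show up in practice.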

Authors (3)
  1. Ankit Yadav (13 papers)
  2. Mayank Singh (92 papers)
  3. Himanshu Beniwal (9 papers)
Citations (1)