
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation (2405.11430v2)

Published 19 May 2024 in cs.CL

Abstract: Recent advancements in LLMs have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0% pass rate on HumanEval. However, this calls into question the adequacy of existing benchmarks for thoroughly assessing function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that these might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To resolve this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 26 LLMs using MHPP showed that many models performing well on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted previously undiscovered limitations within various LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. MHPP, the evaluation pipeline, and the leaderboard can be found at https://github.com/SparksofAGI/MHPP.
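As context for the pass-rate figures the abstract cites, the sketch below shows how function-level benchmarks such as HumanEval or MHPP are typically scored: each sampled completion is executed against hidden unit tests in an isolated process, and pass@1 is the fraction of problems whose completion passes. This is an illustrative assumption of a generic harness, not the MHPP repository's actual evaluation pipeline; the field names (`task_id`, `test`) and helper functions are hypothetical.

```python
# Minimal sketch of a pass@1 evaluation loop for a function-level Python
# benchmark. Illustrative only; not the official MHPP pipeline.
import multiprocessing


def run_candidate(candidate_code: str, test_code: str, result_queue) -> None:
    """Execute a candidate completion and its unit tests in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function
        exec(test_code, namespace)        # assertions raise on failure
        result_queue.put(True)
    except Exception:
        result_queue.put(False)


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Run the candidate in a separate process so infinite loops cannot hang the harness."""
    queue: multiprocessing.Queue = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=run_candidate, args=(candidate_code, test_code, queue)
    )
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return False
    return queue.get() if not queue.empty() else False


def pass_at_1(problems: list[dict], completions: dict[str, str]) -> float:
    """Fraction of problems whose single sampled completion passes all tests."""
    passed = sum(
        passes_tests(completions[p["task_id"]], p["test"])
        for p in problems
        if p["task_id"] in completions
    )
    return passed / len(problems)


if __name__ == "__main__":
    # Hypothetical toy problem illustrating the assumed fields.
    toy_problems = [{
        "task_id": "toy/0",
        "prompt": "def add(a, b):\n    ...",
        "test": "assert add(2, 3) == 5",
    }]
    toy_completions = {"toy/0": "def add(a, b):\n    return a + b"}
    print(f"pass@1 = {pass_at_1(toy_problems, toy_completions):.3f}")
```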

Authors (9)
  1. Jianbo Dai (6 papers)
  2. Jianqiao Lu (20 papers)
  3. Yunlong Feng (26 papers)
  4. Rongju Ruan (5 papers)
  5. Ming Cheng (69 papers)
  6. Haochen Tan (13 papers)
  7. Zhijiang Guo (55 papers)
  8. Dong Huang (102 papers)
  9. Guangtao Zeng (14 papers)
Citations (6)