MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios

Published 15 Jun 2025 in cs.SE and cs.AI | (2506.13824v1)

Abstract: Code debugging is a crucial task in software engineering, which attracts increasing attention. While remarkable success has been made in the era of LLMs, current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenario in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries, covering a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation of MLDebugging using both mainstream open-source and closed-source LLMs and highlight that current LLMs still struggle to correctly perform code debugging across multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenario and offer insights for future research.