PyBench: Evaluating LLM Agent on various real-world coding tasks (2407.16732v2)

Published 23 Jul 2024 in cs.SE and cs.AI

Abstract: An LLM agent equipped with a code interpreter can automatically solve real-world coding tasks such as data analysis and image editing. However, existing benchmarks focus either on simplistic tasks, such as completing a few lines of code, or on extremely complex, repository-level tasks; neither is representative of everyday coding work. To address this gap, we introduce PyBench, a benchmark covering five main categories of real-world tasks and more than 10 file types. Given a high-level user query and related files, the LLM agent must reason and execute Python code via a code interpreter over several turns before producing a final response that fulfills the user's requirements. Solving PyBench tasks demands a robust understanding of various Python packages, strong reasoning, and the ability to incorporate feedback from executed code. Our evaluations show that current open-source LLMs struggle with these tasks. We therefore conduct analyses and experiments on four kinds of datasets, showing that comprehensive abilities are required for PyBench. Our fine-tuned 8B model, PyLlama3, achieves strong performance on PyBench, surpassing many 33B and 70B models. Our benchmark, training dataset, and model are available at: https://github.com/Mercury7353/PyBench
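To make the evaluation protocol concrete, below is a minimal sketch of the reason-and-execute loop the abstract describes: the agent receives a query plus file paths, alternates between writing Python and observing its execution output for a few turns, and then answers. This is an illustrative assumption, not the actual PyBench harness; `call_llm` is a hypothetical placeholder for any chat-completion backend.

```python
import contextlib
import io


def call_llm(messages):
    """Hypothetical LLM call: returns either {'code': ...} or {'answer': ...}."""
    raise NotImplementedError("plug in your chat-completion backend here")


def run_python(code, namespace):
    """Execute a code snippet and capture stdout (or the error) as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # shared namespace persists state across turns
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue() or "(no output)"


def solve(user_query, file_paths, max_turns=5):
    """Multi-turn loop: the model writes code, execution feedback is appended, repeat."""
    messages = [{"role": "user",
                 "content": f"{user_query}\nAvailable files: {file_paths}"}]
    namespace = {}  # interpreter state shared across all turns
    for _ in range(max_turns):
        step = call_llm(messages)
        if "answer" in step:  # the model decides it has enough to respond
            return step["answer"]
        observation = run_python(step["code"], namespace)
        messages.append({"role": "assistant", "content": step["code"]})
        messages.append({"role": "user", "content": observation})
    return "No final answer produced within max_turns."
```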

Authors (4)
  1. Yaolun Zhang (3 papers)
  2. Yinxu Pan (6 papers)
  3. Yudong Wang (28 papers)
  4. Jie Cai (44 papers)
Citations (4)