Debugging with Open-Source Large Language Models: An Evaluation (2409.03031v1)

Published 4 Sep 2024 in cs.SE

Abstract: LLMs have shown strong potential for supporting software development tasks, which is why more and more developers turn to LLMs (e.g., ChatGPT) for help in fixing their buggy code. While this can save time and effort, many companies prohibit it due to strict code-sharing policies. To address this, companies can run open-source LLMs locally. However, there is so far little research evaluating the performance of open-source LLMs in debugging. This work is a preliminary evaluation of the capabilities of open-source LLMs in fixing buggy code. The evaluation covers five open-source LLMs and uses the DebugBench benchmark, which includes more than 4,000 buggy code instances written in Python, Java, and C++. The open-source LLMs achieved scores ranging from 43.9% to 66.6%, with DeepSeek-Coder achieving the best score for all three programming languages.
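To make the setup concrete, below is a minimal sketch of what a DebugBench-style evaluation loop could look like when run against a locally hosted open-source model. This is an illustration, not the authors' harness: the instance field names (buggy_code, language, test_cases), the prompt wording, and the generate_fix stub are assumptions, and the toy judge only checks Python programs by stdin/stdout comparison, whereas the paper's and DebugBench's actual evaluation setup may differ.

```python
# Hypothetical DebugBench-style evaluation loop (illustrative sketch, not the
# authors' harness). Field names and the prompt template are assumptions.
import subprocess
import sys
import tempfile
from pathlib import Path

PROMPT_TEMPLATE = (
    "The following {language} code contains a bug. "
    "Return only the corrected code.\n\n{buggy_code}"
)

def generate_fix(prompt: str) -> str:
    """Stub: query a locally hosted open-source LLM and return its repair."""
    raise NotImplementedError("Wire this to your local model server.")

def passes_tests(candidate_code: str, test_cases: list[dict]) -> bool:
    """Toy judge: run candidate Python code against stdin/stdout test cases."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code)
        path = f.name
    try:
        for case in test_cases:
            result = subprocess.run(
                [sys.executable, path],
                input=case["input"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            if result.stdout.strip() != case["expected_output"].strip():
                return False
        return True
    except subprocess.TimeoutExpired:
        return False
    finally:
        Path(path).unlink(missing_ok=True)

def evaluate(instances: list[dict]) -> float:
    """Return the fraction of buggy instances the model manages to repair."""
    fixed = sum(
        passes_tests(
            generate_fix(
                PROMPT_TEMPLATE.format(
                    language=inst["language"], buggy_code=inst["buggy_code"]
                )
            ),
            inst["test_cases"],
        )
        for inst in instances
    )
    return fixed / len(instances) if instances else 0.0
```

The reported scores (43.9% to 66.6%) correspond to the pass rate that a loop of this kind would compute over the benchmark's buggy instances.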

References (34)
  1. Beyond code generation: An observational study of chatgpt usage in software engineering practice. In Proc. ACM Softw. Eng., 2024.
  2. Large language models are few-shot testers: Exploring llm-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2312–2323. IEEE, 2023.
  3. Large language models in fault localisation. arXiv preprint arXiv:2308.15276, 2023.
  4. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023.
  5. Evaluating large language models trained on code, 2021.
  6. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36, 2024.
  7. Multi-lingual evaluation of code generation models. In The Eleventh International Conference on Learning Representations, 2022.
  8. Program synthesis with large language models, 2021.
  9. Code llama: Open foundation models for code, 2024.
  10. Phind. Phind-CodeLlama-34B-v2 · Hugging Face. URL https://huggingface.co/Phind/Phind-CodeLlama-34B-v2.
  11. Wizardcoder: Empowering code large language models with evol-instruct, 2023.
  12. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  13. Starcoder: may the source be with you!, 2023.
  14. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
  15. Meta AI. Meta-Llama-3-8B-Instruct · Hugging Face. URL https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
  16. Wilhelm Hasselbring. Benchmarking as empirical standard in software engineering research. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, pages 365–372, 2021.
  17. ACM SIGSOFT empirical standards. CoRR, abs/2010.03525, 2020.
  18. Debugbench: Evaluating debugging capability of large language models, 2024.
  19. A unified debugging approach via llm-based multi-agent synergy. arXiv preprint arXiv:2404.17153, 2024.
  20. A comprehensive study of the capabilities of large language models for vulnerability detection. arXiv preprint arXiv:2403.17218, 2024.
  21. Where is the bug and how is it fixed? an experiment with practitioners. In Proceedings of the 2017 11th joint meeting on foundations of software engineering, pages 117–128, 2017.
  22. Towards reasoning in large language models: A survey. In 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, pages 1049–1065. Association for Computational Linguistics (ACL), 2023.
  23. Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark. arXiv preprint arXiv:2310.18018, 2023.
  24. How much are llms contaminated? a comprehensive survey and the llmsanitize library. arXiv preprint arXiv:2404.00699, 2024.
  25. Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms. arXiv preprint arXiv:2402.03927, 2024.
  26. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
  27. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023.
  28. Concerned with data contamination? assessing countermeasures in code language model. arXiv preprint arXiv:2403.16898, 2024.
  29. Explainable automated debugging via large language model-driven scientific debugging. Proceedings of the 45th International Conference on Software Engineering, 2023.
  30. Prompting is all you need: Automated android bug replay with large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pages 1–13, 2024.
  31. Ldb: A large language model debugger via verifying runtime execution step-by-step. CoRR, February 2024.
  32. Panda: Performance debugging for databases using llm agents. Amazon Science, 2024.
  33. Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134, 2024.
  34. An analysis of the automatic bug fixing performance of chatgpt. In 2023 IEEE/ACM International Workshop on Automated Program Repair (APR), pages 23–30. IEEE, 2023.
Authors (2)
  1. Yacine Majdoub (2 papers)
  2. Eya Ben Charrada (2 papers)
Citations (1)