CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios (2403.19287v1)

Published 28 Mar 2024 in cs.SE

Abstract: In the evolving landscape of LLMs tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail to capture the multi-tasking nature of software development. To address this, we introduce CoderUJB, a new benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production. CoderUJB comprises 2,239 programming questions derived from 17 real open-source Java projects and spans five practical programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs, examining the effects of programming-language-specific continued pre-training and of instruction fine-tuning on their performance. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation (e.g., test generation and defect detection). Importantly, our results advise caution when applying programming-language-specific continued pre-training and instruction fine-tuning, as these techniques can hinder model performance on certain tasks, suggesting the need for more nuanced strategies. CoderUJB thus marks a significant step towards more realistic evaluation of programming capabilities in LLMs, and our study provides valuable insights for the future development of these models in software engineering.
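
To make the notion of an "executable" benchmark question concrete, here is a minimal sketch of what a functional-code-generation item could look like in practice: the model receives a method signature and doc comment drawn from a project and must produce the body, and the completion counts as correct only if the accompanying tests pass at runtime. All class, method, and test names below are hypothetical illustrations, not CoderUJB's actual schema.

import java.util.HashSet;
import java.util.Set;

// Hypothetical executable benchmark item (illustrative names, not taken
// from CoderUJB): correctness is decided by running tests against the
// model's completion, not by textual similarity to a reference solution.
public class ExecutableItemSketch {

    /** Returns the index of the first element that repeats an earlier one, or -1 if none. */
    static int firstDuplicateIndex(int[] values) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < values.length; i++) {
            if (!seen.add(values[i])) {
                return i; // values[i] already appeared at a smaller index
            }
        }
        return -1;
    }

    // Stand-in for the project's real unit-test harness: every check must
    // pass when executed for the completion to be scored as correct.
    public static void main(String[] args) {
        check(firstDuplicateIndex(new int[] {1, 2, 3, 2}) == 3);
        check(firstDuplicateIndex(new int[] {4, 5, 6}) == -1);
        System.out.println("All checks passed");
    }

    private static void check(boolean condition) {
        if (!condition) {
            throw new AssertionError("test failed");
        }
    }
}

In the benchmark itself the tests come from the 17 source projects rather than from a hand-written harness like this, which is what distinguishes execution-based evaluation from match-based metrics.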

Authors (5)
  1. Zhengran Zeng
  2. Yidong Wang
  3. Rui Xie
  4. Wei Ye
  5. Shikun Zhang