
On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation (2405.01580v1)

Published 26 Apr 2024 in cs.SE and cs.AI

Abstract: The task of code generation from natural language (NL2Code) has become extremely popular, especially with the advent of LLMs. However, efforts to quantify and track this progress have suffered from a lack of reliable metrics for functional correctness. While popular benchmarks like HumanEval include test cases that enable reliable evaluation of correctness, collecting such test cases is time-consuming and requires human effort. As an alternative, several reference-based evaluation metrics have been proposed, with embedding-based metrics like CodeBERTScore being touted as having a high correlation with human preferences and functional correctness. In our work, we analyze the ability of embedding-based metrics like CodeBERTScore to measure functional correctness and other helpful constructs, such as editing effort, by examining the outputs of ten models over two popular code generation benchmarks. Our results show that while these metrics have a weak correlation with functional correctness (0.16), they are strongly correlated (0.72) with editing effort.
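
A minimal sketch of the kind of correlation analysis the abstract describes, assuming per-sample records of generated code, a reference solution, and a pass/fail test outcome. The `embedding_similarity` and `editing_effort` functions below are hypothetical placeholders (the paper uses CodeBERTScore and its own edit-effort measure), and the choice of point-biserial and Pearson correlations is an illustrative assumption, not the paper's exact statistical setup:

```python
# Sketch: correlating a similarity metric with functional correctness and
# with an editing-effort proxy. All names and data here are illustrative
# placeholders, not the paper's actual pipeline.
import difflib
import numpy as np
from scipy.stats import pearsonr, pointbiserialr


def embedding_similarity(candidate: str, reference: str) -> float:
    """Placeholder for an embedding-based score such as CodeBERTScore F1.

    A real implementation would embed both snippets with a code model;
    a token-overlap ratio keeps this sketch self-contained.
    """
    return difflib.SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def editing_effort(candidate: str, reference: str) -> float:
    """Proxy for editing effort: 1 - character-level similarity (higher = more edits)."""
    return 1.0 - difflib.SequenceMatcher(None, candidate, reference).ratio()


# (generated_code, reference_code, passed_tests) -- toy examples only.
samples = [
    ("def add(a, b): return a + b", "def add(x, y): return x + y", True),
    ("def add(a, b): return a - b", "def add(x, y): return x + y", False),
    ("def mul(a, b): return a * b", "def mul(x, y): return x * y", True),
    ("def mul(a, b): pass", "def mul(x, y): return x * y", False),
]

scores = np.array([embedding_similarity(c, r) for c, r, _ in samples])
passed = np.array([int(p) for _, _, p in samples])
effort = np.array([editing_effort(c, r) for c, r, _ in samples])

# Point-biserial correlation against the binary pass/fail outcome,
# Pearson correlation against the continuous editing-effort proxy.
r_correct, _ = pointbiserialr(passed, scores)
r_effort, _ = pearsonr(effort, scores)
print(f"metric vs. correctness:   r = {r_correct:.2f}")
print(f"metric vs. editing effort: r = {r_effort:.2f}")
```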

Authors (1)
  1. Atharva Naik (17 papers)
Citations (1)