CodeScore: Evaluating Code Generation by Learning Code Execution (2301.09043v4)

Published 22 Jan 2023 in cs.SE

Abstract: A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. 1. They primarily measure the surface differences between codes without considering their functional equivalence. However, functional equivalence is pivotal in evaluating the effectiveness of code generation, as different codes can perform identical operations. 2. They are predominantly designed for the Ref-only input format. However, code evaluation necessitates versatility in input formats. Aside from Ref-only, there are NL-only and Ref&NL formats, which existing match-based CEMs cannot effectively accommodate. In this paper, we propose CodeScore, a LLM-based CEM, which estimates the functional correctness of generated code on three input types. To acquire CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore absolutely improves up to 58.87% correlation with functional correctness compared to other CEMs, achieves state-of-the-art performance, and effectively handles three input formats.

CodeScore: Evaluating Code Generation by Learning Code Execution

The paper proposes an evaluation metric for code generation, named CodeScore, which aims to overcome the limitations of traditional match-based code evaluation metrics (CEMs) that focus on surface-level differences and are restricted to specific input formats. CodeScore employs LLMs to assess the functional correctness of generated code across three input types, namely Ref-only, NL-only, and Ref&NL. The authors introduce a unified code generation learning framework, UniCE, to train LLMs to learn code execution, i.e., to predict the PassRatio and Executability of generated code.
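To make the three input formats concrete, the following minimal Python sketch shows one way an evaluator could assemble Ref-only, NL-only, and Ref&NL inputs into a single unified string. The [NL]/[REF]/[GEN] tags and the helper name are illustrative assumptions, not the paper's actual preprocessing.

```python
# Hypothetical sketch of unifying the three input formats (Ref-only, NL-only,
# Ref&NL); the tag tokens and helper name are assumptions for illustration only.

def build_unified_input(generated_code: str,
                        reference_code: str | None = None,
                        nl_description: str | None = None) -> str:
    """Concatenate whatever context is available with the generated code."""
    parts = []
    if nl_description is not None:      # NL-only or Ref&NL
        parts.append(f"[NL] {nl_description}")
    if reference_code is not None:      # Ref-only or Ref&NL
        parts.append(f"[REF] {reference_code}")
    parts.append(f"[GEN] {generated_code}")
    return "\n".join(parts)

# Ref&NL example: both a reference solution and an NL description are provided.
print(build_unified_input(
    generated_code="def add(a, b):\n    return a + b",
    reference_code="def add(x, y):\n    return x + y",
    nl_description="Return the sum of two numbers.",
))
```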

Motivation and Challenges

The automatic evaluation of code generation is of substantial interest within both the NLP and software engineering communities. Existing match-based CEMs such as BLEU and CodeBLEU primarily emphasize lexical features and fail to account for functional equivalence, an essential factor in code evaluation: two programs can behave identically while sharing few surface tokens. Furthermore, these metrics are designed to handle only the Ref-only input format, limiting their adaptability when natural language descriptions (NL) or additional context are involved.
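As a toy illustration of this gap (not an experiment from the paper), the snippet below compares two functionally equivalent Python implementations. Simple bigram precision, used here as a rough stand-in for match-based scores such as BLEU, is low even though execution shows identical behavior on the assumed test inputs.

```python
# Toy example: surface overlap vs. functional equivalence. Bigram precision is a
# rough stand-in for match-based CEMs; the test inputs are made up for illustration.

ref = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
gen = "def total(xs):\n    return sum(xs)"

def bigrams(code: str):
    toks = code.split()
    return list(zip(toks, toks[1:]))

ref_bi, gen_bi = bigrams(ref), bigrams(gen)
precision = sum(1 for b in gen_bi if b in ref_bi) / len(gen_bi)
print(f"bigram precision: {precision:.2f}")  # 0.33 -- low surface similarity

# Functional check: both implementations agree on every test input.
ns_ref, ns_gen = {}, {}
exec(ref, ns_ref)
exec(gen, ns_gen)
tests = [[1, 2, 3], [], [5]]
print(all(ns_ref["total"](t) == ns_gen["total"](t) for t in tests))  # True
```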

CodeScore and UniCE Framework

CodeScore, as described in the paper, is an LLM-based metric that estimates functional correctness by learning code execution rather than relying on surface matching. The UniCE framework is designed to finetune LLMs, enabling them to learn code execution with unified inputs. The model is trained to predict PassRatio, the fraction of test cases the generated code passes, and a binary Executability signal, which distinguishes executable from non-executable code. Across multiple experiments, the approach improved correlation with functional correctness by up to 58.87% (absolute) over other CEMs.
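A minimal sketch of how these two training targets could be derived by actually executing generated code against test cases is shown below. It assumes a simple (function, test-case) setup and is not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code) of the two execution signals
# UniCE learns to predict: PassRatio = passed tests / total tests, and a binary
# Executability flag indicating whether the generated code runs at all.

from typing import Callable

def execution_labels(code: str, fn_name: str,
                     test_cases: list[tuple[tuple, object]]) -> tuple[float, int]:
    namespace: dict = {}
    try:
        exec(code, namespace)          # define the generated function
        fn: Callable = namespace[fn_name]
    except Exception:
        return 0.0, 0                  # code is not executable
    passed = 0
    for args, expected in test_cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass                       # a crashing test case simply does not pass
    return passed / len(test_cases), 1 # PassRatio, Executability = 1

code = "def add(a, b):\n    return a + b"
print(execution_labels(code, "add", [((1, 2), 3), ((0, 0), 0), ((2, 2), 5)]))
# -> (0.666..., 1): two of the three assumed test cases pass, and the code executes
```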

Experimental Validation

Empirical results demonstrate CodeScore's efficacy across three constructed datasets: APPS-Eval, MBPP-Eval, and HE-Eval. Notably, CodeScore outperformed both traditional match-based metrics and other LLM-based CEMs, achieving stronger correlation with functional correctness and lower mean absolute error. The paper also highlights CodeScore's versatility across the three input formats, and its evaluation is substantially faster than execution-based CEMs, drastically lowering computational cost.
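The kind of comparison reported here can be reproduced in outline with a few lines of Python. The scores and PassRatio values below are invented purely to show how correlation and mean absolute error are computed, not results from the paper; SciPy is assumed as an external dependency.

```python
# Illustrative metric-vs-ground-truth comparison with made-up numbers: correlate
# a CEM's scores with executed PassRatio and compute the mean absolute error.

from scipy.stats import kendalltau, pearsonr

ground_truth_pass_ratio = [0.0, 0.2, 0.5, 0.8, 1.0, 1.0]    # from running tests
metric_scores           = [0.1, 0.15, 0.6, 0.7, 0.9, 0.95]  # e.g. CodeScore outputs

tau, _ = kendalltau(metric_scores, ground_truth_pass_ratio)
r, _ = pearsonr(metric_scores, ground_truth_pass_ratio)
mae = sum(abs(m - g) for m, g in
          zip(metric_scores, ground_truth_pass_ratio)) / len(metric_scores)

print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}, MAE = {mae:.3f}")
```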

Implications and Future Directions

The paper provides a pathway toward more accurate and computationally efficient code evaluation metrics. This research could facilitate the advancement of code generation technologies by improving the accuracy of feedback used for model training, informing new programming workflows, and cutting development costs. Future work might expand CodeScore's capabilities to broader programming scenarios and further refine its efficiency.

In conclusion, CodeScore presents a robust approach to measuring the functional correctness of generated code, addressing longstanding shortcomings of match-based CEMs. This advancement supports more holistic and practical code evaluation, paving the way for future innovations in AI-driven coding solutions.

References (49)
  1. NS3: Neuro-Symbolic Semantic Code Search. CoRR abs/2205.10674 (2022).
  2. Program Synthesis with Large Language Models. CoRR abs/2108.07732 (2021).
  3. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In IEEvaluation@ACL. Association for Computational Linguistics, 65–72.
  4. Auguste Bravais. 1844. Analyse mathématique sur les probabilités des erreurs de situation d’un point. Impr. Royale.
  5. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021).
  6. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). Association for Computational Linguistics, 4171–4186.
  7. Self-collaboration Code Generation via ChatGPT. CoRR abs/2304.07590 (2023).
  8. Antecedent Predictions Are Dominant for Tree-Based Code Generation. CoRR abs/2208.09998 (2022).
  9. CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation. In ISSTA.
  10. Aryaz Eghbali and Michael Pradel. 2022. CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code. In ASE. ACM, 28:1–28:12.
  11. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of EMNLP 2020. Association for Computational Linguistics, 1536–1547.
  12. Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 (2022).
  13. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In ACL (1). Association for Computational Linguistics, 7212–7225.
  14. Learning to Complete Code with Sketches. In ICLR. OpenReview.net.
  15. AixBench: A Code Generation Benchmark Dataset. CoRR abs/2206.13179 (2022).
  16. Measuring Coding Challenge Competence With APPS. In NeurIPS Datasets and Benchmarks.
  17. Fault-Aware Neural Code Rankers. CoRR abs/2206.03865 (2022).
  18. Self-planning Code Generation with Large Language Model. CoRR abs/2303.06689 (2023).
  19. Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81–93.
  20. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  21. SPoC: Search-based Pseudocode to Code. In NeurIPS. 11883–11894.
  22. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
  23. Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, 74–81.
  24. ReACC: A Retrieval-Augmented Code Completion Framework. In ACL. Association for Computational Linguistics, 6227–6240.
  25. Alexander McFarlane Mood. 1950. Introduction to the Theory of Statistics. (1950).
  26. Neural Program Generation Modulo Static Analysis. In NeurIPS. 18984–18996.
  27. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv preprint arXiv:2203.13474 (2022).
  28. OpenAI. [n. d.]. ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/
  29. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL. ACL, 311–318.
  30. Deep Contextualized Word Representations. In NAACL-HLT. Association for Computational Linguistics, 2227–2237.
  31. Maja Popovic. 2015. chrF: character n-gram F-score for automatic MT evaluation. In WMT@EMNLP. The Association for Computer Linguistics, 392–395.
  32. TransQuest: Translation Quality Estimation with Cross-lingual Transformers. In COLING. International Committee on Computational Linguistics, 5070–5081.
  33. COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task. In WMT@EMNLP. Association for Computational Linguistics, 578–585.
  34. Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task. In WMT@EMNLP. Association for Computational Linguistics, 1030–1040.
  35. COMET: A Neural Framework for MT Evaluation. In EMNLP (1). Association for Computational Linguistics, 2685–2702.
  36. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 3980–3990.
  37. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. CoRR abs/2009.10297 (2020).
  38. Unsupervised Translation of Programming Languages. In NeurIPS.
  39. BLEURT: Learning Robust Metrics for Text Generation. In ACL. Association for Computational Linguistics, 7881–7892.
  40. Incorporating domain knowledge through task augmentation for front-end JavaScript code generation. In ESEC/SIGSOFT FSE. ACM, 1533–1543.
  41. Code Search based on Context-aware Code Translation. In ICSE. ACM, 388–400.
  42. BERT Rediscovers the Classical NLP Pipeline. In ACL (1). Association for Computational Linguistics, 4593–4601.
  43. UniTE: Unified Translation Evaluation. In ACL (1). Association for Computational Linguistics, 8117–8127.
  44. Pengcheng Yin and Graham Neubig. 2018. TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation. In EMNLP. 7–12.
  45. BARTScore: Evaluating Generated Text as Text Generation. In NeurIPS. 27263–27277.
  46. BERTScore: Evaluating Text Generation with BERT. In ICLR. OpenReview.net.
  47. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 563–578.
  48. CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. CoRR abs/2302.05527 (2023).
  49. Multilingual Code Snippets Training for Program Translation. In AAAI. AAAI Press, 11783–11790.
Authors (6)
  1. Yihong Dong (35 papers)
  2. Jiazheng Ding (5 papers)
  3. Xue Jiang (82 papers)
  4. Ge Li (213 papers)
  5. Zhuo Li (164 papers)
  6. Zhi Jin (160 papers)
Citations (41)