Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

Published 13 Feb 2024 in cs.SE and cs.LG | arXiv:2402.08699v2

Abstract: To evaluate code LLMs, research has relied on a few small, manually curated benchmarks such as HumanEval and MBPP, which cover only a narrow slice of real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows code LLM evaluation on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code in natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check whether this round trip yields code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC correlates strongly with model performance on existing narrow-domain code synthesis benchmarks while allowing us to expand evaluation to a much broader set of domains and tasks, which was not previously possible without costly human annotations.
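
The round-trip idea in the abstract is straightforward to express as an evaluation loop. Below is a minimal sketch, not the authors' implementation: `forward_model` (code to natural-language description), `backward_model` (description back to code), and `passes_tests` (a semantic-equivalence oracle, e.g. unit-test execution) are hypothetical callables the caller must supply, and sampling several backward passes per description is an assumption for robustness.

```python
# Minimal sketch of round-trip correctness (RTC) evaluation.
# forward_model, backward_model, and passes_tests are hypothetical
# stand-ins; the equivalence check is a proxy such as running the
# original snippet's unit tests against the regenerated code.

from typing import Callable, Iterable


def rtc_score(
    snippets: Iterable[str],
    forward_model: Callable[[str], str],        # code -> NL description
    backward_model: Callable[[str], str],       # NL description -> code
    passes_tests: Callable[[str, str], bool],   # (original, candidate) -> equivalent?
    samples_per_snippet: int = 5,
) -> float:
    """Return the fraction of round trips that reproduce equivalent code."""
    total, passed = 0, 0
    for code in snippets:
        description = forward_model(code)            # forward pass: describe the code
        for _ in range(samples_per_snippet):
            candidate = backward_model(description)  # backward pass: regenerate code
            total += 1
            if passes_tests(code, candidate):        # proxy for semantic equivalence
                passed += 1
    return passed / total if total else 0.0
```

Because no reference solutions or hand-written prompts are needed, the same loop can be pointed at any corpus of tested code, which is what lets RTC scale beyond hand-curated benchmarks.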

