Aligning Offline Metrics and Human Judgments of Value for Code Generation Models (2210.16494v2)
Abstract: Large language models (LLMs) have demonstrated great potential to assist programmers in generating code. For such human-AI pair programming scenarios, we empirically demonstrate that while generated code is most often evaluated in terms of functional correctness (i.e., whether generations pass available unit tests), correctness does not fully capture (and may even underestimate) the productivity gains these models can provide. Through a user study with N = 49 experienced programmers, we show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task. Finally, we propose a hybrid metric that combines functional correctness and syntactic similarity, and show that it achieves a 14% stronger correlation with value and can therefore better represent real-world gains when evaluating and comparing models.
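The abstract does not spell out the exact formulation of the hybrid metric. The sketch below is only an illustrative combination, assuming a binary pass/fail correctness signal, a generic string-similarity ratio standing in for the paper's syntactic-similarity component, and an arbitrary weight; the function names, the representation of unit tests as callables, and the 0.5 weighting are hypothetical, not the authors' implementation.

```python
import difflib

def functional_correctness(generation: str, unit_tests) -> float:
    """Return 1.0 if the generated code passes all unit tests, else 0.0."""
    namespace = {}
    try:
        exec(generation, namespace)      # execute the candidate code
        for test in unit_tests:
            test(namespace)              # each test raises on failure (illustrative convention)
        return 1.0
    except Exception:
        return 0.0

def syntactic_similarity(generation: str, reference: str) -> float:
    """Character-level similarity between the generation and a reference solution."""
    return difflib.SequenceMatcher(None, generation, reference).ratio()

def hybrid_value(generation: str, reference: str, unit_tests, weight: float = 0.5) -> float:
    """Weighted mix of correctness and similarity as a rough proxy for rated value."""
    return (weight * functional_correctness(generation, unit_tests)
            + (1 - weight) * syntactic_similarity(generation, reference))

# Example usage: a generation that fails the test can still score > 0
# because it is syntactically close to the reference.
reference = "def add(a, b):\n    return a + b\n"
generation = "def add(a, b):\n    return a - b\n"
tests = [lambda ns: (_ for _ in ()).throw(AssertionError) if ns["add"](1, 2) != 3 else None]
print(hybrid_value(generation, reference, tests))
```

The key design point the abstract motivates is that a purely binary correctness score would assign this generation zero value, whereas a similarity term preserves partial credit for code that reduces the programmer's remaining effort.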
Authors: Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, Saleema Amershi