The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers (2404.02806v2)

Published 3 Apr 2024 in cs.SE, cs.AI, and cs.HC

Abstract: Evaluation of LLMs for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are not proportional -- a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better proxy signals. We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models.

Evaluating the Impact of LLMs on Programmer Productivity through RealHumanEval

Introduction

Recent advancements in LLMs have led to their increasing adoption as tools to aid programmers in various tasks, ranging from autocomplete functionalities to answering queries via chat interfaces. While static benchmarks have been instrumental in gauging the capabilities of these models in generating syntactically correct and logically sound code, there is a growing interest in understanding how these enhancements translate into real-world productivity gains for programmers. This paper introduces RealHumanEval, a comprehensive framework designed to evaluate the effectiveness of LLMs in improving programmer productivity through a human-centered approach.

The RealHumanEval Framework

RealHumanEval provides a platform on which programmers interact with LLMs in two primary modes: autocomplete and chat-based assistance. The framework records performance metrics such as task completion time and acceptance rates of model suggestions, offering insight into the practical utility of LLMs in programming contexts. It also enables an assessment of the correlation between programmers' preferences for specific LLM interventions and their actual performance improvements.
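
As a rough illustration of how such measurements can be derived from interaction logs, the sketch below defines a hypothetical telemetry schema and computes two of the metrics mentioned above: suggestion acceptance rate and mean task completion time. The event kinds and field names are assumptions made for illustration, not the schema of the released RealHumanEval platform.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical event schema -- the field names and event kinds below are
# illustrative assumptions, not taken from the released RealHumanEval code.
@dataclass
class TelemetryEvent:
    participant_id: str
    task_id: str
    kind: str         # e.g. "suggestion_shown", "suggestion_accepted", "task_completed"
    timestamp: float  # seconds since the session started

def acceptance_rate(events: List[TelemetryEvent]) -> float:
    """Fraction of shown autocomplete suggestions that were accepted."""
    shown = sum(e.kind == "suggestion_shown" for e in events)
    accepted = sum(e.kind == "suggestion_accepted" for e in events)
    return accepted / shown if shown else 0.0

def mean_task_completion_time(events: List[TelemetryEvent]) -> float:
    """Average seconds from a task's first logged event to its completion event."""
    first_seen: Dict[str, float] = {}
    completed: Dict[str, float] = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        first_seen.setdefault(e.task_id, e.timestamp)
        if e.kind == "task_completed":
            completed.setdefault(e.task_id, e.timestamp)
    durations = [completed[t] - first_seen[t] for t in completed]
    return sum(durations) / len(durations) if durations else float("nan")
```

Per-participant aggregates of these two quantities are the kind of raw material the study's comparisons across LLM conditions rely on.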

User Study Methodology

A user study with 243 participants demonstrates the utility of RealHumanEval in examining the impact of different LLMs on programmer productivity. Participants were divided into groups receiving either no LLM support, autocomplete support, or chat-based support from one of seven LLMs of varying performance levels on static benchmarks. The study's design allowed for a nuanced analysis of how LLM assistance, benchmark performance, and programmer preferences contribute to productivity in real-world programming tasks.
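
As a minimal sketch of this between-subjects structure, the snippet below enumerates one control arm plus an autocomplete arm and a chat arm per model, and assigns each participant to an arm deterministically. The arm labels and model placeholders are illustrative assumptions, not the study's exact conditions or randomization scheme.

```python
import random

# Placeholder identifiers -- the seven LLMs in the study are not named here.
MODELS = [f"model_{i}" for i in range(1, 8)]

# One control arm plus an autocomplete arm and a chat arm for each model.
CONDITIONS = (["no_llm"]
              + [f"autocomplete:{m}" for m in MODELS]
              + [f"chat:{m}" for m in MODELS])

def assign_condition(participant_id: str, seed: int = 0) -> str:
    """Deterministically assign a participant to one study arm."""
    rng = random.Random(f"{participant_id}-{seed}")
    return rng.choice(CONDITIONS)

print(assign_condition("participant_042"))  # e.g. "chat:model_3"
```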

Key Findings

  • Benchmark Performance and Productivity: The study finds a positive correlation between an LLM's performance on static benchmarks and its ability to enhance programmer productivity, particularly in reducing the time spent on coding tasks. However, the relationship is not proportional, indicating diminishing returns in productivity gains with further improvements in benchmark performance (a toy version of this comparison is sketched after this list).
  • Programmer Preferences: Contrary to expectations, the study finds no significant correlation between programmers' preferences for certain types of LLM support (e.g., acceptance rates of autocomplete suggestions) and actual improvements in productivity metrics such as task completion time.
  • Impact of LLM Assistance Type: While both autocomplete and chat-based support improved productivity compared to no LLM support, there were notable differences in how programmers perceived their utility: chat-based assistance received higher helpfulness ratings from participants, despite similar productivity gains observed with autocomplete support.
  • Task Type Sensitivity: The study also highlights how the effectiveness of LLM assistance varies across types of programming tasks, with data manipulation tasks benefiting more from LLM support than algorithmic problem-solving tasks.
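
The first two findings boil down to rank-correlation checks between per-model aggregates. The sketch below runs that kind of check on invented numbers (they are not the study's data) to show the shape of the analysis: a clearly negative correlation between benchmark score and task time, alongside a weak correlation between acceptance rate and task time, would mirror the reported pattern. scipy is assumed to be available.

```python
from scipy.stats import spearmanr  # assumes scipy is installed

# Hypothetical per-model aggregates: benchmark pass rate, mean task completion
# time (minutes), and mean autocomplete acceptance rate in that model's arm.
# All values are invented for illustration; they are not the study's data.
benchmark_pass_rate = [0.30, 0.45, 0.52, 0.61, 0.70, 0.78, 0.85]
mean_task_minutes   = [14.2, 13.1, 12.8, 11.9, 11.5, 11.2, 11.0]
acceptance_rate     = [0.22, 0.35, 0.28, 0.40, 0.31, 0.38, 0.33]

# Benchmark score vs. productivity: a negative rank correlation with task time
# corresponds to "models with better benchmark scores save programmers time".
rho, p = spearmanr(benchmark_pass_rate, mean_task_minutes)
print(f"benchmark vs. task time: rho={rho:.2f}, p={p:.3f}")

# Preference proxy vs. productivity: a weak, non-significant correlation here
# echoes the finding that acceptance rates do not track actual time savings.
rho, p = spearmanr(acceptance_rate, mean_task_minutes)
print(f"acceptance rate vs. task time: rho={rho:.2f}, p={p:.3f}")
```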

Implications and Future Directions

The findings underscore the importance of considering human-centric measures and direct productivity metrics in evaluating LLMs for programming support, beyond static benchmark performance. RealHumanEval's open-source availability promises to facilitate further research in this direction, enabling the exploration of new models and interaction paradigms. Future work could focus on enhancing LLMs' context understanding capabilities, personalizing the timing and nature of interventions, and developing more refined mechanisms for integrating LLM assistance into programming workflows.

Conclusion

Through the development and deployment of RealHumanEval, this paper provides valuable insights into the complex dynamics between LLM benchmark performance, programmer preferences, and real-world productivity. As LLMs continue to evolve, frameworks like RealHumanEval will play a critical role in guiding their development towards maximizing tangible benefits for programmers.

References (48)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Amazon. Ml-powered coding companion – amazon codewhisperer, 2022. URL https://aws.amazon.com/codewhisperer/.
  3. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  4. Grounded copilot: How programmers interact with code-generating models. arXiv preprint arXiv:2206.15000, 2022.
  5. Taking flight with copilot: Early insights and opportunities of ai-powered pair-programming tools. Queue, 20(6):35–57, 2022.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Multipl-e: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 2023.
  8. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  9. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024.
  10. Conversational challenges in ai-powered data science: Obstacles, needs, and design opportunities. arXiv preprint arXiv:2310.16164, 2023.
  11. Aligning offline metrics and human judgments of value for code generation models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8516–8528, 2023.
  12. Large language models of code fail at completing code with potential bugs. arXiv preprint arXiv:2306.03438, 2023.
  13. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  14. Github. Github copilot - your ai pair programmer, 2022. URL https://github.com/features/copilot.
  15. How do analysts understand and verify ai-assisted data analyses? arXiv preprint arXiv:2309.10947, 2023.
  16. Sandra G Hart. Nasa-task load index (nasa-tlx); 20 years later. In Proceedings of the human factors and ergonomics society annual meeting, volume 50, pages 904–908. Sage publications Sage CA: Los Angeles, CA, 2006.
  17. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620, 2023.
  18. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
  19. How novices use llm-based code generators to solve cs1 coding tasks in a self-paced learning environment. arXiv preprint arXiv:2309.14049, 2023.
  20. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. arXiv preprint arXiv:2303.03004, 2023.
  21. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pages 18319–18345. PMLR, 2023.
  22. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets. arXiv preprint arXiv:2305.18486, 2023.
  23. Can gpt-4 replicate empirical software engineering research? arXiv preprint arXiv:2310.01727, 2023.
  24. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.
  25. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=6lE4dQXaUcb.
  26. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, 2022.
  27. Reading between the lines: Modeling user behavior and costs in ai-assisted programming. arXiv preprint arXiv:2210.14306, 2022.
  28. Simulating iterative human-ai interaction in programming with llms. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
  29. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.
  30. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
  31. OpenAI. Chatgpt: Optimizing language models for dialogue, 2022a. URL https://openai.com/blog/chatgpt/.
  32. OpenAI. Chatgpt: Introducing chatgpt. https://openai.com/blog/chatgpt, 2022b.
  33. The impact of ai on developer productivity: Evidence from github copilot. arXiv preprint arXiv:2302.06590, 2023.
  34. “it’s weird that it knows what i want”: Usability and interactions with copilot for novice programmers. ACM Trans. Comput.-Hum. Interact., 31(1), nov 2023. ISSN 1073-0516. doi: 10.1145/3617367. URL https://doi.org/10.1145/3617367.
  35. replit. Meet ghostwriter, your partner in code., 2023. URL https://replit.com/site/ghostwriter.
  36. The programmer’s assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 491–514, 2023.
  37. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  38. An analysis of the automatic bug fixing performance of chatgpt. arXiv preprint arXiv:2301.08653, 2023.
  39. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7, 2022.
  40. Recode: Robustness evaluation of code generation models. arXiv preprint arXiv:2212.10264, 2022.
  41. Is ai the better programming partner? human-human pair programming vs. human-ai pair programming. arXiv preprint arXiv:2306.05153, 2023.
  42. Devgpt: Studying developer-chatgpt conversations. arXiv preprint arXiv:2309.03914, 2023.
  43. Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation. arXiv preprint arXiv:2311.08588, 2023.
  44. Intercode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023.
  45. Large language models meet nl2code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7443–7464, 2023.
  46. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  47. Xlcost: A benchmark dataset for cross-lingual code intelligence, 2022. URL https://arxiv.org/abs/2206.08474.
  48. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 21–29, 2022.
Authors (10)
  1. Hussein Mozannar
  2. Valerie Chen
  3. Mohammed Alsobay
  4. Subhro Das
  5. Sebastian Zhao
  6. Dennis Wei
  7. Manish Nagireddy
  8. Prasanna Sattigeri
  9. Ameet Talwalkar
  10. David Sontag