LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation (2404.10100v2)

Published 15 Apr 2024 in cs.SE

Abstract: LLMs have shown great potential for automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user's intent. In this paper, we propose a novel interactive workflow, TiCoder, for guided intent clarification (i.e., partial formalization) through tests, to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the workflow's effectiveness at improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
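The abstract describes the TiCoder loop only at a high level. The sketch below is a minimal, hypothetical illustration of how such test-driven intent clarification could work: the LLM proposes candidate implementations, a discriminating test is surfaced to the user, and the candidate set is pruned based on whether the user accepts or rejects that test. Every function name here (generate_candidates, generate_test, user_approves, passes) is a placeholder assumption, not the paper's actual API, and the pruning policy for rejected tests is likewise an assumption.

```python
# Minimal sketch of a TiCoder-style test-driven clarification loop.
# All callables are hypothetical placeholders; the paper's actual
# implementation and interfaces may differ.
from typing import Callable, List, Tuple

def ticoder_loop(
    generate_candidates: Callable[[str], List[str]],  # LLM: NL intent -> code candidates
    generate_test: Callable[[str, List[str]], str],   # LLM: propose a discriminating test
    user_approves: Callable[[str], bool],             # user feedback on the proposed test
    passes: Callable[[str, str], bool],               # does candidate code pass the test?
    intent: str,
    max_interactions: int = 5,  # the paper reports gains within 5 interactions
) -> Tuple[List[str], List[str]]:
    """Prune candidate programs by asking the user to approve or reject tests."""
    candidates = generate_candidates(intent)
    approved_tests: List[str] = []

    for _ in range(max_interactions):
        if len(candidates) <= 1:
            break  # intent is (operationally) disambiguated
        test = generate_test(intent, candidates)
        if user_approves(test):
            # An approved test partially formalizes the intent:
            # keep only candidates consistent with it.
            approved_tests.append(test)
            candidates = [c for c in candidates if passes(c, test)]
        else:
            # Assumed policy: a rejected test encodes unwanted behavior,
            # so drop candidates that satisfy it.
            candidates = [c for c in candidates if not passes(c, test)]

    # Surviving candidates are the refined suggestions; approved tests double
    # as the automatically generated unit tests the abstract mentions.
    return candidates, approved_tests
```

In the paper's scaled experiments, user feedback is idealized; in a sketch like this, user_approves could be simulated by checking each proposed test against a hidden reference solution.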

Authors (5)
  1. Sarah Fakhoury (10 papers)
  2. Aaditya Naik (8 papers)
  3. Georgios Sakkas (4 papers)
  4. Saikat Chakraborty (62 papers)
  5. Shuvendu K. Lahiri (32 papers)
Citations (7)