Lessons from Building StackSpot AI: A Contextualized AI Coding Assistant (2311.18450v3)
Abstract: With their exceptional natural language processing capabilities, tools based on LLMs like ChatGPT and Copilot have swiftly become indispensable resources in the software developer's toolkit. While recent studies suggest the potential productivity gains these tools can unlock, users still encounter drawbacks, such as generic or incorrect answers. Additionally, the pursuit of improved responses often leads to extensive prompt engineering efforts, diverting valuable time from writing code that delivers actual value. To address these challenges, a new breed of tools, built atop LLMs, is emerging. These tools aim to mitigate the drawbacks by employing techniques such as fine-tuning or enriching user prompts with contextualized information. In this paper, we delve into the lessons learned by a software development team that built such a contextualized LLM-based application, called CodeBuddy, using retrieval-based techniques. Over a four-month period, the team, despite lacking prior professional experience with LLM-based applications, built the product from scratch. Following the initial product release, we engaged with the development team responsible for the code-generative components. Through interviews and analysis of the application's issue tracker, we uncovered various intriguing challenges that teams working on LLM-based applications might encounter. In particular, we found three main groups of lessons: LLM-based lessons, User-based lessons, and Technical lessons. By understanding these lessons, software development teams can become better prepared to build LLM-based applications.
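The prompt-enrichment approach the abstract refers to can be illustrated with a minimal sketch. This is not the paper's implementation: all names are hypothetical, and a real retrieval-augmented system would rank context with vector embeddings rather than the crude token-overlap score used here.

```python
# Minimal sketch of retrieval-based prompt enrichment: rank candidate
# context snippets against the user's query, keep the top-k, and prepend
# them to the prompt sent to the LLM. Token overlap (Jaccard similarity)
# stands in for a proper embedding-based retriever.

def tokenize(text: str) -> set[str]:
    """Lowercased word-level tokens for a crude similarity measure."""
    return set(text.lower().split())

def retrieve(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets with the highest Jaccard similarity to the query."""
    q = tokenize(query)
    scored = sorted(
        snippets,
        key=lambda s: len(q & tokenize(s)) / max(len(q | tokenize(s)), 1),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, snippets: list[str]) -> str:
    """Enrich the raw user query with retrieved project context."""
    context_block = "\n".join(f"- {c}" for c in retrieve(query, snippets))
    return f"Context:\n{context_block}\n\nQuestion: {query}"
```

Under this sketch, a question about refund handling would pull in the snippet mentioning refunds before the unrelated ones, so the LLM answers against project-specific context instead of producing the generic responses the abstract describes.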
- Gustavo Pinto
- Cleidson de Souza
- João Batista Neto
- Alberto de Souza
- Tarcísio Gotto
- Edward Monteiro