Building Your Own Product Copilot: Challenges, Opportunities, and Needs (2312.14231v1)

Published 21 Dec 2023 in cs.SE

Abstract: A race is underway to embed advanced AI capabilities into products. These product copilots enable users to ask questions in natural language and receive relevant responses that are specific to the user's context. In fact, virtually every large technology company is looking to add these capabilities to their software products. However, for most software engineers, this is often their first encounter with integrating AI-powered technology. Furthermore, software engineering processes and tools have not caught up with the challenges and scale involved with building AI-powered applications. In this work, we present the findings of an interview study with 26 professional software engineers responsible for building product copilots at various companies. From our interviews, we found pain points at every step of the engineering process and the challenges that strained existing development practices. We then conducted group brainstorming sessions to collaboratively ideate on opportunities and tool designs for the broader software engineering community.

Authors (6)
  1. Chris Parnin (19 papers)
  2. Gustavo Soares (21 papers)
  3. Rahul Pandita (6 papers)
  4. Sumit Gulwani (55 papers)
  5. Jessica Rich (1 paper)
  6. Austin Z. Henley (12 papers)
Citations (16)

Summary

  • The paper identifies key challenges in integrating AI copilots, focusing on prompt engineering and complex interaction orchestration.
  • It highlights the trial-and-error process of tuning large language models and managing multi-turn conversations.
  • The study underscores the urgent need for innovative tools and automated benchmarks to enable reliable AI-driven product development.

Introduction

Integrating advanced AI capabilities into software products has become a prevalent undertaking in the tech industry. Software engineers embarking on this journey encounter a new paradigm fraught with challenges unique to building applications powered by artificial intelligence, particularly when working with LLMs. Although these capabilities promise to transform how users interact with technology, embedding AI into products, especially in the form of conversational agents or copilots, requires a considerable evolution of both tooling and software engineering practices.

Understanding the Software Engineer's AI Challenges

The implementation of product copilots often marks the first foray into AI for many software engineers. The paper presents an interview study with 26 professionals in this nascent field, surfacing the hurdles they encountered at every step of the engineering process. Prompt engineering emerges as a particularly strenuous task: engineers coax desirable behaviors and responses out of LLMs through a process that veers more towards an art than a science, given the volatile nature of these models. They face an arduous cycle of trial and error, crafting prompts in playground environments and continuously tweaking them to handle corner cases and context variations. This painstaking task highlights the need for tools capable of systematically managing and validating prompts.
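To make the gap concrete, the sketch below shows one shape such tooling could take: versioned prompt templates plus a small regression check over curated corner cases. This is a minimal illustration, not tooling from the paper; the template, the test-case format, and the injected call_model function are all hypothetical, with call_model standing in for whatever model client a team actually uses.

```python
# Minimal sketch (hypothetical names): versioned prompt templates with a
# regression check over curated corner cases. `call_model` is injected and
# stands in for any chat-completion client a team might use.
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    version: str
    system: str
    user_template: str

    def render(self, **kwargs) -> list[dict]:
        # Produce chat-style messages from the template.
        return [
            {"role": "system", "content": self.system},
            {"role": "user", "content": self.user_template.format(**kwargs)},
        ]

ANSWER_V2 = PromptTemplate(
    version="2",
    system="You are a product copilot. Answer only from the provided context.",
    user_template="Context:\n{context}\n\nQuestion: {question}",
)

def check_prompt(template: PromptTemplate, cases: list[dict], call_model) -> float:
    """Replay curated corner cases and report the fraction that pass."""
    passed = sum(
        case["expect_substring"].lower()
        in call_model(template.render(**case["inputs"])).lower()
        for case in cases
    )
    return passed / len(cases)
```

Checking a template against a saved case set on every change turns the playground's ad-hoc tweaking into something closer to a regression test, which is the kind of systematization the interviewees asked for.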

Orchestration: Crafting the Interactions

Beyond prompt engineering, the orchestration of copilots poses its own complexities. Intent detection and routing workflows demand a delicate balance between providing context and executing commands within applications. Existing frameworks and libraries offer initial building blocks, but development often transcends simple prompt engineering. Engineers grapple with limits on how precisely they can command AI copilots, with planning multi-turn interactions, and with maintaining a coherent conversational state, all of which reflects the sophistication required to manage AI behavior within product environments.
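As a rough illustration of this orchestration layer, the sketch below routes each utterance to a handler based on a classified intent and threads shared conversational state through the handlers. The intent labels, handlers, and classify function are hypothetical, not the paper's architecture; in practice the classifier is often itself an LLM call, which is part of what makes this layer hard to get right.

```python
# Hedged sketch of intent routing with shared conversational state.
# Intent names and handlers are illustrative assumptions.
from typing import Callable

State = dict  # e.g. {"history": [...], "open_document": ...}

HANDLERS: dict[str, Callable[[str, State], str]] = {}

def intent(name: str):
    """Decorator that registers a handler for a classified intent."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@intent("search_docs")
def search_docs(utterance: str, state: State) -> str:
    # Retrieval-backed answer; a real handler would query a document index.
    state["last_query"] = utterance
    return f"Searching documentation for: {utterance!r}"

@intent("run_command")
def run_command(utterance: str, state: State) -> str:
    # Actions that change application state warrant explicit confirmation.
    return f"About to execute: {utterance!r}. Confirm?"

def route(utterance: str, state: State, classify: Callable[[str], str]) -> str:
    """Classify the utterance (often itself an LLM call) and dispatch."""
    label = classify(utterance)
    state.setdefault("history", []).append((label, utterance))
    return HANDLERS.get(label, search_docs)(utterance, state)
```

Even this toy version surfaces the pain points the interviewees named: the router must decide what context each handler sees, misclassified intents need a sensible fallback, and the accumulated history has to be trimmed and summarized as conversations grow.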

Testing Copilots and the Quest for Reliability

Software engineers typically fall back on classical engineering methodologies, such as unit testing, to measure reliability and performance. However, generative models defy these practices: their outputs are nondeterministic, so the same test can pass on one run and fail on the next. Respondents employ diverse strategies, from running a test multiple times and checking for a passing threshold to manually curating input and output examples, an unsustainable approach that underscores the pressing need for automated benchmark creation and suitable metrics for AI tasks.
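The run-many-times strategy respondents describe can be expressed as a small statistical helper, sketched below. The stubbed check and the pass-rate threshold are illustrative assumptions; a real test would replace the stub with an actual copilot call.

```python
# Sketch of threshold-based testing for nondeterministic model output.
# The stubbed check and the n/threshold values are illustrative only.
import random
from typing import Callable

def passes_threshold(run_case: Callable[[], bool], n: int = 10,
                     threshold: float = 0.8) -> bool:
    """Re-run a nondeterministic check and accept it if enough runs pass."""
    passes = sum(run_case() for _ in range(n))
    return passes / n >= threshold

def answer_mentions_refund_window() -> bool:
    # Stand-in for a real copilot call that sometimes drifts off-topic.
    answer = ("Refunds are accepted within 30 days."
              if random.random() < 0.9 else "I am not sure.")
    return "30 days" in answer  # ground-truth substring check

if __name__ == "__main__":
    print(passes_threshold(answer_mentions_refund_window, n=20))
```

The helper makes the flakiness explicit rather than hiding it, but it also shows why respondents find the approach unsustainable: every test still needs a hand-written ground-truth check and a hand-tuned threshold.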

Conclusion: The Road Ahead for AI-driven Development

The integration of AI into products is still a growing domain, and as the capabilities of LLMs evolve, so too must the expertise and toolsets of software engineers. A clear message emerges: software engineering for AI is fundamentally different. It requires an open mind, iterative learning, and new definitions of what constitutes successful testing and validation. The field stands on the brink of a tooling revolution that could streamline AI integration into software engineering workflows, making the development of product copilots more accessible, efficient, and robust. This paper lays the groundwork for future innovations and establishes the need for a collaborative effort to craft a new era of AI-first software development.
