Exploring Prompt Engineering Practices in the Enterprise

Published 13 Mar 2024 in cs.HC and cs.AI | (2403.08950v1)

Abstract: Interaction with LLMs is primarily carried out via prompting. A prompt is a natural language instruction designed to elicit certain behaviour or output from a model. In theory, natural language prompts enable non-experts to interact with and leverage LLMs. However, for complex tasks and tasks with specific requirements, prompt design is not trivial. Creating effective prompts requires skill and knowledge, as well as significant iteration in order to determine model behavior, and guide the model to accomplish a particular goal. We hypothesize that the way in which users iterate on their prompts can provide insight into how they think prompting and models work, as well as the kinds of support needed for more efficient prompt engineering. To better understand prompt engineering practices, we analyzed sessions of prompt editing behavior, categorizing the parts of prompts users iterated on and the types of changes they made. We discuss design implications and future directions based on these prompt engineering practices.

Summary

  • The paper examines prompt editing behaviors in enterprise environments, revealing iterative refinements of context and task instructions as dominant practices.
  • The paper analyzes quantitative metrics from 57 sessions, highlighting frequent parameter changes and high similarity ratios that underscore incremental edits.
  • The paper discusses implications for tool design, advocating enhanced version control, structured debugging, and standardized templates to boost efficiency.

Detailed Analysis of Enterprise Prompt Engineering Practices

Introduction and Context

The paper "Exploring Prompt Engineering Practices in the Enterprise" (2403.08950) systematically examines how practitioners in enterprise settings edit and refine prompts for LLMs across a variety of use cases. Rather than focusing on optimization strategies or technical frameworks, the study provides granular insight into prompt engineering behaviors, including the types of edits users make, their frequency, and the components of prompts most subject to modification.

Enterprise LLM use cases are diverse, comprising tasks such as code/SQL generation, content summarization, classification, extraction, and content-grounded Q&A. The operational requirements often demand outputs with high specificity and accuracy, and involve interaction with models varying in capability, cost, and specialization. The paper builds upon prior taxonomies and qualitative studies, expanding the understanding of prompt engineering to real-world enterprise contexts with dedicated tools for prompt experimentation.

Quantitative Analysis of Prompt Sessions

The analysis draws from 57 prompt editing sessions sampled from a broader dataset, capturing a total of 1523 individual prompt edits. Sessions tend to be lengthy, with a mean duration of 43.4 minutes and a median of 39 minutes, reflecting substantial iterative effort (Figure 1).

Figure 1: Distribution of prompt editing session durations, highlighting the commonly extended session lengths.

Prompt similarity ratios between successive edits predominantly fall between 0.7 and 1.0, indicating that most changes are incremental refinements or tweaks rather than wholesale revisions. Occasionally, larger edits occur, potentially reflecting parallel prompting workflows or significant task shifts (Figure 2).

Figure 2: Sequence similarity ratios between successive prompts, with values near 1 indicating incremental edits.
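
These similarity ratios can be reproduced in principle with a standard sequence-matching metric. The sketch below assumes a character-level measure such as Python's difflib.SequenceMatcher.ratio(); the exact metric and any preprocessing used in the paper are not specified here, and the example prompts are invented.

```python
from difflib import SequenceMatcher

def similarity_ratio(prev_prompt: str, curr_prompt: str) -> float:
    """Sequence similarity in [0, 1]; values near 1 indicate a small, incremental edit."""
    return SequenceMatcher(None, prev_prompt, curr_prompt).ratio()

# Hypothetical session: successive versions of the same prompt.
session = [
    "Summarize the document below in 3 bullet points.\n\nDocument: ...",
    "Summarize the document below in 3 concise bullet points.\n\nDocument: ...",
    "You are a helpful analyst. Summarize the document below in 3 concise bullet points.\n\nDocument: ...",
]

ratios = [similarity_ratio(a, b) for a, b in zip(session, session[1:])]
print([round(r, 2) for r in ratios])  # incremental edits cluster toward 1.0
```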

Parameter changes are widespread: 93% of sessions include at least one change to model parameters, most frequently switching the target model, altering max-token settings, or adjusting repetition penalties. Sessions involved a mean of 3.6 different models, suggesting a comparative, exploratory approach to model selection (Figures 3 and 4).

Figure 3: Frequency of inference parameter changes across sessions, with model selection, token limits, and repetition penalties as predominant targets.

Figure 4: Distribution of the number of models used per session, supporting observations of multi-model comparison.
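
Counts like these could be derived by diffing the inference settings attached to successive prompt submissions. A minimal sketch, assuming a hypothetical per-submission record with model_id, max_tokens, and repetition_penalty fields (the field names are illustrative, not the paper's schema):

```python
from collections import Counter
from typing import Any

def parameter_changes(submissions: list[dict[str, Any]]) -> Counter:
    """Count which inference parameters changed between successive submissions."""
    changed = Counter()
    for prev, curr in zip(submissions, submissions[1:]):
        for key in set(prev) | set(curr):
            if prev.get(key) != curr.get(key):
                changed[key] += 1
    return changed

# Hypothetical session log (illustrative values only).
session = [
    {"model_id": "model-a", "max_tokens": 200, "repetition_penalty": 1.0},
    {"model_id": "model-b", "max_tokens": 200, "repetition_penalty": 1.0},
    {"model_id": "model-b", "max_tokens": 500, "repetition_penalty": 1.2},
]

print(parameter_changes(session))             # which parameters changed, and how often
print(len({s["model_id"] for s in session}))  # distinct models used in the session
```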

Prompt Component Editing Behavior

The qualitative coding reveals that practitioners focus primarily on two components: context and task instructions. Context encompasses embedded examples, grounding documents, and input queries; it is both the most frequently edited component and a key lever for influencing model behavior. Task instructions, which describe the goal or output specifics, are edited less often than context but remain central to the iterative process (Figure 5).

Figure 5: Frequency of edits across prompt components, with context dominating and task instructions following.

Edits are predominantly modifications (refined phrasing that preserves meaning), followed by additions, changes (altering meaning), removals, and formatting edits. Pairing edit types with prompt components further clarifies prevailing patterns: context additions, task-instruction modifications, and label modifications are among the most common (Figure 6).

Figure 6: Top prompt component and edit type combinations observed, illustrating prevalent editing behaviors.
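
These pairings amount to a tally over coded edit records. A minimal sketch of that bookkeeping, using component and edit-type labels drawn from the coding scheme described above but with an assumed record structure:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PromptEdit:
    """One coded edit: which prompt component was touched and how."""
    component: str  # e.g. "context", "task instruction", "label", "persona"
    edit_type: str  # e.g. "modification", "addition", "change", "removal", "formatting"

# Hypothetical coded edits for a single session.
edits = [
    PromptEdit("context", "addition"),
    PromptEdit("task instruction", "modification"),
    PromptEdit("label", "modification"),
    PromptEdit("context", "removal"),
]

pair_counts = Counter((e.component, e.edit_type) for e in edits)
for (component, edit_type), count in pair_counts.most_common():
    print(f"{component} / {edit_type}: {count}")
```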

Edits are not exclusively sequential: 22% of edits involve multiple simultaneous changes, most of which include context. Nearly half of these multi-edits combine context and instruction edits, while 11% combine context and label edits. The frequency of multi-edits and parameter adjustments underscores the complexity and inefficiency of the current iterative paradigm.

Rollbacks, where practitioners undo or redo earlier edits, account for 11% of prompt edits. They are notably frequent for components that are less commonly edited, such as handle-unknown instructions (40% rollback rate), output length (25%), and persona (18%), suggesting uncertainty about their impact or adverse effects when changed.
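
One way to operationalize a rollback is an edit that returns a component to text it held earlier in the session; the per-component rate is then rollbacks divided by edits to that component. A minimal sketch under that assumed definition (the example values are invented):

```python
from collections import Counter, defaultdict

def rollback_rates(edits: list[tuple[str, str]]) -> dict[str, float]:
    """edits: (component, new_text) pairs in session order.
    An edit counts as a rollback if the component returns to previously seen text."""
    seen = defaultdict(set)                   # component -> texts observed so far
    edit_counts, rollbacks = Counter(), Counter()
    for component, text in edits:
        edit_counts[component] += 1
        if text in seen[component]:
            rollbacks[component] += 1
        seen[component].add(text)
    return {c: rollbacks[c] / edit_counts[c] for c in edit_counts}

# Hypothetical edits to a persona instruction that is tried and then reverted.
print(rollback_rates([
    ("persona", "You are a financial analyst."),
    ("persona", "You are a terse financial analyst."),
    ("persona", "You are a financial analyst."),   # rollback
]))
```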

Patterns in Context, Instructions, and Labels

Context editing emerges as dominant due to both dialog simulation and example manipulation. The interface appends generated output directly to the prompt input, which facilitates iterative elaboration on, or removal of, conversational turns and grounding data.

Editing of instructions, especially task modifications, aligns with trial-and-error practices. Variants include rewording between different formulations (commands, questions, descriptions) and adjusting detail or structure. Surprisingly, edits to secondary instruction components (output format, inclusion rules, persona, handle-unknown, output length) are relatively infrequent, potentially due to task-domain conventions or standardized requirements.

Label editing is notably common. Labels (identifiers and tags) delineate structure within prompts (instructions, context, examples, output) and serve as constructs for more precise control; users frequently modify output labels in an attempt to constrain model generation behavior, as illustrated below.
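
To make the role of labels concrete, here is a hypothetical labeled prompt of the general kind the paper describes. The specific label strings are illustrative only; the trailing output label ("Answer:") is the sort of element practitioners edit to steer generation.

```python
# Hypothetical labeled prompt; the label strings themselves are illustrative.
prompt = """Instruction: Answer the question using only the document below.
If the answer is not in the document, reply "unknown".

Document: {document}

Example question: What year was the policy introduced?
Example answer: 2019

Question: {question}
Answer:"""  # practitioners often tweak this final output label to constrain generation

print(prompt.format(document="...", question="..."))
```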

Implications for Prompt Engineering Tools and Practices

Observed behaviors (frequent rollbacks, multi-component edits, parameter switching, and heavy reliance on context) highlight cognitive challenges and inefficiencies in prompt iteration. Existing tooling, including visual prompt engineering environments and GUI-based frameworks, partially addresses these gaps but lacks robust version control and edit-impact tracking tailored to prompt engineering's requirements.

The paper suggests that systematic support for prompt debugging and testing could improve productivity, for example via enhanced edit histories, structured prompting frameworks, or semi-automated variation authoring. Standardizing structural components (labels, formatting) may yield further gains, especially in enterprise applications requiring consistent, document-grounded or multi-turn prompts.

These findings call for both tooling innovation (richer interface support, systematic experiment tracking, composable prompt frameworks) and further research into how prompt structure, context management, and edit impact feedback can be optimized for enterprise prompt engineering.
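
As one illustration of the tooling direction, a prompt-editing environment could maintain a lightweight edit history that supports the revert behavior observed in the sessions. This is an entirely hypothetical sketch, not an existing tool's API:

```python
from dataclasses import dataclass, field

@dataclass
class PromptHistory:
    """Append-only history of prompt versions with simple revert support."""
    versions: list[str] = field(default_factory=list)

    def commit(self, prompt: str) -> int:
        """Record a new prompt version and return its id."""
        self.versions.append(prompt)
        return len(self.versions) - 1

    def revert(self, version_id: int) -> str:
        """Re-commit an earlier version, mirroring the rollbacks seen in sessions."""
        prompt = self.versions[version_id]
        self.versions.append(prompt)
        return prompt

    def current(self) -> str:
        return self.versions[-1]

history = PromptHistory()
v0 = history.commit("Summarize the report in 3 bullets.")
history.commit("Summarize the report in 3 bullets. Use a formal tone.")
history.revert(v0)        # tone instruction did not help; roll back
print(history.current())
```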

Theoretical and Practical Implications, Future Directions

From a theoretical perspective, the study elucidates user mental models, revealing a heavy reliance on iterative context manipulation and task refinement as a means to control LLM behavior. The prevalence of parameter changes and rollback behaviors signals gaps in user understanding and tool-mediated feedback. The results reinforce the need for prompt engineering to be conceptualized as both a linguistic and a procedural discipline.

Practically, the insights can inform the design of LLM-driven enterprise workflows, including automated prompt optimization, modular prompt construction, editable prompt collections, and tools for comparative model evaluation. Future work may explore:

  • Prompt quality evaluation metrics based on use case taxonomy
  • Semi-automated variation exploration with impact visualization
  • Standardized template libraries for enterprise use cases
  • Enhanced support for context management and output constraint specification

Such developments may close the iteration-to-adoption loop and accelerate robust application of LLMs in knowledge-intensive domains.

Conclusion

This paper provides a granular analysis of prompt editing in enterprise settings, identifying context and task instruction as principal loci for iterative modification, with label editing and parameter changes also prevalent. Incremental edits dominate, but inefficiencies persist due to cognitive overload and lack of systematic tool support. The findings inform both the immediate development of prompt engineering assistance and the longer-term quest for standardization and automation in LLM-powered enterprise workflows.
