Prompt Engineering: Optimizing AI Queries
- Prompt engineering is the targeted design and optimization of queries that guide large pre-trained models without altering their internal parameters.
- It encompasses methods such as zero-shot, few-shot, chain-of-thought, and soft prompting to adapt AI outputs for specific tasks.
- Its applications span diverse fields like healthcare, software development, and education, boosting efficiency, transparency, and model performance.
Prompt engineering is the targeted design, structuring, and optimization of input queries—prompts—to guide the behavior of large pre-trained language, vision, or multimodal models without changing their underlying parameters. As large foundation models now serve as the backbone for a growing range of applications, prompt engineering has emerged as a critical mediator between domain-specific intent and general-purpose AI capability. Its impact is evident across domains such as business process management, healthcare, requirements engineering, education, software development, and computer vision. Prompt engineering encompasses an array of methodologies, from simple template-based instructions and manually authored examples to learnable soft-prompt tuning and automated, interactive, or ensemble-driven frameworks. The approach enables model adaptation to downstream tasks with little or no additional model retraining, providing efficiency, flexibility, and, in many cases, improved transparency, accessibility, and explainability.
1. Fundamental Concepts and Methodologies
Prompt engineering is defined as the systematic development of prompts that specify the desired output or behavior of a pre-trained model at inference time, as opposed to fine-tuning model weights with gradient-based updates or supervised learning on labeled datasets (2304.07183, 2402.07927). Techniques range from manually authored, “hard” prompts to data-driven and trainable “soft” prompts operating at the embedding level (2405.01249).
Common prompting strategies and their properties include the following; a minimal sketch of several of these strategies appears after the list:
- Zero-shot prompting: Direct natural language instructions supplied with no worked examples (2402.07927). Effective for tasks well-represented in pretraining but limited in nuance for specialized tasks.
- Few-shot prompting: A small number of input–output demonstrations are embedded in the prompt, providing contextual guidance (in-context learning) (2402.07927, 2507.07682).
- Chain-of-Thought (CoT) prompting: The model is prompted with explicit reasoning steps, transforming complex tasks into sequences of intermediate steps (2402.07927, 2410.12843). Self-consistency and other variants improve reliability by aggregating multiple generated reasoning chains.
- Continuous/Soft prompting (prompt tuning): Learnable vector embeddings prepended to the input are optimized for a target task; this approach allows for parameter-efficient adaptation (2405.01249).
- Ensemble and multi-agent prompting: Multiple prompt variants are evaluated in parallel (ensembles), or several LLM instances, each with distinct prompt sets, collaborate to solve a problem (2310.14201). Aggregation functions (e.g., majority voting) combine ensemble or agent responses.
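To make these strategies concrete, the following minimal Python sketch builds zero-shot, few-shot, and chain-of-thought prompts and aggregates sampled answers by majority vote (self-consistency). The `query_model` callable is a hypothetical stand-in for any LLM API; the templates are illustrative and not drawn from the cited papers.

```python
from collections import Counter

def zero_shot(task: str, query: str) -> str:
    # Direct instruction with no worked examples.
    return f"{task}\n\nInput: {query}\nOutput:"

def few_shot(task: str, examples: list[tuple[str, str]], query: str) -> str:
    # Embed input-output demonstrations in the prompt (in-context learning).
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n\n{demos}\n\nInput: {query}\nOutput:"

def chain_of_thought(task: str, query: str) -> str:
    # Elicit explicit intermediate reasoning steps.
    return f"{task}\n\nInput: {query}\nLet's think step by step."

def self_consistency(prompt: str, query_model, n_samples: int = 5) -> str:
    # Sample several reasoning chains (query_model is assumed stochastic,
    # e.g., temperature > 0) and return the most common final answer.
    answers = [query_model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

In practice, the choice of demonstrations and the phrasing of the reasoning trigger are themselves subject to the prompt-sensitivity issues discussed in Section 3.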
A formalization appears in optimal control frameworks, where prompt engineering becomes a sequential decision-making process:

$$\max_{u_1 \in \mathcal{A}_1,\, \ldots,\, u_T \in \mathcal{A}_T} r(u_1, \ldots, u_T),$$

where $u_t$ is the prompt selected at round $t$, $\mathcal{A}_t$ represents the available prompt set at round $t$, and $r$ is a task-specific evaluation function (2310.14201).
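Under this framing, one simple instantiation is a greedy search over rounds, sketched below; the `evaluate` callable (playing the role of $r$) and the candidate prompt sets are illustrative assumptions, not the specific algorithm of 2310.14201.

```python
def greedy_prompt_search(prompt_sets, evaluate):
    """Greedy sequential prompt selection.

    prompt_sets: candidate prompt sets A_1, ..., A_T, one per round
    evaluate:    task-specific evaluation function r, mapping a
                 prompt sequence to a scalar score (assumed given)
    """
    chosen = []
    for candidates in prompt_sets:  # round t, with available set A_t
        # Greedily pick the candidate that best extends the sequence so far.
        best = max(candidates, key=lambda u: evaluate(chosen + [u]))
        chosen.append(best)
    return chosen, evaluate(chosen)
```

Greedy selection is only a heuristic; the optimal control view admits richer strategies, such as lookahead or reinforcement learning over prompt sequences.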
2. Application Domains and Use Cases
Prompt engineering has achieved considerable impact in multiple domains:
- Business Process Management (BPM): Used for process information extraction, activity recommendation, process-to-text translation, and anomaly detection without model fine-tuning. Prompts act as task instructions and can include domain definitions for accurate entity extraction (a template sketch follows this list). Key applications address data scarcity and practitioner accessibility (2304.07183).
- Healthcare and Medicine: Deployed for clinical text classification (e.g., HealthPrompt), text simplification, medical data de-identification (DeID-GPT), and QA. Manual, automated discrete, and continuous prompts have facilitated robust performance even in low-data scenarios. Chain-of-Thought is particularly prevalent for tasks demanding structured reasoning (2304.14670, 2405.01249).
- Requirements Engineering: Enables elicitation, validation, traceability, and model/code generation in software engineering workflows. Few-shot and chain-of-thought prompts are used for requirements extraction, conflict detection, and downstream artifact generation (2507.07682, 2311.03359).
- Software Engineering and Code Generation: Prompt engineering enables code summarization, generation, translation, and secure coding workflows. Task-specific prompting and iterative refinement approaches (including Chain-of-Thought and recursive critique loops) reduce vulnerability rates and improve code quality (2310.10508, 2502.06039).
- Vision and Multimodal Models: Visual prompt engineering adjusts input signals (points, bounding boxes, trainable embeddings) for segmentation, zero-shot classification, and multi-modal retrieval, reducing reliance on expensive fine-tuning (2307.00855).
- Education: Prompt engineering provides structured strategies for scaffolding student learning, with applications in formative assessment, self-reflection, and critical thinking in medical curricula (2308.11628, 2408.07302).
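As a concrete illustration of the BPM use case above, the sketch below assembles an extraction prompt that carries both the task instruction and a domain definition; the definition text, labels, and example are hypothetical.

```python
# Hypothetical domain definition used to ground the frozen model's labels.
ACTIVITY_DEFINITION = (
    "An 'activity' is a single unit of work performed within a business "
    "process, typically expressed as a verb-object phrase."
)

def extraction_prompt(definition: str, text: str) -> str:
    # Task instruction + domain definition + input: a definition-augmented
    # entity-extraction prompt for process descriptions.
    return (
        "Task: Extract all activities from the process description below.\n"
        f"Definition: {definition}\n"
        f"Process description: {text}\n"
        "Activities (one per line):"
    )

print(extraction_prompt(
    ACTIVITY_DEFINITION,
    "The clerk checks the invoice and then forwards it to accounting.",
))
```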
3. Challenges, Evaluation, and Optimization
Several challenges persist across domains:
- Representation bottlenecks: Mapping non-sequential or structured data (e.g., BPMN diagrams, clinical tables) into linear prompt tokens while respecting prompt-length limits and model input constraints (2304.07183, 2507.07682).
- Prompt sensitivity and transferability: Prompts effective for one model or version may not transfer across architectures or scales, necessitating systematic evaluation and benchmarking (2304.07183, 2403.08950, 2402.07927).
- Evaluation gaps: Lack of standardized, non-prompt baselines (e.g., traditional fine-tuning); prompt engineering studies often do not report model-independent comparisons, making it difficult to isolate the effect of the prompt itself (2405.01249).
- Iterative and collaborative design: Effective prompt engineering involves incremental, user-driven edits, often with significant cognitive burden and trial-and-error; support for debugging, rollback, and multi-edit tracking is crucial for enterprise adoption (2403.08950).
- Optimization: Prompt engineering is rarely “set-and-forget”; both manual and automated (search, reinforcement learning) optimization loops are needed (2304.14670, 2311.05661, 2407.11000), as sketched after this list. Structured meta-prompting, as in PE2, augments the prompt engineering process itself with explicit reasoning guidance to outperform standard baseline prompts (2311.05661).
- Security: Proactive and post-hoc prompt modifications (e.g., security-aware prefixes, recursive improvement strategies) reduce code vulnerabilities significantly but require trade-offs with task pass rates and efficiency (2502.06039).
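The sketch below shows one minimal automated loop in the hill-climbing style; the `mutate` and `score` callables (e.g., paraphrasing and dev-set accuracy) are assumed to be supplied by the caller, and this is not the PE2 meta-prompting procedure itself.

```python
def optimize_prompt(seed_prompt: str, mutate, score, iterations: int = 20):
    """Hill-climbing prompt optimization.

    mutate: produces a prompt variant (e.g., paraphrase, example reorder);
            assumed to be provided by the caller
    score:  evaluates a prompt on a held-out dev set; higher is better
    """
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(iterations):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep only improving variants
            best, best_score = candidate, candidate_score
    return best, best_score
```

Search- and RL-based methods replace the blind mutation step with informed proposals, while meta-prompting approaches such as PE2 instead ask an LLM to critique and rewrite the prompt.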
4. Best Practices, Reporting Guidelines, and Responsible Design
A growing consensus emphasizes responsibility and systematic practice:
- Prompt documentation and management: Version history and transparent evaluation support reproducibility, accountability, and compliance with emerging regulatory standards (e.g., EU AI Act) (2403.08950, 2504.16204); a minimal record sketch follows this list.
- Ethical and societal considerations: Embedding fairness, legal compliance, and inclusivity into prompt design and evaluation (e.g., counterfactual augmentation, bias checks, stakeholder participation) is part of the “Responsibility by Design” paradigm (2504.16204).
- Reporting standards: Explicit documentation of task language, LLM status (fine-tuned or frozen), prompt optimization methodology, and baseline comparisons is recommended to enhance rigor (2405.01249, 2507.07682).
- Taxonomies and hybrid frameworks: Classification schemes are suggested to link technique-oriented prompt templates (e.g., few-shot, CoT) to domain-specific tasks and roles, facilitating the systematization and scaling of prompt engineering workflows (2507.07682, 2402.07927, 2410.12843).
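A minimal sketch of a versioned prompt record supporting these practices appears below; the field names are illustrative and not mandated by any standard or by the cited papers.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PromptRecord:
    """Versioned prompt record for reproducibility and auditability."""
    prompt_id: str
    version: int
    template: str
    model: str                # target model identifier
    model_frozen: bool        # fine-tuned or frozen, per reporting standards
    task_language: str
    optimization_method: str  # e.g., "manual iteration", "search-based"
    baseline_comparison: str  # e.g., "fine-tuned encoder baseline"
    eval_score: float
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = PromptRecord(
    prompt_id="activity-extraction",
    version=3,
    template="Task: Extract all activities ...",
    model="example-llm",
    model_frozen=True,
    task_language="en",
    optimization_method="manual iteration",
    baseline_comparison="rule-based extractor",
    eval_score=0.82,
)
```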
5. Vocabulary, Specificity, and Domain Knowledge Integration
Research into vocabulary specificity shows nuanced findings:
- A synonymization framework quantifies the specificity of prompt words via dedicated metrics for nouns and verbs, revealing an optimal specificity range for maximizing domain-specific model performance: prompts that are too generic or too specific can both diminish accuracy, especially in STEM and legal reasoning tasks (2505.17037).
- Automated synonym replacement with WordNet and part-of-speech tagging further demonstrates that well-chosen vocabulary, rather than maximal specificity, yields superior LLM performance, underlining the importance of controlled prompt editing and vocabulary selection; the candidate-generation step is sketched below.
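A minimal sketch of the WordNet candidate-generation step follows, assuming NLTK with its standard corpora installed; the selection of the optimally specific synonym (the metric itself) is defined in 2505.17037 and omitted here.

```python
import nltk
from nltk.corpus import wordnet

# One-time downloads on first run (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

def synonym_candidates(prompt: str) -> dict[str, list[str]]:
    """Map each noun and verb in the prompt to its WordNet synonyms,
    from which a specificity-controlled replacement can be chosen."""
    tagged = nltk.pos_tag(nltk.word_tokenize(prompt))
    pos_map = {"NN": wordnet.NOUN, "VB": wordnet.VERB}
    candidates = {}
    for word, tag in tagged:
        wn_pos = pos_map.get(tag[:2])  # restrict to nouns and verbs
        if wn_pos is None:
            continue
        synonyms = {
            lemma.replace("_", " ")
            for synset in wordnet.synsets(word, pos=wn_pos)
            for lemma in synset.lemma_names()
            if lemma.lower() != word.lower()
        }
        if synonyms:
            candidates[word] = sorted(synonyms)
    return candidates

print(synonym_candidates("Summarize the contract and list every obligation."))
```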
6. Professionalization, Skill Requirements, and Human Factors
Prompt engineering is emerging as a distinct professional discipline:
- Skill set differentiation: Prompt engineers are expected to combine AI/NLP domain expertise (22.8%), prompt design skills (18.7%), communication and collaboration abilities (21.9%), and creative problem-solving (15.8%), distinguishing them from traditional data scientists or ML engineers (2506.00058).
- Iterative, user-centered workflows: Human-in-the-loop prompt optimization (conversational editing, debugging, and feedback) often outperforms both automated baseline and fine-tuned systems, particularly in subjective or context-sensitive tasks such as code translation and summarization (2310.10508, 2408.04560).
- Educational training: Dedicated instruction in prompt engineering improves both AI knowledge and prompt ability among students, facilitating a shift from intuitive to systematic LLM usage across academic domains (2408.07302).
7. Future Directions and Open Research Problems
Key avenues for future investigation include:
- Automated and agentic prompt engineering: Autonomous frameworks, such as APET, enable self-improving models that select, critique, and refine prompts without human intervention, expanding LLM utility but highlighting current limits for complex or strategic tasks (2407.11000, 2311.05661).
- Robustness, fairness, and interpretability: Developing verification, transparency, and bias mitigation methods for both prompt templates and model outputs remains a critical area (2402.07927, 2410.12843).
- Adaptive, multimodal, and community-based workflows: Roadmaps call for integrating visual, structured, and multilingual data into prompt engineering flows, creating standardized benchmarks, and establishing shared reporting and evaluation ecosystems (2507.07682, 2307.00855).
- Interdisciplinary, responsible paradigms: Further synthesis of technical, ethical, legal, and sociotechnical considerations is needed for scalable and trustworthy AI deployment (2504.16204, 2410.12843).
In summary, prompt engineering is central to the practical deployment of large models across diverse fields, bridging technical innovation and domain demands through systematic, optimized, and increasingly responsible design of model inputs. As research expands, new methodologies, rigorous evaluation, and emerging professional norms will continue to shape prompt engineering as a foundational pillar in the AI landscape.