
Automatic Prompt Engineering (APE)

Updated 12 January 2026
  • Automatic Prompt Engineering (APE) is a method that algorithmically synthesizes effective natural-language prompts for LLMs using minimal task examples.
  • It generates candidate prompts through diverse sub-shot sampling and selects the optimal one via in-memory string centrality using the Jaro-Winkler metric.
  • APE achieves competitive performance in tasks like cryptic column name expansion without relying on hand-crafted seeds or model tuning.

Automatic Prompt Engineering (APE) refers to the algorithmic synthesis of natural-language prompts that optimize the outputs of LLMs for specific downstream tasks, without relying on hand-tuned templates, explicit task cues, or parameter tuning. In contrast to manual prompt engineering, which depends on human intuition, domain knowledge, and labor-intensive iteration, APE systems autonomously generate, evaluate, and select prompts that yield superior model predictions. Recent work demonstrates that minimalist, tuning-free APE paradigms can produce task-agnostic instruction prompts rivaling or outperforming more complex, tuning-dependent frameworks for real-world applications such as cryptic column name expansion (CNE) and information extraction in both English and German (Chowdhury et al., 6 Jan 2026).

1. Principles and Goals of APE

The central goal of APE is to identify a prompt $P^*$ such that, when prepended to an input $x$ and fed to a fixed LLM, the output $y = \text{LLM}(P^*; x)$ is likely to solve the task as specified by a small set of input–output examples $\{(x_k, y_k)\}$. The challenge is to automate prompt synthesis subject to key constraints:

  • No hand-crafted seed prompt beyond a generic, task-agnostic meta-prompt.
  • No model tuning: Model weights and hyperparameters are untouched; there are no training/validation splits for prompt selection.
  • No extra scoring calls: Candidate prompt selection cannot rely on additional LLM queries for scoring, ranking, or preference feedback.
  • Task-agnostic and language-agnostic: The framework should generalize across tasks (e.g., CNE, triple extraction, classification) and languages (e.g., English, German) without domain adaptation or manual localization.
  • Efficiency and minimalism: All steps should use a small pool of demonstration examples (typically $K \approx 8$–$10$), a fixed meta-prompt, and a single LLM instance with multinomial sampling.

The motivating application in (Chowdhury et al., 6 Jan 2026) is the cryptic column name expansion problem for tabular data, but the framework is in principle applicable wherever task behavior can be described by few-shot $(x_k, y_k)$ pairs.
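For concreteness, such a seed pool for CNE can be represented as a small list of cryptic-name/expansion pairs. The pairs below are invented for illustration and are not taken from the paper's datasets.

```python
# Illustrative few-shot seed pool for cryptic column name expansion (CNE).
# The (x_k, y_k) pairs are invented examples, not drawn from the paper.
SEED_EXAMPLES = [
    ("cust_nm",      "customer name"),
    ("ord_dt",       "order date"),
    ("tot_amt_eur",  "total amount in euros"),
    ("del_addr_ln1", "delivery address line 1"),
    ("inv_no",       "invoice number"),
    ("pymt_stat",    "payment status"),
    ("prod_cat_cd",  "product category code"),
    ("shpmt_wt_kg",  "shipment weight in kilograms"),
]  # K = 8 demonstration pairs, in line with the K ≈ 8–10 used in the paper
```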

2. APE Workflow: Sampling and Centrality Ranking

APE systems described in (Chowdhury et al., 6 Jan 2026) proceed in two phases:

A. Candidate Prompt Generation

  • Meta-prompt template: A generic prompt, e.g., “I gave a friend an instruction. Based on the instruction he produced the following input and output pairs: Input: ... Output: ... Complete the following text. The instruction was to <COMPLETE>”
  • Diverse sub-shot sampling: From the $K$ seed examples, construct three overlapping subsets $A$, $B$, $C$ (each $4$–$5$ pairs, with $C$ mixing examples from $A$ and $B$).
  • Prompt completion: Each subset fills the meta-prompt, and the LLM generates $N \approx 10$ continuations per subset (multinomial sampling, $T = 0.8$, top-$p = 0.9$), for a total of $M = 3N \sim 30$ candidate prompts after postprocessing (see the sketch after this list).
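The generation phase can be sketched as follows. The subset construction and the `sample_completion` callable are illustrative assumptions (any LLM client with temperature and top-p sampling would do); the meta-prompt text and sampling settings follow the description above.

```python
import random

# Generic meta-prompt from the paper (instruction-induction style); the helper
# functions and their names below are illustrative, not the authors' code.
META_PROMPT = (
    "I gave a friend an instruction. Based on the instruction he produced "
    "the following input and output pairs:\n{pairs}\n"
    "Complete the following text. The instruction was to"
)

def build_subsets(seed_examples, rng=None):
    """Form the three overlapping sub-shot subsets A, B, C (4-5 pairs each),
    with C mixing examples drawn from A and B."""
    rng = rng or random.Random(0)
    shuffled = list(seed_examples)
    rng.shuffle(shuffled)
    subset_a = shuffled[:5]
    subset_b = shuffled[-5:]
    subset_c = subset_a[:2] + subset_b[:3]   # C overlaps both A and B
    return [subset_a, subset_b, subset_c]

def format_pairs(pairs):
    return "\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)

def generate_candidates(seed_examples, sample_completion, n_per_subset=10):
    """Fill the meta-prompt with each subset and draw N continuations per
    subset via multinomial sampling (T = 0.8, top-p = 0.9), yielding
    M = 3N ~ 30 candidate prompts before postprocessing."""
    candidates = []
    for subset in build_subsets(seed_examples):
        prompt = META_PROMPT.format(pairs=format_pairs(subset))
        for _ in range(n_per_subset):
            # `sample_completion` stands in for whatever LLM client is used;
            # it is an assumed interface, not an API from the paper.
            completion = sample_completion(prompt, temperature=0.8, top_p=0.9)
            candidates.append(completion.strip())
    return candidates
```

With the illustrative SEED_EXAMPLES above, `generate_candidates(SEED_EXAMPLES, sample_completion)` yields the candidate pool that the centrality ranking in the next subsection operates on.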

B. Prompt Selection via String Centrality

  • For all pairs $(p_i, p_j)$ among the $M$ candidates, compute the normalized Jaro-Winkler similarity $\mathrm{JW}(p_i, p_j)$.
  • For each candidate $p_i$, define its centrality:

$$S(p_i) = \frac{1}{M-1} \sum_{j \neq i} \mathrm{JW}(p_i, p_j)$$

  • Select the “central” prompt:

$$P^* = \arg\max_{p_i} S(p_i)$$

  • Crucially, this ranking requires only in-memory string similarity and no further LLM calls.

Manual inspection confirmed that centrality-selected prompts are generally concise, precise, and free of extraneous instructions, while outliers are verbose or inconsistent.
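A minimal implementation of the centrality-based selection, assuming the jellyfish package for Jaro-Winkler similarity, might look like the following; any equivalent string-similarity routine could be substituted.

```python
import jellyfish  # assumed dependency (pip install jellyfish) for Jaro-Winkler

def centrality_scores(candidates):
    """Average pairwise Jaro-Winkler similarity of each candidate prompt
    against all others; higher scores mark more 'central' prompts."""
    m = len(candidates)
    scores = []
    for i, p_i in enumerate(candidates):
        total = sum(
            jellyfish.jaro_winkler_similarity(p_i, p_j)
            for j, p_j in enumerate(candidates)
            if j != i
        )
        scores.append(total / (m - 1))
    return scores

def select_central_prompt(candidates):
    """Return P*, the candidate maximizing S(p_i); the ranking uses only
    in-memory string similarity and no additional LLM calls."""
    scores = centrality_scores(candidates)
    best = scores.index(max(scores))
    return candidates[best]
```

Applied to the roughly 30 candidates from the generation phase, `select_central_prompt(candidates)` returns $P^*$ without any further LLM queries.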

3. Mathematical Framework

Let $\{p_1, \ldots, p_M\}$ be the candidate prompts, and define the prompt similarity score as the aggregated Jaro-Winkler similarity:

$$S(p_i) = \frac{1}{M-1} \sum_{j=1,\, j \neq i}^{M} \mathrm{JW}(p_i, p_j)$$

The final selected prompt $P^*$ maximizes this score:

$$P^* = \arg\max_{i} S(p_i)$$

No prompt is selected based on its downstream task performance during this phase; this strictly enforces the constraint against extra LLM scoring.

4. Empirical Results and Comparative Analysis

The APE system of (Chowdhury et al., 6 Jan 2026) was evaluated on several datasets in both English and German for CNE:

Overall accuracy (%):

System          German SAP   CDO_435   Tele_1186
InstInduc            21.08     48.11       46.77
APE Zeroshot         41.13     79.95       68.92
TextGrad             48.11     72.17       59.04
DSPy                 51.89     69.34       75.00
Our APE              51.89     82.61       70.73

Notable observations:

  • On German SAP, Our APE ties with DSPy at $51.89\%$ and outperforms all other baselines by at least $3.8$ percentage points.
  • On English CDO_435, it achieves $82.61\%$ accuracy, $+2.66$ points over APE Zeroshot.
  • On Tele_1186, it reaches $70.73\%$, second to DSPy but $+1.8$ points over APE Zeroshot.
  • It outperforms or matches methods that require model tuning, extra validation, or complex pipelines.
  • All results are reported as overall accuracy (number of correct expansions divided by total cryptic columns), with a match defined by Jaro-Winkler similarity $\ge 0.85$ (a small scoring sketch follows this list).
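As a concrete reading of this metric, a minimal scoring routine might look like the sketch below (again assuming jellyfish; the 0.85 threshold follows the paper's setup, while the lowercasing is an illustrative normalization choice, not specified in the paper).

```python
import jellyfish  # assumed dependency for the Jaro-Winkler match criterion

def cne_accuracy(predictions, references, threshold=0.85):
    """Overall accuracy for column name expansion: a prediction counts as
    correct when its Jaro-Winkler similarity to the gold expansion is at
    least 0.85, matching the paper's evaluation criterion."""
    correct = sum(
        jellyfish.jaro_winkler_similarity(pred.lower(), gold.lower()) >= threshold
        for pred, gold in zip(predictions, references)
    )
    return correct / len(references)
```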

Ablation highlights:

  • Use of three candidate subsets (A/B/C) ensures diverse prompt contexts; omitting subset C reduces accuracy by $\sim 1.5$ points.
  • Multinomial sampling (vs. greedy decoding) is critical for candidate diversity and raises accuracy by $\sim 2$ points.
  • Using fewer demonstration examples ($K = 4$) drops performance by $\sim 4$ points; gains plateau for $K \gtrsim 12$.

5. Generalization, Language Adaptability, and Limitations

In this framework, APE transfers across languages with minimal effort: the English and German meta-prompts differ only in their connective phrases, and no manual translation or hand-crafted template engineering is required.

Further, the same approach was reported (though not detailed in (Chowdhury et al., 6 Jan 2026)) to perform strongly on triple extraction for knowledge-graph construction, indicating applicability beyond CNE.

However, several limitations persist:

  • The prompt selection metric relies only on surface-level lexical similarity; semantically rich but lexically distinct prompts may be undervalued.
  • The framework was evaluated using a single (large) LLM; extending to smaller or specialized models remains open.
  • No ablation was performed on alternative selection metrics (e.g., embedding-based centrality); richer similarity functions may further improve results at the cost of increased computation (a sketch of such a variant follows this list).
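For illustration only, a semantics-aware variant of the selection step could replace string similarity with embedding similarity, as sketched below. This is not part of the evaluated framework; the `embed` callable is an assumed interface to any sentence-embedding model.

```python
import numpy as np

def select_central_prompt_by_embedding(candidates, embed):
    """Hypothetical embedding-based centrality: pick the candidate whose mean
    cosine similarity to all other candidates is highest. `embed` is an
    assumed callable mapping a list of strings to a 2-D array of sentence
    embeddings (one row per candidate)."""
    vectors = np.asarray(embed(candidates), dtype=float)
    # Normalize rows so dot products become cosine similarities.
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = vectors @ vectors.T            # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)           # exclude self-similarity
    centrality = sims.sum(axis=1) / (len(candidates) - 1)
    return candidates[int(np.argmax(centrality))]
```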

6. Comparative Methodological Perspective

This approach establishes that highly effective task prompts can be synthesized using a minimalist, LLM-powered, sampling-and-rank scheme, without recourse to handcrafted seeds, tuning, extra validation splits, or human domain cues. The central innovations are:

  • Use solely of LLM completions over small, diverse, overlapping few-shot subsets.
  • Centrality-based prompt selection via in-memory string similarity.
  • Demonstrated competitive or superior performance relative to recent prompt optimization methods that include seed construction, model tuning, or LLM-based scoring (Chowdhury et al., 6 Jan 2026).

This paradigm underscores the potential of minimalist, combinatorial prompt synthesis frameworks for scalable, task-agnostic, and language-agnostic prompt engineering.

7. Future Directions

Key open directions include:

  • Evaluating prompt selection with richer, semantics-aware similarity metrics.
  • Confirming generalizability on smaller, task-specific, or multilingual LLMs.
  • Extending centrality-based synthesis to optimization-in-the-loop frameworks that incorporate limited scoring or reasoning feedback.

A plausible implication is that centrality-based APE can serve as a low-cost first-pass optimizer before resorting to more computationally intensive tuning or ensemble-based prompt selection methods.


For a comprehensive and technical description, see "Automatic Prompt Engineering with No Task Cues and No Tuning" (Chowdhury et al., 6 Jan 2026).

References

Chowdhury et al. (6 Jan 2026). "Automatic Prompt Engineering with No Task Cues and No Tuning."