Automatic Prompt Engineering (APE)
- Automatic Prompt Engineering (APE) is a method that algorithmically synthesizes effective natural-language prompts for LLMs using minimal task examples.
- It generates candidate prompts through diverse sub-shot sampling and selects the optimal one via in-memory string centrality using the Jaro-Winkler metric.
- APE achieves competitive performance in tasks like cryptic column name expansion without relying on hand-crafted seeds or model tuning.
Automatic Prompt Engineering (APE) refers to the algorithmic synthesis of natural-language prompts that optimize the outputs of LLMs for specific downstream tasks, without relying on hand-tuned templates, explicit task cues, or parameter tuning. In contrast to manual prompt engineering, which depends on human intuition, domain knowledge, and labor-intensive iteration, APE systems autonomously generate, evaluate, and select prompts that yield superior model predictions. Recent work demonstrates that minimalist, tuning-free APE paradigms can produce task-agnostic instruction prompts rivaling or outperforming more complex, tuning-dependent frameworks for real-world applications such as cryptic column name expansion (CNE) and information extraction in both English and German (Chowdhury et al., 6 Jan 2026).
1. Principles and Goals of APE
The central goal of APE is to identify a prompt $p$ such that, when prepended to an input $x$ and fed to a fixed LLM, the model's output is likely to solve the task as specified by a small set of input–output examples $\{(x_i, y_i)\}$. The challenge is to automate prompt synthesis subject to key constraints:
- No hand-crafted seed prompt beyond a generic, task-agnostic meta-prompt.
- No model tuning: Model weights and hyperparameters are untouched; there are no training/validation splits for prompt selection.
- No extra scoring calls: Candidate prompt selection cannot rely on additional LLM queries for scoring, ranking, or preference feedback.
- Task-agnostic and language-agnostic: The framework should generalize across tasks (e.g., CNE, triple extraction, classification) and languages (e.g., English, German) without domain adaptation or manual localization.
- Efficiency and minimalism: All steps use a small pool of demonstration examples (typically around $10$), a fixed meta-prompt, and a single LLM instance with multinomial sampling.
The motivating application in (Chowdhury et al., 6 Jan 2026) is the cryptic column name expansion problem for tabular data, but the framework is in principle applicable wherever task behavior can be described by few-shot pairs.
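For concreteness, a CNE demonstration pool pairs cryptic tabular column names with their full expansions. The pairs below are purely illustrative and are not taken from the paper's datasets:

```python
# Hypothetical few-shot demonstration pool for cryptic column name expansion (CNE).
# Each pair maps a cryptic column name to its intended human-readable expansion.
DEMO_POOL = [
    ("cust_nm",  "customer name"),
    ("ord_dt",   "order date"),
    ("qty_shp",  "quantity shipped"),
    ("prod_cat", "product category"),
    ("rev_amt",  "revenue amount"),
    ("emp_id",   "employee identifier"),
    ("addr_ln1", "address line one"),
    ("del_flg",  "deletion flag"),
    ("crt_ts",   "creation timestamp"),
    ("upd_usr",  "last updating user"),
]
```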
2. APE Workflow: Sampling and Centrality Ranking
APE systems described in (Chowdhury et al., 6 Jan 2026) proceed in two phases:
A. Candidate Prompt Generation
- Meta-prompt template: A generic, task-agnostic prompt, e.g., “I gave a friend an instruction. Based on the instruction he produced the following input and output pairs: Input: ... Output: ... Complete the following text. The instruction was to <COMPLETE>”
- Diverse sub-shot sampling: From the seed examples, construct three overlapping subsets $A$, $B$, $C$ (each containing $4$–$5$ pairs, with deliberate overlap between them).
- Prompt completion: Each subset fills the meta-prompt, and the LLM generates multiple continuations per subset via multinomial sampling; after light postprocessing, these continuations form the pool of candidate prompts (see the sketch below).
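The following is a minimal sketch of phase A under stated assumptions: `sample_llm` stands in for any multinomial-sampling completion call (it is not an API from the paper), and the subset size, number of subsets, and completions per subset are illustrative defaults rather than the paper's exact settings.

```python
import random
from typing import Callable, List, Tuple

# Generic, task-agnostic meta-prompt quoted in the text above.
META_PROMPT = (
    "I gave a friend an instruction. Based on the instruction he produced "
    "the following input and output pairs:\n{pairs}\n"
    "Complete the following text. The instruction was to <COMPLETE>"
)

def make_subsets(demo_pool: List[Tuple[str, str]], k: int = 5,
                 num_subsets: int = 3, seed: int = 0) -> List[List[Tuple[str, str]]]:
    """Draw overlapping sub-shot subsets (A, B, C) of up to k demonstration pairs each."""
    rng = random.Random(seed)
    return [rng.sample(demo_pool, min(k, len(demo_pool))) for _ in range(num_subsets)]

def generate_candidates(demo_pool: List[Tuple[str, str]],
                        sample_llm: Callable[[str, int], List[str]],
                        completions_per_subset: int = 4) -> List[str]:
    """Fill the meta-prompt with each subset and collect sampled instruction candidates.

    sample_llm(prompt, n) is any stochastic (temperature > 0) completion call that
    returns n continuations; it is a placeholder, not the paper's interface.
    """
    candidates = []
    for subset in make_subsets(demo_pool):
        pairs = "\n".join(f"Input: {x} Output: {y}" for x, y in subset)
        prompt = META_PROMPT.format(pairs=pairs)
        for completion in sample_llm(prompt, completions_per_subset):
            candidates.append(completion.strip())  # light postprocessing
    return candidates
```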
B. Prompt Selection via String Centrality
- For all pairs $(p_i, p_j)$ among the candidates, compute the normalized Jaro-Winkler distance $d_{\mathrm{JW}}(p_i, p_j) \in [0, 1]$.
- For each candidate $p_i$, define its centrality as its aggregate similarity to all other candidates, $C(p_i) = \sum_{j \neq i} \bigl(1 - d_{\mathrm{JW}}(p_i, p_j)\bigr)$.
- Select the “central” prompt $p^{*} = \arg\max_{i} C(p_i)$.
- Crucially, this ranking requires only in-memory string similarity and no further LLM calls.
Manual inspection confirmed that centrality-selected prompts are generally concise, precise, and free of extraneous instructions, while outliers are verbose or inconsistent.
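A hedged sketch of phase B follows; it assumes the `jellyfish` package for the Jaro-Winkler similarity (any equivalent string-metric library would work) and runs entirely in memory, with no further LLM calls:

```python
import jellyfish  # provides jaro_winkler_similarity; any JW implementation works

def select_central_prompt(candidates: list[str]) -> str:
    """Return the candidate with the highest summed Jaro-Winkler similarity
    to all other candidates, i.e. the most 'central' prompt string."""
    best_prompt, best_score = None, float("-inf")
    for i, p_i in enumerate(candidates):
        score = sum(
            jellyfish.jaro_winkler_similarity(p_i, p_j)
            for j, p_j in enumerate(candidates) if j != i
        )
        if score > best_score:
            best_prompt, best_score = p_i, score
    return best_prompt
```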
3. Mathematical Framework
Let $p_1, \dots, p_n$ be the candidate prompts, and define the score of each candidate as its aggregated Jaro-Winkler similarity to the others:

$$C(p_i) = \sum_{j \neq i} \mathrm{JW}(p_i, p_j), \qquad \mathrm{JW}(p_i, p_j) = 1 - d_{\mathrm{JW}}(p_i, p_j).$$

The final selected prompt maximizes this score:

$$p^{*} = \arg\max_{1 \le i \le n} C(p_i).$$
No prompt is selected based on its downstream task performance during this phase; this strictly enforces the constraint against extra LLM scoring.
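As a toy illustration of this selection rule, applying the `select_central_prompt` sketch from Section 2 to three invented candidates (the strings are not from the paper):

```python
candidates = [
    "expand each cryptic column name into its full, human-readable form",
    "expand the abbreviated column name into its full human-readable form",
    "translate the table into German",  # lexical outlier, receives a low centrality score
]
# The two near-duplicate instructions reinforce each other's centrality,
# so one of them is selected; the outlier is never chosen.
print(select_central_prompt(candidates))
```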
4. Empirical Results and Comparative Analysis
The APE system of (Chowdhury et al., 6 Jan 2026) was evaluated on several CNE datasets in both English and German; the table below reports overall accuracy in percent:
| System | German SAP | CDO_435 | Tele_1186 |
|---|---|---|---|
| InstInduc | 21.08 | 48.11 | 46.77 |
| APE Zeroshot | 41.13 | 79.95 | 68.92 |
| TextGrad | 48.11 | 72.17 | 59.04 |
| DSPy | 51.89 | 69.34 | 75.00 |
| Our APE | 51.89 | 82.61 | 70.73 |
Notable observations:
- On German SAP, Our APE ($51.89$) ties with DSPy and outperforms all other baselines by at least $3.8$ percentage points.
- On English CDO_435, it achieves the best accuracy of $82.61$, $2.66$ points above APE Zeroshot.
- On Tele_1186, it reaches $70.73$, second to DSPy but $1.81$ points above APE Zeroshot.
- It outperforms or matches methods that require model tuning, extra validation, or complex pipelines.
- All results are reported as overall accuracy (number of correct expansions divided by the total number of cryptic columns), with a match defined by Jaro-Winkler similarity above a fixed threshold; a small sketch of this metric follows the list.
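A minimal sketch of this evaluation metric is given below; lowercasing and the exact value of the match threshold are assumptions, not details taken from the paper:

```python
import jellyfish  # Jaro-Winkler implementation; any equivalent library works

def cne_accuracy(predictions: list[str], golds: list[str], threshold: float) -> float:
    """Overall accuracy: the fraction of predicted expansions whose Jaro-Winkler
    similarity to the gold expansion meets the match threshold."""
    correct = sum(
        jellyfish.jaro_winkler_similarity(pred.lower(), gold.lower()) >= threshold
        for pred, gold in zip(predictions, golds)
    )
    return correct / len(golds)
```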
Ablation highlights:
- Using three candidate subsets (A/B/C) ensures diverse prompt contexts; omitting subset C measurably reduces accuracy.
- Multinomial sampling (vs. greedy decoding) is critical for candidate diversity and yields a clear accuracy gain.
- Using fewer demonstration examples degrades performance, while gains plateau once the pool is moderately sized.
5. Generalization, Language Adaptability, and Limitations
APE in this framework generalizes immediately to any language: both English and German meta-prompts are constructed by changing only connective phrases. No manual translation or hand-crafted template engineering is required.
Further, the same approach was reported to perform strongly on triple extraction for knowledge-graph construction (details are not given in (Chowdhury et al., 6 Jan 2026)), confirming applicability beyond CNE.
However, several limitations persist:
- The prompt selection metric leverages only surface-level lexical similarity; semantically rich but lexically distinct prompts could be undervalued.
- The framework was evaluated using a single (large) LLM; extending to smaller or specialized models remains open.
- No ablation was performed on alternative selection metrics (e.g., embedding-based centrality); richer similarity functions may further improve results, at the cost of increased computation (one possible variant is sketched below).
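One way such an embedding-based variant could look (this is a hypothetical alternative, not part of the paper's method): replace the string metric with cosine similarity over sentence embeddings, e.g., via the sentence-transformers library, and keep the same centrality-based selection.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

def select_central_prompt_embedding(candidates: list[str],
                                    model_name: str = "all-MiniLM-L6-v2") -> str:
    """Embedding-based centrality: pick the candidate whose summed cosine
    similarity to all other candidates is highest (model name is an example)."""
    model = SentenceTransformer(model_name)
    emb = model.encode(candidates, normalize_embeddings=True)  # unit-norm vectors
    sims = emb @ emb.T                 # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)        # exclude self-similarity
    return candidates[int(np.argmax(sims.sum(axis=1)))]
```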
6. Comparative Methodological Perspective
This approach establishes that highly effective task prompts can be synthesized using a minimalist, LLM-powered, sampling-and-rank scheme, without recourse to handcrafted seeds, tuning, extra validation splits, or human domain cues. The central innovations are:
- Exclusive use of LLM completions over small, diverse, overlapping few-shot subsets.
- Centrality-based prompt selection via in-memory string similarity.
- Demonstrated competitive or superior performance relative to recent prompt optimization methods that include seed construction, model tuning, or LLM-based scoring (Chowdhury et al., 6 Jan 2026).
This paradigm underscores the potential of minimalist, combinatorial prompt synthesis frameworks for scalable, task-agnostic, and language-agnostic prompt engineering.
7. Future Directions
Key open directions include:
- Evaluating prompt selection with richer, semantics-aware similarity metrics.
- Confirming generalizability on smaller, task-specific, or multilingual LLMs.
- Extending centrality-based synthesis to optimization-in-the-loop frameworks that incorporate limited scoring or reasoning feedback.
A plausible implication is that centrality-based APE can serve as a low-cost first-pass optimizer before resorting to more computationally intensive tuning or ensemble-based prompt selection methods.
For a comprehensive and technical description, see "Automatic Prompt Engineering with No Task Cues and No Tuning" (Chowdhury et al., 6 Jan 2026).