
Optimization by PROmpting (OPRO)

Updated 30 December 2025
  • Optimization by PROmpting (OPRO) is a paradigm that uses large language models as discrete, iterative black-box optimizers through natural language feedback loops.
  • It integrates historical performance signals and meta-prompts to refine candidate solutions across both continuous and combinatorial search spaces.
  • Extensions like population-based and branched frameworks boost performance in tasks ranging from QA and reasoning to specialized engineering applications.

Optimization by PROmpting (OPRO) is a paradigm that leverages LLMs as iterative discrete black-box optimizers in settings where the optimization objective, search space, or feedback can be described in natural language. Unlike classical optimization frameworks reliant on explicit gradients or parametric structure, OPRO utilizes the LLM itself to propose, refine, and evaluate candidate solutions—frequently prompts or instruction templates—through a feedback loop involving historical performance signals and meta-instruction prompts. The method generalizes across both continuous and combinatorial spaces and is applicable wherever gradient-free optimization over text, discrete choices, or structurally rich instructions is required.

1. Formalism and Core Algorithmic Structure

OPRO recasts optimization as a search for the best input $x^*$ in a decision space $X$, under a black-box objective $f: X \to \mathbb{R}$:

$$x^* = \arg\min_{x \in X} f(x)$$

Each OPRO iteration consists of providing the LLM with a meta-prompt containing the problem description, a history of prior solutions and their scores, and explicit update instructions. The LLM stochastically generates new candidates, which are evaluated externally (using scorer models, simulators, or domain evaluators), and the history is updated by retaining the top-scoring solutions (Yang et al., 2023). The process continues until convergence or budget exhaustion. In prompt optimization settings, the analog is to maximize task accuracy over prompts $p \in P$:

$$p^* = \arg\max_{p \in P} S(p) \quad \text{where} \quad S(p) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{M_\theta(p, x_i) = y_i\}$$

(Zhang et al., 16 May 2024, Zehle et al., 2 Dec 2025).
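
As a concrete illustration of this objective, the sketch below computes $S(p)$ as plain accuracy of a prompted model over a labeled set; the `call_model` helper is a hypothetical stand-in for whatever scorer LLM or evaluator API is in use.

```python
# Minimal sketch of the prompt-scoring objective S(p): the fraction of
# labeled examples (x_i, y_i) the model answers correctly under prompt p.
# `call_model` is a hypothetical placeholder for the scorer LLM call.

def call_model(prompt: str, x: str) -> str:
    """Placeholder: send the prompt plus input x to the scorer LLM, return its answer."""
    raise NotImplementedError

def score_prompt(prompt: str, examples: list[tuple[str, str]]) -> float:
    """S(p) = (1/N) * sum_i 1{M_theta(p, x_i) = y_i}, i.e. plain accuracy."""
    correct = sum(call_model(prompt, x).strip() == y for x, y in examples)
    return correct / len(examples)
```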

The OPRO loop hinges on natural-language meta-prompts structured as:

  • Past solutions and scores ("trajectory")
  • Explicit problem description
  • Update guidance ("Write a new instruction with higher score…")

The underlying LLM acts as a prompt generator or proposer; performance is assessed by a black-box evaluator; top candidates are consolidated for the next round.
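
The following sketch shows one way such a loop can be organized, assuming a generic `score_fn` evaluator and a hypothetical `propose_candidates` wrapper around the optimizer LLM; it is illustrative and not the reference implementation of (Yang et al., 2023).

```python
# Illustrative OPRO-style loop: keep a trajectory of (prompt, score) pairs,
# render it into a natural-language meta-prompt, ask the optimizer LLM for new
# candidates, score them externally, and retain the top-k for the next round.
from typing import Callable

def build_meta_prompt(trajectory: list[tuple[str, float]], task_description: str) -> str:
    # List prior solutions in ascending score order so the best appear last.
    history = "\n".join(f"Instruction: {p}\nScore: {s:.3f}"
                        for p, s in sorted(trajectory, key=lambda t: t[1]))
    return (
        f"{task_description}\n\n"
        f"Previous instructions and their scores (higher is better):\n{history}\n\n"
        "Write a new instruction that is different from the ones above and "
        "achieves a higher score."
    )

def propose_candidates(meta_prompt: str, n: int) -> list[str]:
    """Placeholder: sample n candidate instructions from the optimizer LLM."""
    raise NotImplementedError

def opro(task_description: str, score_fn: Callable[[str], float],
         init_prompts: list[str], steps: int = 20, per_step: int = 8,
         top_k: int = 20) -> tuple[str, float]:
    trajectory = [(p, score_fn(p)) for p in init_prompts]
    for _ in range(steps):
        meta_prompt = build_meta_prompt(trajectory, task_description)
        for cand in propose_candidates(meta_prompt, per_step):
            trajectory.append((cand, score_fn(cand)))
        # Retain only the top-k scoring prompts as the history for the next round.
        trajectory = sorted(trajectory, key=lambda t: t[1], reverse=True)[:top_k]
    return max(trajectory, key=lambda t: t[1])
```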

2. Discrete Prompt Optimization and Population-Based Extensions

OPRO is instantiated in dedicated frameworks and optimization pipelines such as "promptolution" (Zehle et al., 2 Dec 2025), which abstracts the optimization loop into pluggable modules: LLM interface, predictor, task (objective/metric), optimizer (search algorithm), and experiment config. The optimizer may realize OPRO directly (a single meta-LLM proposer), or use population-based genetic algorithms (GA), differential evolution (DE), or cost-aware hybrid strategies (e.g., CAPO, EvoPrompt).
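
A minimal sketch of such a pluggable pipeline is given below; the class and method names are illustrative assumptions and do not reproduce the actual promptolution API.

```python
# Sketch of a modular prompt-optimization pipeline: swappable LLM interface,
# task (objective/metric), optimizer, and experiment config. All names are
# illustrative assumptions, not the promptolution package's real interfaces.
from dataclasses import dataclass
from typing import Protocol

class LLMInterface(Protocol):
    def generate(self, prompt: str, n: int = 1) -> list[str]: ...

class Task(Protocol):
    def evaluate(self, prompt: str) -> float: ...          # objective / metric

class Optimizer(Protocol):
    def step(self, llm: LLMInterface, task: Task,
             population: list[str]) -> list[str]: ...      # one search iteration

@dataclass
class ExperimentConfig:
    steps: int = 10
    population_size: int = 8

def run_experiment(llm: LLMInterface, task: Task, optimizer: Optimizer,
                   population: list[str], cfg: ExperimentConfig) -> str:
    # Iterate the chosen search algorithm and return the best surviving prompt.
    for _ in range(cfg.steps):
        population = optimizer.step(llm, task, population)[: cfg.population_size]
    return max(population, key=task.evaluate)
```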

Recent extensions include population-maintaining algorithms built on evolutionary operators. GAAPO, for example, integrates block-level crossover and eight mutation types (instruction expansion, expert-persona injection, structural variation, constraint addition, creative backstory, task decomposition, concise optimization, role assignment), and combines OPRO meta-prompt trajectory refinement with gradient and few-shot perturbation strategies (Sécheresse et al., 9 Apr 2025). Selection methods include exhaustive ranking, successive halving, and bandit-based arm selection for parallel and sample-efficient search.
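
As an example of sample-efficient selection, the sketch below implements plain successive halving over a candidate pool, assuming a hypothetical `score_on` helper that evaluates a prompt on a validation slice of a given size.

```python
# Successive halving for prompt candidates: score everyone cheaply, keep the
# better half, and double the per-candidate evaluation budget for survivors.
# `score_on` is a hypothetical helper, not taken from any cited framework.

def score_on(prompt: str, n_examples: int) -> float:
    """Placeholder: accuracy of `prompt` on the first n_examples validation items."""
    raise NotImplementedError

def successive_halving(candidates: list[str], initial_budget: int = 16) -> str:
    n = initial_budget
    while len(candidates) > 1:
        ranked = sorted(candidates, key=lambda p: score_on(p, n), reverse=True)
        candidates = ranked[: max(1, len(ranked) // 2)]   # keep the better half
        n *= 2                                            # spend more on survivors
    return candidates[0]
```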

Explicit tracking of population size and generation counts, mutation and crossover operator weights, and validation/test generalization gaps is prescribed for robust performance (Sécheresse et al., 9 Apr 2025).

3. Specialized Search Spaces and Structural Optimization

OPRO algorithms increasingly operate not solely over flat prompt text but also structured, multi-factor prompt spaces and modular pipelines.

Heuristic Prompting Strategy Search (HPSS) models prompt strategy as an 8-factor genome, searching over the combinatorial set of discrete factor choices by estimating each value's advantage to accelerate convergence (Wen et al., 18 Feb 2025). The core fitness is alignment (Spearman correlation) of LLM judgments to human scores, with the heuristic $H(s) = \sum_{i=1}^{8} A_i$ guiding both UCB-style exploration and exploitation.
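
The sketch below illustrates advantage-guided, UCB-style sampling over a factored strategy space in this spirit; the factor names, value pools, and advantage-update rule are assumptions for illustration and are not taken from the HPSS paper.

```python
# Advantage-guided sampling over an 8-factor prompt-strategy genome: each
# factor value keeps a running advantage estimate A_i, and a UCB bonus drives
# exploration of rarely tried values. Factor/value names are placeholders.
import math
from collections import defaultdict

FACTORS = {f"factor_{i}": [f"choice_{j}" for j in range(3)] for i in range(8)}

counts = defaultdict(int)       # (factor, value) -> times tried
advantage = defaultdict(float)  # (factor, value) -> running mean advantage A_i

def sample_strategy(t: int, c: float = 1.0) -> dict:
    """Pick one value per factor, trading off advantage (exploitation) vs. UCB bonus."""
    strategy = {}
    for factor, values in FACTORS.items():
        def ucb(v):
            n = counts[(factor, v)]
            bonus = c * math.sqrt(math.log(t + 1) / n) if n else float("inf")
            return advantage[(factor, v)] + bonus
        strategy[factor] = max(values, key=ucb)
    return strategy

def update(strategy: dict, fitness: float, baseline: float) -> None:
    """Move each chosen value's running advantage toward (fitness - baseline)."""
    for factor, value in strategy.items():
        key = (factor, value)
        counts[key] += 1
        advantage[key] += ((fitness - baseline) - advantage[key]) / counts[key]
```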

Automatic Multi-Branched Prompt Optimization (AMPO) expands the objective from single-flow prompts to a multi-branch decision structure, introducing pattern recognition, branch adjustment, and pruning modules to efficiently handle diverse task patterns (Yang et al., 11 Oct 2024). The optimizer iteratively refines branched prompt trees directly, combating overfitting and supporting greedy pattern selection.

ORPP restricts OPRO search to role-playing prompt subspaces, leveraging the observed inductive prior that expertise role instructions consistently elicit deeper model reasoning and creativity. Gradients are approximated implicitly via iterative feedback and candidate generation (Duan et al., 3 Jun 2025).

Complex pipelines (MIPRO) extend OPRO to multi-stage LM programs, factorizing prompt optimization into (i) module instructions and (ii) few-shot demonstrations. Credit assignment is solved by Bayesian surrogates (TPE) and mini-batch stochastic evaluations, supporting meta-optimization of proposal strategies (Opsahl-Ong et al., 17 Jun 2024).
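
A minimal sketch of surrogate-based credit assignment in this spirit is shown below, using Optuna's TPE sampler to jointly pick an instruction and a demonstration set per module and scoring each configuration on a random mini-batch; the candidate pools and evaluator are placeholders, not the MIPRO implementation.

```python
# TPE-based credit assignment over a two-module LM program: the sampler picks
# one instruction and one demo set per module; each trial is scored on a
# stochastic mini-batch. Candidate pools and the evaluator are placeholders.
import random
import optuna

INSTRUCTIONS = {"module_a": ["inst_a0", "inst_a1"], "module_b": ["inst_b0", "inst_b1"]}
DEMO_SETS = {"module_a": ["demos_a0", "demos_a1"], "module_b": ["demos_b0", "demos_b1"]}

def evaluate_program(config: dict, batch: list[int]) -> float:
    # Dummy score for the sketch; replace with the real multi-stage program run.
    return random.random()

def objective(trial: optuna.Trial) -> float:
    config = {
        module: {
            "instruction": trial.suggest_categorical(f"{module}_inst", INSTRUCTIONS[module]),
            "demos": trial.suggest_categorical(f"{module}_demos", DEMO_SETS[module]),
        }
        for module in INSTRUCTIONS
    }
    batch = random.sample(range(1000), 32)  # stochastic mini-batch of example indices
    return evaluate_program(config, batch)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```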

4. Practical Implementations and Application Domains

OPRO exhibits considerable flexibility across domains. In nuclear engineering, OPRO enables candidate fuel-lattice configuration optimization for Boiling Water Reactors, outperforming classical metaheuristics (GA) with zero hyperparameter tuning and direct plain-English meta-prompt specifications (Oktavian et al., 25 Mar 2025). In financial trading, Adaptive-OPRO extends the methodology to order-aware multi-agent frameworks, updating trading agent instructions based on windowed real-time ROI, with prompt placeholders rigorously preserved (Papadakis et al., 10 Oct 2025).

In prompt optimization for evaluators and general tasks, OPRO yields substantial alignment and accuracy gains compared to both hand-designed and previously automated methods (Wen et al., 18 Feb 2025, Zhu et al., 15 May 2025, Zhang et al., 21 Jul 2025), particularly when large LLMs are employed as both optimizers and scorers. Population/component size, search depth, and diversity are critical to efficacy and generalization.

5. Evaluation, Limitations, and Design Recommendations

Empirical evidence demonstrates that OPRO with well-configured large LLMs outperforms baselines by 5–10 points on general QA and reasoning, and shows consistent compatibility across both compact/local models and very large decoders (Zhu et al., 15 May 2025, Zhang et al., 21 Jul 2025). However, OPRO shows limited efficacy for small-scale LLMs (≤13B parameters) due to their constrained inference and optimization abilities (Zhang et al., 16 May 2024). Chain-of-thought prompting remains preferred for such models.

Trade-offs exist between population size and generations in genetic/population-based extensions; larger populations converge faster but exhibit broader generalization gaps (Sécheresse et al., 9 Apr 2025). Operator scheduling (gradient, random mutation, few-shot, OPRO) should be dynamically tuned per search phase.

Merit-guided OPRO, as instantiated in MePO, achieves robust prompt improvement via interpretable prompt merits (Clarity, Precision, Concise chain-of-thought, Preservation of information), supporting downward and upward compatibility irrespective of model size and removing cost and privacy bottlenecks associated with proprietary API-based methods (Zhu et al., 15 May 2025).

6. Future Extensions and Theoretical Perspectives

Current research suggests synergistic combinations: Monte-Carlo tree search for branch optimization (Yang et al., 11 Oct 2024), meta-optimization of proposal hyperparameters and surrogate-based credit assignment (Opsahl-Ong et al., 17 Jun 2024), multi-stage/pipeline joint optimization (Opsahl-Ong et al., 17 Jun 2024), and plug-and-play role-playing modules (Duan et al., 3 Jun 2025). Open problems persist regarding automated selection and dynamic adjustment of prompt structure versus flat rephrasing, the integration of explicit error feedback, and scalable multi-objective optimization.

Generalized OPRO workflows are increasingly formalized under unified frameworks and experiment APIs (Zehle et al., 2 Dec 2025), offering standardized modular optimization pipelines, comparative benchmarking, and reproducible experiment tracking.

7. Comparative Summary and Empirical Benchmarks

Major results from diverse benchmarks can be summarized as follows:

| Method | General QA (%) | Reasoning (%) | GSM8K (%) | SST-5 (%) |
|---|---|---|---|---|
| Baseline (Raw prompt) | 42.4 | 77.1 | 69.7 | 44.6 |
| BPO/PAS | 43.1–47.5 | 81.3 | 71.47 | 43.84 |
| OPRO | 56.44–59.78 | 79.5 | 76.9 | 56.0 |
| GAAPO | 0.60 (val) | – | – | – |
| CAPO (GA hybrid) | – | – | 93.7 | 56.3 |
| MePO (merit-guided) | – | – | 74.37 | – |
| P3 | 57.2–59.1 | 82.1 | – | – |
| AMPO | 59.78 | – | – | – |
| HPSS (LLM eval) | +30.6% rel. | – | – | – |
| ORPP (role-playing) | +1–3 points | – | – | 45.45 |

Dashes indicate metrics not reported for that method.

These quantitative evaluations indicate that, for diverse NLP/EQA/engineering tasks, OPRO and its population-based, branched, and merit-guided extensions dominate traditional chain-of-thought and static prompt baselines, confronting previous limitations in template generalization and combinatorial optimization (Yang et al., 2023, Oktavian et al., 25 Mar 2025, Wen et al., 18 Feb 2025, Zhang et al., 21 Jul 2025, Yang et al., 11 Oct 2024).


Optimization by PROmpting positions LLMs as meta-optimizers for discrete, combinatorial, and text-based search spaces, yielding state-of-the-art empirical performance via principled, modular feedback loops. Future extensions are anticipated to incorporate advanced credit assignment, branching, and context-driven synthesis, advancing prompt-based control of models across increasingly complex pipelines and application domains.
