Prompt Duel Optimizer (PDO)
- Prompt Duel Optimizer (PDO) is a label-free framework that recasts prompt selection as a dueling bandit process, enabling sample-efficient identification of high-performing prompts.
- The framework employs Double Thompson Sampling to compare candidate prompts via pairwise evaluations by an LLM judge, thereby reducing the need for costly labeled data.
- PDO integrates guided mutation to expand the prompt candidate pool, ensuring continuous exploration and practical adaptation in real-world LLM deployments.
A Prompt Duel Optimizer (PDO) is a framework for sample-efficient, label-free optimization of prompts used with LLMs and other generative systems. Formally, PDO recasts the prompt selection problem as a dueling bandit process, in which candidate prompts are compared in pairs and optimization is driven by relative preference feedback from an automated judge—frequently an LLM itself—rather than by ground-truth references or explicit supervision. This paradigm is motivated by the high sensitivity of LLMs to small prompt changes, the prohibitive cost of collecting labels at scale, and the practical need for rapid deployment and adaptation in real-world settings.
1. Problem Definition and Significance
PDO addresses the challenge that prompt effectiveness—as measured by the downstream performance of an LLM—is highly variable and often context-dependent. Traditional prompt optimization requires labeled data to evaluate and score candidate prompts, which is expensive at scale and infeasible when ground truth is unavailable. PDO operates in the label-free setting: it leverages pairwise prompt comparisons, conducted via LLMs serving as judges, to drive an efficient search for high-performing prompts. This enables rapid prompt engineering with minimal manual labeling, making it especially useful for industrial and real-world LLM deployment where ground-truth labels may be delayed or absent (Wu et al., 14 Oct 2025).
Formally, given a finite or expanding set of prompt candidates $\mathcal{P}$ and an evaluation scenario (such as a question-answering task), PDO seeks the prompt that is preferred over the most alternatives in pairwise comparisons, $i^{\star} = \arg\max_{i \in \mathcal{P}} \sum_{j \neq i} \mathbf{1}\!\left[p_{ij} > \tfrac{1}{2}\right]$, where $p_{ij}$ is the probability that prompt $i$ is judged superior to prompt $j$ on the task.
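Concretely, if the pairwise win probabilities $p_{ij}$ were known, this objective reduces to picking the Copeland winner of the preference matrix. A minimal sketch, with an illustrative matrix that is not from the paper:

```python
import numpy as np

def copeland_winner(P: np.ndarray) -> int:
    """Index of the Copeland winner for a preference matrix P, where
    P[i, j] is the probability that prompt i beats prompt j."""
    wins = (P > 0.5).astype(int)
    np.fill_diagonal(wins, 0)           # a prompt never duels itself
    return int(np.argmax(wins.sum(axis=1)))

# Prompt 1 beats both rivals, so it is also a Condorcet winner here.
P = np.array([[0.5, 0.3, 0.6],
              [0.7, 0.5, 0.8],
              [0.4, 0.2, 0.5]])
assert copeland_winner(P) == 1
```

In practice the $p_{ij}$ are unknown and must be estimated from noisy judge feedback, which is exactly what the dueling bandit machinery below addresses.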
2. Dueling Bandit Formulation and Double Thompson Sampling
In the PDO framework, the prompt optimization process is modeled as a dueling bandit problem (Wu et al., 14 Oct 2025). Here, each prompt candidate is an arm, and the feedback for each "pull" is a binary outcome from dueling two prompts $(i, j)$: does prompt $i$ outperform prompt $j$ according to the LLM judge? The key probability parameter is $p_{ij} = \Pr(i \succ j)$.
PDO uses Double Thompson Sampling (D-TS) to efficiently decide which prompt pairs to compare. For each pair $(i, j)$, empirical win/loss counts $w_{ij}$ (victories of $i$ over $j$) are maintained, and the posterior over each pair's win probability is modeled as a Beta distribution, $p_{ij} \sim \mathrm{Beta}(w_{ij} + 1,\, w_{ji} + 1)$. At each iteration, two independent Thompson samples $\theta^{(1)}_{ij}, \theta^{(2)}_{ij}$ are drawn for each pair, inducing optimistic and pessimistic Copeland scores $\hat{C}^{(k)}_i = \sum_{j \neq i} \mathbf{1}\!\left[\theta^{(k)}_{ij} > \tfrac{1}{2}\right]$ for $k \in \{1, 2\}$.
These optimistic and pessimistic scores help select the pairs with the most ambiguous outcomes for further exploration, focusing sampling on the most informative duels.
The selection criterion is to identify a Condorcet winner, if one exists (a prompt beating all others with probability $p_{ij} > \tfrac{1}{2}$). If none exists, the Copeland winner, i.e. the prompt with the maximum number of pairwise wins, is selected.
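The following is a minimal sketch of one D-TS selection step under these definitions; it keeps the two independent Beta samples and the Copeland scoring, but omits the confidence-interval pruning and tie-breaking of the full algorithm, so it should be read as an approximation rather than the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def dts_select_pair(wins: np.ndarray) -> tuple[int, int]:
    """Pick the next duel. wins[i, j] counts victories of prompt i over
    prompt j; each p_ij gets a Beta(wins[i,j]+1, wins[j,i]+1) posterior,
    sampled twice independently (the 'double' in D-TS)."""
    n = wins.shape[0]
    theta1 = np.zeros((n, n))
    theta2 = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                theta1[i, j] = rng.beta(wins[i, j] + 1, wins[j, i] + 1)
                theta2[i, j] = rng.beta(wins[i, j] + 1, wins[j, i] + 1)
    # Copeland score under the first sample picks a champion ...
    first = int(np.argmax((theta1 > 0.5).sum(axis=1)))
    # ... and the second sample picks its most threatening challenger,
    # i.e. the rival most likely to beat the champion.
    challengers = [j for j in range(n) if j != first]
    second = max(challengers, key=lambda j: theta2[j, first])
    return first, second

# Usage: run the chosen duel with the LLM judge, then update the counts.
wins = np.zeros((4, 4))
i, j = dts_select_pair(wins)
# outcome = judge_duel(prompts[i], prompts[j])   # 1 if prompt i wins
# wins[i, j] += outcome; wins[j, i] += 1 - outcome
```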
3. Candidate Expansion via Top-Performer Guided Mutation
Fixed candidate sets present an inherent limitation: once the optimal within-pool prompt is found, exploration cannot improve upon it. PDO incorporates a mutation module that periodically expands the prompt pool by generating new candidates via guided mutations of already top-performing prompts. Typical strategies include template-level token edits or passing high-ranking prompts to an LLM for creative rewriting, leveraging the notion that local mutations of top prompts—assuming a Lipschitz-like performance landscape—are likely to yield superior candidates without sacrificing sample efficiency (Wu et al., 14 Oct 2025).
At each expansion phase, the prompt(s) with the highest Copeland scores are mutated, and the resulting variants are injected into the candidate pool, as in the sketch below. This iterative enlargement ensures continued progress even after a local maximum has been reached within the initial candidate set.
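In the sketch, `llm_complete` is a hypothetical text-in/text-out wrapper around whatever completion API is in use, and the mutation instruction wording is illustrative rather than quoted from the paper:

```python
from typing import Callable, List

MUTATION_INSTRUCTION = (
    "You are a prompt engineer. Rewrite the following prompt to improve "
    "its clarity and task performance while preserving its intent:\n\n{prompt}"
)

def expand_pool(pool: List[str], copeland_scores: List[int],
                llm_complete: Callable[[str], str],
                n_mutants: int = 2) -> List[str]:
    """Mutate the current Copeland leader into new candidate prompts."""
    leader = pool[max(range(len(pool)), key=lambda k: copeland_scores[k])]
    mutants = [llm_complete(MUTATION_INSTRUCTION.format(prompt=leader))
               for _ in range(n_mutants)]
    return pool + mutants   # new variants enter future duels
```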
4. Label-Free Preference Feedback and Hybridization
A distinguishing feature of PDO is its reliance on label-free, pairwise preference feedback. The LLM judge compares the outputs generated by two prompts on the same input and indicates a winner, eliminating the need for gold references. This feedback is typically less susceptible to calibration artifacts than pointwise scoring. All win/loss updates are taken directly from such judgments. When available, PDO can also ingest a fraction of ground-truth labels to provide tie-breaks or correct systematic bias introduced by noisy or imperfect judges, but this is optional and not required for convergence (Wu et al., 14 Oct 2025).
This approach makes prompt optimization practical in scenarios where human-labeled metrics are unavailable, expensive, or slow to obtain.
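One plausible shape for a single duel is sketched below; `generate` and `judge` are hypothetical callables standing in for the task LLM and the judge LLM, and the A/B order randomization is a standard guard against judge position bias rather than a detail confirmed by the source:

```python
import random
from typing import Callable

JUDGE_TEMPLATE = (
    "You are an impartial judge. Given a question and two candidate "
    "answers, reply with exactly 'A' or 'B' for the better answer.\n\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
)

def duel(prompt_i: str, prompt_j: str, question: str,
         generate: Callable[[str, str], str],
         judge: Callable[[str], str]) -> int:
    """Label-free duel: returns 1 if prompt_i wins, else 0."""
    out_i = generate(prompt_i, question)
    out_j = generate(prompt_j, question)
    i_is_a = random.random() < 0.5          # randomize presentation order
    a, b = (out_i, out_j) if i_is_a else (out_j, out_i)
    verdict = judge(JUDGE_TEMPLATE.format(
        question=question, answer_a=a, answer_b=b))
    # Prompt i wins iff the judge's pick matches i's assigned slot.
    return int(verdict.strip().upper().startswith("A") == i_is_a)
```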
5. Empirical Results and Sample Efficiency
Experiments conducted with PDO on BIG-bench Hard (BBH) multiple-choice tasks and MS MARCO open-ended question answering demonstrate superior sample efficiency and prompt quality compared to baselines. PDO, using only LLM judge feedback, consistently outperformed random sampling and other naive dueling bandit algorithms, rapidly converging to high-performing prompts with far fewer comparisons. Even when compared to supervised baselines with partial label access, PDO closed the performance gap (Wu et al., 14 Oct 2025).
Ablation studies highlight that D-TS drastically reduces the number of duels required to find top prompts, and that guided mutation is essential for escaping local optima and achieving global improvement.
6. Practical Implications, Limitations, and Extensions
PDO offers significant benefits for real-world LLM deployment, enabling online prompt adaptation in industrial text classification, interactive agents, or domains where rapid iteration and cost minimization are priorities. Its label-free design, reliance on preference-based duels, and continual candidate expansion make it robust to judge noise and efficient in exploration.
However, the framework’s efficiency depends on the underlying judge’s reliability and the structure of the prompt space: highly non-Lipschitz performance landscapes or extremely high-dimensional, semantically orthogonal prompt candidates may reduce the effectiveness of mutation strategies. A plausible implication is that future improvements in mutation mechanisms, judge modeling, or hybrid dueling strategies (e.g., combining dueling and pointwise feedback) could further enhance performance.
Given the generality of the dueling bandit paradigm, the PDO methodology may inform other discrete or combinatorial search tasks that lack reliable ground-truth scores but support informative pairwise comparisons.
Table: PDO Core Components
| Module | Functionality | Key Fact |
|---|---|---|
| D-TS Bandit Selector | Efficiently chooses pairs to duel | Uses independent Thompson samples for Copeland ranking |
| Preference Feedback | Label-free supervision | Relies on LLM judge pairwise decisions, not reference labels |
| Guided Mutation | Expands candidate pool | Mutates top performers to ensure continued progress |
In summary, the Prompt Duel Optimizer (PDO) framework establishes a principled, empirically validated approach to label-free, sample-efficient prompt optimization through dueling bandits, integrating efficient exploration with mutative expansion and preference-based learning (Wu et al., 14 Oct 2025). This approach shifts prompt engineering from a label-reliant, manual process to an automated, competitive search that is well-matched to the realities of LLM deployment.