
Prompt-OIRL: Query-Dependent Prompt Evaluation

Updated 30 March 2026
  • Prompt-OIRL is a framework that uses offline inverse reinforcement learning to select optimal, query-specific prompts for zero-shot arithmetic reasoning.
  • It recasts prompt selection as a one-step decision process by employing a learned reward model to efficiently rank candidate prompts.
  • Empirical results show significant accuracy improvements and over 10× reduction in API costs compared to traditional LLM self-critique methods.

Prompt-OIRL (“Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL”) is a framework for per-query prompt selection in zero-shot arithmetic reasoning with LLMs. It draws on offline inverse reinforcement learning (IRL) to learn a cost-effective, query-dependent prompt evaluator, so the system can select a prompt for each new query without incurring high inference-time computational expenses. The methodology directly targets the limitations of standard, query-agnostic prompt selection and the prohibitive cost of online prompt evaluation, achieving notable gains in arithmetic reasoning accuracy on established benchmarks while substantially reducing required LLM calls (Sun et al., 2023).

1. Problem Formulation

Prompt-OIRL focuses on maximizing per-instance accuracy when using black-box LLMs for tasks such as arithmetic word problem solving. Formally, for a given query $x$ and an LLM $\ell$, a prompt-augmented input $\pi(x)$ yields a response $\hat y = \ell(\pi(x))$. Correctness is measured by a reward function:

$$r(y^*, \hat y) = \mathbb{1}\{\hat y = y^*\}$$

where $y^*$ is the gold standard answer (available only during benchmarking). Traditional zero-shot prompt optimization finds a single prompt $\bar\pi^*$ maximizing average reward:

$$\bar\pi^* = \arg\max_{\pi}\; \mathbb{E}_{(x, y^*) \sim \mathcal{D}} [r(y^*, \ell(\pi(x)))] \tag{1}$$

Prompt-OIRL introduces a per-query, query-dependent objective:

$$\pi^*(x) = \arg\max_{\pi}\; r(y^*, \ell(\pi(x))) \tag{2}$$

In practical deployment, evaluating $r(y^*, \ell(\pi(x)))$ is infeasible since $y^*$ is unknown at inference, and brute-force API-based search is expensive. These are the two primary challenges addressed by the approach.
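The gap between the query-agnostic objective (1) and the query-dependent objective (2) can be illustrated with a small sketch. The reward table below is invented purely for illustration (it is not data from the paper), and the per-query selection shown here is an oracle that uses gold labels, which is exactly what is unavailable at inference time.

```python
# Toy reward table: rewards[i][k] = 1 if prompt k answers query i correctly.
# The values are invented for illustration, not taken from the paper.
rewards = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]

num_queries = len(rewards)
num_prompts = len(rewards[0])

# Eq. (1): one prompt for all queries, chosen by average reward.
avg_reward = [sum(row[k] for row in rewards) / num_queries for k in range(num_prompts)]
best_single = max(range(num_prompts), key=lambda k: avg_reward[k])
agnostic_accuracy = avg_reward[best_single]

# Eq. (2): the best prompt for each query (an oracle that needs gold labels).
per_query_choice = [max(range(num_prompts), key=lambda k: row[k]) for row in rewards]
dependent_accuracy = sum(row[k] for row, k in zip(rewards, per_query_choice)) / num_queries

print(agnostic_accuracy, dependent_accuracy)
```

In this toy table every single prompt solves only half of the queries, while a per-query choice solves all of them, which is the headroom a query-dependent evaluator tries to capture.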

2. Offline Inverse Reinforcement Learning Approach

The method reframes prompt selection as a one-step Markov decision process:

  • State: The current query $x$.
  • Action: Candidate prompt $\pi \in \Pi$.
  • Policy: A distribution over prompts, $\pi_\theta(x)$, parameterized by $\theta$.
  • Reward Model: A learned proxy, $\Upsilon_\phi(x, \pi(x))$, trained to approximate $r(y^*, \ell(\pi(x)))$ using only $(x, \pi(x))$, with no LLM access or gold labels at test time.

By analogy with classical IRL, the model minimizes a divergence between the distribution of expert/behavioral demonstrations $\pi_D$ and the policy $\pi_\phi$ to recover a reward function under which the demonstrated behaviors are optimal. Because prompt selection is a single-step decision, Prompt-OIRL avoids temporal credit assignment and directly learns a scalar reward proxy.

The optimal policy at inference is estimated as:

$$\pi^*(x)\; \approx\; \arg\max_{\pi}\; \Upsilon_\phi(x, \pi(x))$$

enabling prompt selection without repeated LLM queries.

3. Reward Model Training

Reward model training is based on an offline demonstration dataset $\mathcal{D}_{\rm dem}$, constructed by executing $K$ candidate zero-shot prompts on a held-out dataset. For each $(x^{(i)}, y^{*(i)})$ and prompt $\pi^{(k)}$:

$$r^{(i, k)} = r(y^{*(i)}, \ell(\pi^{(k)}(x^{(i)}))) \in \{0, 1\}$$

yielding samples $(x^{(i)}, \pi^{(k)}, r^{(i, k)})$. The pair $(x, \pi)$ is featurized via pretrained LLM embeddings $(e_x, e_\pi) \in \mathbb{R}^d$. The proxy reward model $\Upsilon_\phi$ is trained as a binary classifier with cross-entropy loss:

$$\mathcal{L}(\phi) = -\frac{1}{N K} \sum_{i, k} \big[ r^{(i,k)} \log \sigma(\Upsilon_\phi(e_x^{(i)}, e_\pi^{(k)})) + (1 - r^{(i,k)}) \log (1 - \sigma(\Upsilon_\phi(e_x^{(i)}, e_\pi^{(k)}))) \big]$$

where $\sigma$ is the sigmoid function. Gradient-boosted trees (e.g., XGBoost) applied to concatenated embeddings are empirically robust, outperforming deeper neural models in this reward modeling scenario.
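The training recipe above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation: the embeddings are random vectors, the labels come from a made-up rule (`toy_label`), and a plain logistic regression trained by gradient descent stands in for the gradient-boosted trees mentioned above. All names in the block are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical stand-ins: real embeddings would come from a pretrained model.
DIM = 4

def toy_embedding(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

query_embs = [toy_embedding(DIM) for _ in range(40)]   # plays the role of e_x
prompt_embs = [toy_embedding(DIM) for _ in range(3)]   # plays the role of e_pi

def toy_label(e_x, e_pi):
    # Synthetic reward rule used only to produce 0/1 labels for this demo.
    return 1 if e_x[0] + e_pi[0] > 0 else 0

# Offline demonstration set: one (concatenated features, reward) pair
# per (query, prompt) combination.
dataset = [(e_x + e_pi, toy_label(e_x, e_pi))
           for e_x in query_embs for e_pi in prompt_embs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, features):
    # Logit of the proxy reward (the quantity passed through the sigmoid).
    return sum(w * f for w, f in zip(weights, features)) + bias

def cross_entropy(weights, bias, data):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for features, r in data:
        p = sigmoid(score(weights, bias, features))
        total -= r * math.log(p + eps) + (1 - r) * math.log(1 - p + eps)
    return total / len(data)

weights, bias = [0.0] * (2 * DIM), 0.0
initial_loss = cross_entropy(weights, bias, dataset)

learning_rate = 0.5
for _ in range(200):
    grad_w, grad_b = [0.0] * (2 * DIM), 0.0
    for features, r in dataset:
        error = sigmoid(score(weights, bias, features)) - r
        for j, f in enumerate(features):
            grad_w[j] += error * f
        grad_b += error
    weights = [w - learning_rate * g / len(dataset) for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / len(dataset)

final_loss = cross_entropy(weights, bias, dataset)
accuracy = sum((score(weights, bias, f) > 0) == (r == 1) for f, r in dataset) / len(dataset)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, train accuracy {accuracy:.2f}")
```

The linear model is chosen only to keep the sketch dependency-free; any binary classifier that outputs a score for a concatenated (query, prompt) embedding could take its place.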

4. Inference-Time Prompt Evaluation and Best-of-N Selection

At inference, for a given query $x$ and a candidate pool of prompts $\{\pi_n\}_{n=1}^N$, Prompt-OIRL computes reward scores $\Upsilon_\phi(x, \pi_n(x))$ for each $n$ and selects:

$$\pi^*(x) = \arg\max_n \Upsilon_\phi(x, \pi_n(x))$$

This "best-of-N" strategy sources $N$ prompts either via generative LLM output or from existing prompt pools. The best candidate under the learned proxy reward is chosen for submission to the black-box LLM, circumventing the need for direct LLM evaluation on the full set of competitors. Theoretical IRL guarantees state that if the proxy reward $\Upsilon_\phi$ reliably ranks prompts by true reward, this greedy selection is statistically sound.
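The best-of-N step reduces to an argmax over candidate prompts under the proxy reward. The sketch below assumes a scoring function with the signature `proxy_reward(query, prompt) -> float`; both `select_prompt` and `toy_proxy_reward` are hypothetical names, and the toy scorer is a hand-written rule standing in for a trained reward model.

```python
from typing import Callable, Sequence

def select_prompt(query: str,
                  prompts: Sequence[str],
                  proxy_reward: Callable[[str, str], float]) -> str:
    """Return the candidate prompt with the highest proxy reward for this query."""
    if not prompts:
        raise ValueError("need at least one candidate prompt")
    return max(prompts, key=lambda prompt: proxy_reward(query, prompt))

# Hypothetical scorer for demonstration only; a trained reward model would go here.
def toy_proxy_reward(query: str, prompt: str) -> float:
    wants_steps = any(ch.isdigit() for ch in query)
    return 1.0 if wants_steps == ("step" in prompt) else 0.0

candidates = [
    "The answer is:",
    "Let's think step by step.",
]
chosen = select_prompt("Tom has 3 apples and buys 4 more. How many now?",
                       candidates, toy_proxy_reward)
print(chosen)
```

Only the chosen prompt would then be sent to the target LLM; scoring the candidates themselves involves no LLM call in this sketch.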

5. Experimental Setup and Empirical Results

Prompt-OIRL was evaluated on three arithmetic reasoning datasets: GSM8K (7.5K train/1.3K test), SVAMP (15K/4.7K), and MAWPS (6K/1.7K), using LLMs including GPT-3.5-turbo, LLaMA-2-7B-chat, and TigerBot-13B-chat.

Baselines included:

  • Best-of-Training (BoTr Eq 1): Single best prompt per average training reward.
  • Best-of-Training (BoTr Eq 2): Per-query selection among training prompts using Υϕ\Upsilon_\phi.
  • LLM Self-Critique (LMSC): LLM judges correctness of $K$ prompted answers (requires $2K$ LLM calls per query).
  • LLM-Confidence: Selects answer/prompt via maximal log probability.

Results (success rates and model prediction metrics):

  • Under minimal demonstration (single prompt), Prompt-OIRL increased success from $\sim 40\%$ to $\sim 65\%$.
  • With six demonstration prompts, Prompt-OIRL improved over trained distributional optimizers by $\sim 8.8$ percentage points.
  • Proxy reward model accuracy ranged from 75–96%, with precision 60–97%, outperforming LMSC.
  • API costs per query are reduced by more than $10\times$ compared to LLM self-critique: $0.00041$ USD (Prompt-OIRL) vs $0.0056$ USD (LMSC for $K=6$).

6. Computational Efficiency and Limitations

Prompt-OIRL achieves high computational efficiency since reward model inference is a vector lookup with a classifier—requiring no LLM calls per candidate—and only the final selected prompt is processed by the target LLM for answer generation. This results in substantial cost reductions and real-time applicability for deployments with expensive or rate-limited LLM APIs.
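The reported figures can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this article; the count of one target-LLM call for Prompt-OIRL is an assumption drawn from the description above and ignores any cost of computing embeddings.

```python
K = 6                      # candidate prompts in the reported comparison
lmsc_calls = 2 * K         # self-critique baseline: 2K LLM calls per query
oirl_calls = 1             # assumption: only the selected prompt reaches the LLM

lmsc_cost_usd = 0.0056     # reported per-query cost, LMSC with K = 6
oirl_cost_usd = 0.00041    # reported per-query cost, Prompt-OIRL
cost_ratio = lmsc_cost_usd / oirl_cost_usd

print(f"LLM calls per query: {lmsc_calls} vs {oirl_calls}")
print(f"cost ratio: {cost_ratio:.1f}x")
```

The ratio comes out at roughly 13.7, consistent with the stated reduction of more than $10\times$.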

Known limitations include dependence on the availability of offline demonstration logs, potential reward modeling degradation under extreme label imbalance, and heuristic aspects of the best-of-N policy solver. Reward model generalization to broader tasks (beyond arithmetic reasoning) remains an open extension.

7. Extensions and Future Directions

Potential avenues for improvement involve:

  • More sophisticated policy optimization strategies (e.g., RL, beam/evolutionary search) under the proxy reward $\Upsilon_\phi$.
  • Joint co-training of prompt pools and reward models.
  • Application to open-ended text generation tasks or domains outside arithmetic.
  • Establishment of shared, publicly accessible offline demonstration repositories to amplify reward model training efficacy and prompt diversity.

A plausible implication is that integrating Prompt-OIRL with richer prompt search or adaptive demonstration selection may further enhance performance and generalization. Its framework provides principled, low-cost per-query zero-shot prompt optimization by leveraging offline prompt usage data, learning a generalized proxy reward, and eliminating expensive LLM calls during prompt selection (Sun et al., 2023).

References (1)
