This paper introduces Prompt-OIRL, a novel method for improving the arithmetic reasoning abilities of LLMs through zero-shot prompt optimization. The core idea is to move beyond finding a single, universally best prompt (query-agnostic) and instead optimize prompts based on the specific query being asked (query-dependent).
The authors identify two main challenges with traditional prompt optimization:
- Inference Time Evaluation is Hard: Determining the best prompt for a new query during inference is difficult because the correct answer (ground truth) is usually unavailable. Simply trying multiple prompts and generating answers doesn't reveal which answer is correct without extra effort.
- Online Prompt Evaluation and Optimization is Expensive: Searching for optimal prompts by interacting with LLMs (online evaluation) is resource-intensive due to the cost of API calls or computational resources, especially given the vast space of possible natural language prompts.
To address these challenges and achieve query-dependent optimization, Prompt-OIRL utilizes Offline Inverse Reinforcement Learning (Offline IRL). The process involves three main steps:
- Leveraging Offline Prompt Demonstrations: The method uses existing datasets that are often generated as by-products when researchers benchmark different prompting strategies (e.g., Zero-shot CoT, APE) on standard tasks such as arithmetic reasoning. These datasets contain (query, prompt, success_label) tuples, where the success label indicates whether the prompt led the LLM to the correct answer for that query. This data captures the LLM's "preference" for certain prompts on certain queries.
- Offline Reward Modeling (Inverse RL): An offline reward model, denoted $\hat{r}_\phi(q, p)$, is trained on this demonstration dataset. The model learns to predict the probability that a given prompt $p$ will lead to a correct answer for a specific query $q$, without needing to interact with the LLM or know the ground-truth answer $y^*$. It takes embeddings of the query and prompt as input. Training is standard supervised learning, e.g., minimizing the cross-entropy loss against the observed success labels $r \in \{0, 1\}$ in the demonstration dataset $\mathcal{D}$:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(q,\, p,\, r) \sim \mathcal{D}} \Big[\, r \log \hat{r}_\phi(q, p) + (1 - r) \log \big(1 - \hat{r}_\phi(q, p)\big) \Big]$$
The paper finds that gradient boosting models (like XGBoost) work well for $\hat{r}_\phi$. This learned reward model effectively solves Challenge 1 by providing an offline evaluation mechanism.
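As a concrete illustration, here is a minimal sketch of how such a reward model could be trained, assuming off-the-shelf sentence embeddings and XGBoost's binary-logistic objective standing in for the cross-entropy loss; the embedding model, hyperparameters, and helper names are illustrative assumptions, not the paper's released code.

```python
# Sketch: train an offline reward model r_hat(q, p) on (query, prompt, success_label) tuples.
# Assumptions (not from the paper's code): any sentence-embedding model can be used,
# and the demonstration data is an iterable of (query_text, prompt_text, success) triples.
import numpy as np
import xgboost as xgb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def featurize(query: str, prompt: str) -> np.ndarray:
    """Concatenate query and prompt embeddings into a single feature vector."""
    q_emb, p_emb = encoder.encode([query, prompt])
    return np.concatenate([q_emb, p_emb])

def train_reward_model(demos):
    """demos: iterable of (query_text, prompt_text, success_label) with labels in {0, 1}."""
    X = np.stack([featurize(q, p) for q, p, _ in demos])
    y = np.array([label for _, _, label in demos])
    # binary:logistic minimizes the same cross-entropy objective described above
    model = xgb.XGBClassifier(objective="binary:logistic", n_estimators=300, max_depth=6)
    model.fit(X, y)
    return model

def predict_success(model, query: str, prompt: str) -> float:
    """Predicted probability that `prompt` leads the LLM to a correct answer for `query`."""
    return float(model.predict_proba(featurize(query, prompt)[None, :])[0, 1])
```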
- Offline Prompt Optimization: During inference, given a new query $q$, the learned reward model $\hat{r}_\phi$ is used to evaluate a set of candidate prompts $\mathcal{P}$. The prompt with the highest predicted success probability is selected:

$$p^\star = \arg\max_{p \in \mathcal{P}} \hat{r}_\phi(q, p)$$
The paper uses a simple "best-of-N" strategy: N candidate prompts (including known effective prompts and potentially newly generated ones) are scored with $\hat{r}_\phi$, and the highest-scoring one is sent to the LLM. This step solves Challenge 2 by optimizing the prompt choice offline, minimizing costly LLM interactions at inference time.
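Given the learned model, the best-of-N selection step is equally lightweight. The sketch below reuses the hypothetical `predict_success` helper from the training sketch above; the candidate prompt list is illustrative (the first two are the well-known Zero-shot CoT and APE prompts), not the paper's exact candidate set.

```python
# Sketch: query-dependent best-of-N prompt selection with the learned reward model.
# `reward_model` and `predict_success` come from the training sketch above (assumptions,
# not the paper's released code); the candidate prompts are illustrative examples.
CANDIDATE_PROMPTS = [
    "Let's think step by step.",
    "Let's work this out in a step by step way to be sure we have the right answer.",
    "First, break the problem into smaller parts, then solve each part.",
]

def select_prompt(reward_model, query: str, candidates=CANDIDATE_PROMPTS) -> str:
    """Return the candidate prompt with the highest predicted success probability for `query`."""
    scores = {p: predict_success(reward_model, query, p) for p in candidates}
    return max(scores, key=scores.get)

# Usage: the selected prompt is appended to the query and sent to the LLM once,
# so no extra LLM calls are spent evaluating candidates at inference time.
# best_prompt = select_prompt(reward_model, "Natalia sold clips to 48 of her friends ...")
```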
Experiments were conducted on arithmetic reasoning datasets (GSM8K, MAWPS, SVAMP) using various LLMs (GPT-3.5-turbo, LLaMA-2-7B-Chat, TigerBot-13B-Chat). The results demonstrate that:
- Prompt-OIRL successfully achieves the query-dependent objective, significantly improving performance over using the single best prompt found during training (query-agnostic) or relying on the LLM's self-reported confidence.
- The learned reward model accurately predicts prompt success offline, outperforming LLM self-criticism baselines in evaluation accuracy and precision.
- The approach is highly cost-effective compared to methods requiring online LLM interactions for evaluating multiple prompt candidates at inference time.
In conclusion, Prompt-OIRL offers a practical and efficient framework for query-dependent prompt optimization: offline demonstration data and inverse reinforcement learning are used to train a reward model that enables prompt evaluation and selection without ground-truth answers or additional LLM calls.