Textual Reward Prompt Framework (TRPrompt)
- The paper introduces TRPrompt, a framework that integrates detailed textual feedback with reinforcement learning to iteratively refine query-specific prompts.
- It employs an end-to-end cycle that uses synthetic data generation and high-resolution textual rewards to achieve state-of-the-art performance on GSMHard and MATH benchmarks.
- The framework eliminates the need for expert initialization, combining supervised fine-tuning of the prompt model with train-free optimization of the textual reward to improve prompt quality and enhance LLM reasoning on complex tasks.
The Textual Reward Prompt Framework (TRPrompt) represents a major direction in natural language prompt optimization, unifying the use of detailed textual feedback and reinforcement learning to improve the query-specific performance of LLMs. By integrating high-resolution, context-dependent textual rewards directly into the prompt model’s training loop, TRPrompt enables efficient, iterative refinement of prompts without the need for prior dataset collection or parameter updates to the target model. This methodology contrasts with traditional approaches that rely on fixed numerical rewards or heuristically constructed prompts, and achieves state-of-the-art reasoning performance on challenging mathematical LLM benchmarks.
1. End-to-End Framework Structure
TRPrompt is built around an iterative, data-driven optimization cycle involving a prompt model, a textual reward model, and a target LLM:
- Prompt Model (Π_query): Receives an input question q and textual feedback (the reward) r and outputs a query-specific prompt p.
- Textual Reward Model (Π_reward): Evaluates the effectiveness of a generated prompt by comparing the prompt p, the associated question q, the target model’s answer a, and the ground truth y. The output is detailed, context-rich textual feedback r providing actionable critique or guidance.
- Target Model (Π_target): A frozen LLM that answers questions using the concatenation q ⊕ p of the question and the generated prompt.
The framework’s core mechanism is an iterative process:
- For each question q, generate a candidate prompt p using the current version of the prompt model Π_query and the latest optimal textual reward r*.
- Obtain an answer a = Π_target(q ⊕ p) from the target LLM.
- Evaluate the prompt using the textual reward model: r = Π_reward(q, p, a, y).
- Collect the tuples (q, r, p) to form a synthetic training set D.
- Fine-tune the prompt model using supervised learning to improve Π_query via the loss L_SFT = −E_{(q,r,p)∈D}[log Π_query(p | q, r)].
- Update r* with train-free optimization (TextGrad), seeking the textual reward yielding the best aggregated downstream results.
- Repeat for multiple cycles, progressively enhancing prompt quality and specificity.
This cycle enables learning from high-information, context-dependent rewards and supports prompt model improvement without reliance on manually crafted exemplars or static prompt templates.
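A minimal sketch of one such cycle is given below, with plain Python callables standing in for the three models and the two update steps; all helper names (`finetune`, `search_reward`, etc.) are illustrative assumptions rather than the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str      # q
    ground_truth: str  # y

def trprompt_cycle(train_set, prompt_model, target_model, reward_model,
                   r_star, finetune, search_reward):
    """One TRPrompt iteration: generate, answer, critique, fine-tune, re-search r*."""
    synthetic_data = []
    for ex in train_set:
        # 1. Query-specific prompt conditioned on the current optimal textual reward r*.
        p = prompt_model(question=ex.question, textual_reward=r_star)
        # 2. Answer from the frozen target LLM on the concatenation q ⊕ p.
        a = target_model(ex.question + "\n" + p)
        # 3. Context-rich textual feedback from the reward model.
        r = reward_model(question=ex.question, prompt=p,
                         answer=a, ground_truth=ex.ground_truth)
        # 4. (q, r, p) tuple for the synthetic SFT dataset D.
        synthetic_data.append((ex.question, r, p))
    # 5. Supervised fine-tuning of the prompt model on D.
    prompt_model = finetune(prompt_model, synthetic_data)
    # 6. Train-free (TextGrad-style) search for the textual reward that maximizes
    #    downstream accuracy of the updated prompt model.
    r_star = search_reward(prompt_model, target_model, train_set)
    return prompt_model, r_star
```

The target model remains frozen throughout; only the prompt model and the optimal textual reward r* change between iterations.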
2. Methodological Innovations
Distinctive aspects of TRPrompt include:
- Direct Use of Textual Rewards: Unlike methods that use feedback only post hoc, TRPrompt incorporates detailed natural language critiques as first-class training signals. Textual rewards encode richer, fine-grained guidance than sparse numerical scores (see the illustrative contrast after this list).
- Query-Aware Prompt Optimization: The prompt model is trained end-to-end to produce prompts that are highly tailored to individual queries or problem instances, leading to improved performance on tasks requiring complex reasoning such as mathematical problem solving.
- No Expert Initialization Required: Training sets are generated synthetically via iterative model self-improvement, eliminating the need for initial expert prompts or manual dataset curation.
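The contrast below is a purely invented illustration (not taken from the paper or any dataset) of how much more actionable a textual reward is than a scalar score:

```python
# A scalar reward only signals failure; a textual reward explains it and suggests a fix.
numerical_reward = 0.0  # scalar "incorrect" signal: no indication of what the prompt should change

textual_reward = (
    "The prompt told the model to answer directly, so it skipped intermediate steps and "
    "mishandled a unit conversion. Rewrite the prompt to require explicit step-by-step "
    "arithmetic and a final check of the units before stating the answer."
)
```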
The framework’s dual objectives are formalized as:
- Optimal prompt model: Π_query* = argmax_{Π_query} E_{(q,y)} [ Acc( Π_target(q ⊕ Π_query(q, r*)), y ) ]
- Optimal textual reward: r* = argmax_{r} E_{(q,y)} [ Acc( Π_target(q ⊕ Π_query(q, r)), y ) ]
3. Comparison to Related Approaches
TRPrompt unifies and advances two dominant lines of previous research:
- Training-free, heuristic-based prompt methods (e.g., "Think step by step," TextGrad, APO) use LLMs to iteratively improve prompts through textual feedback, but do not embed this supervision in a trainable prompt model.
- Numerical-reward prompt learning (e.g., Prompt-OIRL, QPO) optimizes discrete prompts by maximizing scalar rewards using RL or search, with limited reward expressivity.
TRPrompt surpasses both by:
- Embedding text feedback as supervision in prompt model training (unlike post-hoc refinement methods).
- Achieving higher query specificity and precision through detailed textual signals rather than coarse numerical rewards.
- Enabling synthetic, task- and instance-specific data collection rather than relying on static, expert-curated prompts.
4. Application to Mathematical Reasoning Tasks
The framework demonstrates strong empirical gains in challenging LLM benchmarks, notably on GSMHard and MATH datasets. In these domains, prompt specificity and nuanced guidance are critical due to the complexity and diversity of arithmetic reasoning problems. TRPrompt’s iterative refinement and rich feedback enable:
- Fully query-dependent prompt generation, adapting to varied mathematical tasks without initialization bias (an illustrative example follows this list).
- Accuracies of 31.76% on GSMHard and 41.37% on MATH, outperforming static CoT prompts, Prompt-OIRL, and QPO approaches by margins up to +2% on the most complex datasets.
- Consistent cumulative improvements (e.g., 7.5 percentage points across iterations) due to progressive refinement and learning from prior errors.
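As a hypothetical illustration (both the question and the prompt below are invented for exposition, not drawn from GSMHard or the paper), a query-specific prompt produced by Π_query might look like this:

```python
question = (
    "A factory produces 3,847 widgets per day and 12.5% of them are defective. "
    "How many non-defective widgets does it produce in 30 days?"
)

# Tailored to this query: large numbers, a percentage, and a multi-step product.
query_specific_prompt = (
    "This problem mixes a percentage with large numbers. First compute the exact number "
    "of defective widgets per day, subtract it from the daily total, then multiply by the "
    "number of days, writing out each intermediate value before the final answer."
)
```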
5. Training Process and Optimization
A synthetic training database D = {(q, r, p)} is constructed in each iteration from the current prompt model’s outputs and the real-time feedback of the textual reward model. The prompt model is fine-tuned with supervised learning, conditioned jointly on the question q and the feedback r, according to the loss L_SFT = −E_{(q,r,p)∈D}[log Π_query(p | q, r)].
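A minimal sketch of this SFT objective, assuming a HuggingFace causal LM as the prompt model; the checkpoint name and the input formatting are illustrative choices, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")             # stand-in prompt model Π_query
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sft_loss(question: str, textual_reward: str, prompt: str) -> torch.Tensor:
    """Negative log-likelihood of the prompt p conditioned on (q, r)."""
    context = f"Question: {question}\nFeedback: {textual_reward}\nPrompt:"
    ctx_ids = tok(context, return_tensors="pt").input_ids
    tgt_ids = tok(" " + prompt, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100                # only prompt tokens count in the loss
    return model(input_ids=input_ids, labels=labels).loss
```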
Concurrently, TextGrad is used to update r* by directly searching the space of textual rewards for the feedback that maximally improves downstream target accuracy. This train-free search guides the bootstrapping process and ensures that only high-value feedback is used to direct prompt optimization.
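The paper's search is driven by TextGrad; the sketch below replaces it with a generic propose-and-evaluate loop to convey the idea, and every helper callable here is a hypothetical stand-in:

```python
def search_textual_reward(prompt_model, target_model, eval_set,
                          propose_candidates, is_correct, current_best):
    """Keep the textual reward whose prompts yield the highest downstream accuracy."""
    def accuracy(reward):
        hits = 0
        for question, ground_truth in eval_set:
            p = prompt_model(question=question, textual_reward=reward)
            a = target_model(question + "\n" + p)
            hits += int(is_correct(a, ground_truth))
        return hits / len(eval_set)

    best_reward, best_acc = current_best, accuracy(current_best)
    for candidate in propose_candidates(current_best):  # e.g., LLM-written revisions of r*
        acc = accuracy(candidate)
        if acc > best_acc:                              # only higher-value feedback survives
            best_reward, best_acc = candidate, acc
    return best_reward
```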
The framework is iterated for multiple cycles, with each round producing increasingly effective, query-aware prompts and textual rewards.
6. Formalization and Mathematical Expressions
TRPrompt’s optimization objectives center on finding the prompt model Π_query and the textual reward r* that maximize expected task reward, optimized in alternation as described above:

(Π_query*, r*) = argmax_{Π_query, r} E_{(q,y)} [ Acc( Π_target(q ⊕ Π_query(q, r)), y ) ]

The SFT loss guides prompt model training:

L_SFT = −E_{(q,r,p)∈D} [ log Π_query(p | q, r) ]
7. Empirical Results and Significance
TRPrompt yields state-of-the-art results among prompt optimization frameworks—characterized by:
| Dataset | Method | Accuracy (%) | Query-aware | Training-free |
|---|---|---|---|---|
| GSMHard | TRPrompt | 31.76 | Yes | Yes |
| GSMHard | Prompt-OIRL/QPO | ≤ 30 | Partial | No |
| MATH | TRPrompt | 41.37 | Yes | Yes |
| MATH | CoT prompts, QPO | ≤ 39 | No | Partial |
- Gains of approximately +1% on GSMHard and +2% on MATH over best baselines.
- High consistency across rounds due to automatic, feedback-driven dataset construction.
The framework is robust to initial conditions and generalizes across varied mathematical queries, reflecting the advantage of rich, high-resolution textual reward supervision over conventional methods.
TRPrompt establishes a new paradigm for prompt optimization by embedding detailed textual feedback into an iterative, supervised prompt model training loop. It achieves strong performance improvements in complex reasoning domains by leveraging query-aware prompts, synthetic data generation, and train-free textual reward search, laying the foundation for further advances in LLM instruction following and adaptivity.