
Textual Reward Prompt Framework (TRPrompt)

Updated 26 July 2025
  • The paper introduces TRPrompt, a framework that integrates detailed textual feedback with reinforcement learning to iteratively refine query-specific prompts.
  • It employs an end-to-end cycle that uses synthetic data generation and high-resolution textual rewards to achieve state-of-the-art performance on GSMHard and MATH benchmarks.
  • The framework eliminates the need for expert initialization, using train-free optimization to improve prompt quality and enhance LLM reasoning in complex tasks.

The Textual Reward Prompt Framework (TRPrompt) represents a major direction in natural language prompt optimization, unifying the use of detailed textual feedback and reinforcement learning to improve the query-specific performance of LLMs. By integrating high-resolution, context-dependent textual rewards directly into the prompt model’s training loop, TRPrompt enables efficient, iterative refinement of prompts without the need for prior dataset collection or parameter updates to the target model. This methodology contrasts with traditional approaches that rely on fixed numerical rewards or heuristically constructed prompts, and achieves state-of-the-art reasoning performance on challenging mathematical LLM benchmarks.

1. End-to-End Framework Structure

TRPrompt is built around an iterative, data-driven optimization cycle involving a prompt model, a textual reward model, and a target LLM:

  • Prompt Model ($\Pi_{query}$): Receives an input question $q$ and textual feedback $t$ (the reward) and outputs a query-specific prompt $p$.
  • Textual Reward Model ($R_{textual}$): Evaluates the effectiveness of a generated prompt by comparing the prompt $p$, the associated question $q$, the target model’s answer, and the ground truth $y^*$. The output is detailed, context-rich textual feedback $t$ providing actionable critique or guidance.
  • Target Model ($M_{target}$): A frozen LLM that answers questions using the concatenation of $q$ and the generated prompt $p$.
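
Concretely, these three roles can be read as simple callables around a language model. The following Python outline is illustrative only: the class names, prompt templates, and the generic `LLM.generate` interface are assumptions made for exposition, not the paper's released implementation.

```python
from dataclasses import dataclass
from typing import Protocol


class LLM(Protocol):
    def generate(self, prompt: str) -> str:
        """Return a completion for the given input text."""
        ...


@dataclass
class PromptModel:
    """Pi_query: maps (question q, textual feedback t) -> query-specific prompt p."""
    llm: LLM

    def __call__(self, question: str, feedback: str) -> str:
        instruction = (
            f"Feedback from previous attempts:\n{feedback}\n\n"
            f"Write a prompt that helps a model solve:\n{question}"
        )
        return self.llm.generate(instruction)


@dataclass
class TextualRewardModel:
    """R_textual: critiques a prompt given (q, model answer y, ground truth y*)."""
    llm: LLM

    def __call__(self, prompt: str, question: str, answer: str, truth: str) -> str:
        critique_request = (
            f"Question: {question}\nPrompt used: {prompt}\n"
            f"Model answer: {answer}\nCorrect answer: {truth}\n"
            "Explain in detail how the prompt should change to fix the answer."
        )
        return self.llm.generate(critique_request)


@dataclass
class TargetModel:
    """M_target: frozen solver that answers q conditioned on the generated prompt p."""
    llm: LLM

    def __call__(self, question: str, prompt: str) -> str:
        return self.llm.generate(f"{prompt}\n\n{question}")
```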

The framework’s core mechanism is an iterative process:

  1. For each question $q_i$, generate a candidate prompt $p_i = \Pi_{query}(q_i, t^*)$ using the current version of the prompt model and the latest optimal textual reward $t^*$.
  2. Obtain an answer $y_i = M_{target}(q_i, p_i)$ from the target LLM.
  3. Evaluate using the textual reward model: $t_i = R_{textual}(p_i, (q_i, y_i, y^*_i))$.
  4. Collect $(p_i, q_i, t_i)$ tuples to form a synthetic training set.
  5. Fine-tune the prompt model using supervised learning to improve $P(p \mid q, t)$ via

$$L_{SFT} = -\mathbb{E}_{(p, q, t) \sim D_{train}} \log P(p \mid q, t)$$

  6. Update $t^*$ with train-free optimization (TextGrad), seeking the textual reward that yields the best aggregated downstream results.
  7. Repeat for $K$ cycles, progressively enhancing prompt quality and specificity.

This cycle enables learning from high-information, context-dependent rewards and supports prompt model improvement without reliance on manually crafted exemplars or static prompt templates.
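
Putting the steps together, the outer loop can be read as the sketch below, reusing the illustrative component classes above. The `fine_tune` and `search_textual_reward` callables are hypothetical placeholders for the SFT step and the TextGrad-style search discussed in later sections, not the authors' code.

```python
def trprompt_loop(
    prompt_model: PromptModel,
    reward_model: TextualRewardModel,
    target_model: TargetModel,
    questions: list[str],
    answers: list[str],          # ground-truth answers y*
    t_star: str,                 # current optimal textual reward t*
    num_cycles: int,             # K
    fine_tune,                   # (prompt_model, dataset) -> PromptModel
    search_textual_reward,       # train-free search over candidate feedback strings
) -> tuple[PromptModel, str]:
    for _ in range(num_cycles):
        dataset = []
        for q, y_true in zip(questions, answers):
            p = prompt_model(q, t_star)                  # 1. candidate prompt
            y = target_model(q, p)                       # 2. target model answer
            t = reward_model(p, q, y, y_true)            # 3. textual reward
            dataset.append((p, q, t))                    # 4. synthetic training tuple
        prompt_model = fine_tune(prompt_model, dataset)  # 5. SFT on P(p | q, t)
        t_star = search_textual_reward(                  # 6. update t* (TextGrad-style)
            prompt_model, target_model, questions, answers
        )
    return prompt_model, t_star
```

Only the prompt model's parameters change across cycles; the target model stays frozen throughout.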

2. Methodological Innovations

Distinctive aspects of TRPrompt include:

  • Direct Use of Textual Rewards: Unlike methods that use feedback only post hoc, TRPrompt incorporates detailed natural language critiques as first-class training signals. Textual rewards encode richer, fine-grained guidance than sparse numerical scores.
  • Query-Aware Prompt Optimization: The prompt model is trained end-to-end to produce prompts that are highly tailored to individual queries or problem instances, leading to improved performance on tasks requiring complex reasoning such as mathematical problem solving.
  • No Expert Initialization Required: Training sets are generated synthetically via iterative model self-improvement, eliminating the need for initial expert prompts or manual dataset curation.

The framework’s dual objectives are formalized as:

  • Optimal prompt model:

$$\Pi^*_{query} = \arg\max_{\Pi_{query}} \sum_i r\big(y^*_i,\ M_{target}(q_i, \Pi_{query}(q_i, t^*))\big)$$

  • Optimal textual reward:

$$t^* = \arg\max_{t \in T} \sum_i r\big(y^*_i,\ M_{target}(q_i, \Pi_{query}(q_i, t))\big)$$
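
Read operationally, the second objective scores each candidate feedback string by the aggregate downstream reward it induces and keeps the best one. The sketch below (reusing the illustrative components above, with a simple exact-match stand-in for $r$) shows that scoring step; it is an interpretation of the objective, not the paper's implementation.

```python
def score_candidate_feedback(
    t: str,
    prompt_model: PromptModel,
    target_model: TargetModel,
    questions: list[str],
    answers: list[str],
) -> float:
    """Evaluate sum_i r(y*_i, M_target(q_i, Pi_query(q_i, t))) for one candidate t."""
    def r(y_true: str, y_pred: str) -> float:
        # Simple exact-match reward; the paper's r is task-specific.
        return 1.0 if y_true.strip() == y_pred.strip() else 0.0

    total = 0.0
    for q, y_true in zip(questions, answers):
        p = prompt_model(q, t)      # prompt conditioned on candidate feedback t
        y = target_model(q, p)      # frozen target model's answer
        total += r(y_true, y)
    return total


# t* is then the best-scoring candidate from a pool T of feedback strings, e.g.:
# t_star = max(candidate_feedbacks, key=lambda t: score_candidate_feedback(t, ...))
```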

3. Relation to Prior Approaches

TRPrompt unifies and advances two dominant lines of prior research:

  • Training-free, heuristic-based prompt methods (e.g., "Think step by step," TextGrad, APO) use LLMs to iteratively improve prompts from textual feedback, but do not embed this supervision in a trainable prompt model.
  • Numerical-reward prompt learning (e.g., Prompt-OIRL, QPO) optimizes discrete prompts by maximizing scalar rewards using RL or search, with limited reward expressivity.

TRPrompt surpasses both by:

  • Embedding text feedback as supervision in prompt model training (unlike post-hoc refinement methods).
  • Achieving higher query specificity and precision through detailed textual signals rather than coarse numerical rewards.
  • Enabling synthetic, task- and instance-specific data collection rather than relying on static, expert-curated prompts.

4. Application to Mathematical Reasoning Tasks

The framework demonstrates strong empirical gains in challenging LLM benchmarks, notably on GSMHard and MATH datasets. In these domains, prompt specificity and nuanced guidance are critical due to the complexity and diversity of arithmetic reasoning problems. TRPrompt’s iterative refinement and rich feedback enable:

  • Fully query-dependent prompt generation, adapting to varied mathematical tasks without initialization bias.
  • Accuracies of 31.76% on GSMHard and 41.37% on MATH, outperforming static CoT prompts, Prompt-OIRL, and QPO approaches by margins up to +2% on the most complex datasets.
  • Consistent cumulative improvements (e.g., 7.5 percentage points across iterations) due to progressive refinement and learning from prior errors.

5. Training Process and Optimization

A synthetic training database, $D_{train}$, is constructed in each iteration from the current model’s output and the real-time feedback of the textual reward model. The prompt model is fine-tuned with supervised learning, conditioned jointly on the question $q$ and the feedback $t$, according to the loss

$$L_{SFT} = -\mathbb{E}_{(p_i, q_i, t_i) \sim D_{train}} \log P(p_i \mid q_i, t_i)$$

Concurrently, TextGrad is used to update $t^*$ by directly searching the reward space for the textual feedback that maximally improves downstream target accuracy. This train-free search guides the entire bootstrapping process and ensures that only high-value feedback is used to direct prompt optimization.

The framework is iterated for $K$ cycles, with each round producing increasingly effective, query-aware prompts and feedback.
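
The SFT step amounts to a standard token-level negative log-likelihood of the prompt, conditioned on the concatenated question and feedback. A minimal Hugging Face style sketch of that loss is shown below; the checkpoint choice, prompt template, and masking details are simplified assumptions rather than the paper's training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint serves for illustration; the actual prompt model is a design choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def sft_loss(question: str, feedback: str, prompt: str) -> torch.Tensor:
    """-log P(p | q, t): NLL of the prompt tokens, with the (q, t) context masked out."""
    context = f"Question: {question}\nFeedback: {feedback}\nPrompt:"
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + prompt, return_tensors="pt").input_ids

    labels = full_ids.clone()
    labels[:, : context_ids.shape[1]] = -100  # -100 tokens are ignored by the CE loss

    out = model(input_ids=full_ids, labels=labels)
    return out.loss  # average -log P(p | q, t) over the prompt tokens
```

Averaging this loss over the synthetic tuples in $D_{train}$ and backpropagating gives one SFT update of the prompt model.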

6. Formalization and Mathematical Expressions

TRPrompt’s optimization objectives center on finding $(\Pi^*_{query}, t^*)$ to maximize expected task reward:

$$\Pi^*_{query} = \arg\max_{\Pi_{query}} \sum_i r\big(y^*_i,\ M_{target}(q_i, \Pi_{query}(q_i, t^*))\big)$$

$$t^* = \arg\max_{t \in T} \sum_i r\big(y^*_i,\ M_{target}(q_i, \Pi_{query}(q_i, t))\big)$$

The SFT loss guides prompt model training:

$$L_{SFT} = -\mathbb{E}_{(p_i, q_i, t_i) \sim D_{train}} \log P(p_i \mid q_i, t_i)$$

7. Empirical Results and Significance

TRPrompt yields state-of-the-art results among prompt optimization frameworks, as summarized below:

| Dataset | Method            | Accuracy (%) | Query-aware | Training-free |
|---------|-------------------|--------------|-------------|---------------|
| GSMHard | TRPrompt          | 31.76        | Yes         | Yes           |
| GSMHard | Prompt-OIRL / QPO | ≤ 30         | Partial     | No            |
| MATH    | TRPrompt          | 41.37        | Yes         | Yes           |
| MATH    | CoT prompts, QPO  | ≤ 39         | No          | Partial       |

  • Gains of approximately +1% on GSMHard and +2% on MATH over best baselines.
  • High consistency across rounds due to automatic, feedback-driven dataset construction.

The framework is robust to initial conditions and generalizes across varied mathematical queries, reflecting the advantage of rich, high-resolution textual reward supervision over conventional methods.


TRPrompt establishes a new paradigm for prompt optimization by embedding detailed textual feedback into an iterative, supervised prompt model training loop. It achieves strong performance improvements in complex reasoning domains by leveraging query-aware prompts, synthetic data generation, and train-free textual reward search, laying the foundation for further advances in LLM instruction following and adaptivity.
