
PRewrite: Prompt Rewriting with RL

Updated 9 March 2026
  • The paper introduces a novel PRewrite framework that uses reinforcement learning to optimize prompts, enhancing interpretability and performance.
  • It casts prompt rewriting as an MDP, employing state, action, and reward structures with methods like PPO and GRPO for stable learning.
  • Empirical results show significant gains in metrics like BLEU and accuracy across diverse tasks, from personalized text to text-to-image generation.

Prompt Rewriting with Reinforcement Learning (PRewrite) refers to a family of techniques that automate the optimization of input prompts for large language models or vision-language models, using reinforcement learning (RL) as the primary mechanism for search, evaluation, and improvement. Because the reward is computed on downstream outputs, these methods target end-task performance metrics directly, and they can discover prompts that are more effective, interpretable, and human-editable than those produced by conventional hand-engineering.

1. Formalization and RL Problem Structure

Prompt rewriting is cast as a Markov Decision Process (MDP), commonly with the following instantiation:

  • State space $\mathcal{S}$: Each state $s \in \mathcal{S}$ typically encodes the original under-optimized prompt (instruction, template, or segment graph), possibly augmented with context (e.g., input instance, demonstration, or user query) (Kong et al., 2024, Li et al., 2023, Liu et al., 2024).
  • Action space $\mathcal{A}$: Actions correspond either to discrete token-level generation (for sequence-to-sequence rewriters) or to higher-level edit operations such as INSERT, DELETE, SUBSTITUTE on structured prompt templates or in-context example sets (Liu et al., 2024).
  • Transition dynamics: Either deterministic concatenation (autoregressive rewriting) or application of discrete edit operations; the process may terminate after a fixed number of tokens or when a STOP token/edit is emitted.
  • Policy $\pi_\theta$: Parameterized by model weights $\theta$, often realized by a (partially or fully) frozen LLM with lightweight trainable heads for RL optimization (Kong et al., 2024).
  • Reward $R(s, a)$: Defined as an externally measured, possibly non-differentiable metric computed on the downstream model's output after applying the candidate prompt. Typical choices include task accuracy, EM, ROUGE, BLEU, human preference, or more domain-specific compositional and alignment metrics (Kong et al., 2024, Wang et al., 1 Feb 2026, Lee et al., 1 Oct 2025).
  • Objective: Learn $\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a full prompt rewrite.

This abstraction supports prompt-only optimization for black-box models, downstream RL-based reward maximization, and structured exploration of prompt spaces.
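The edit-level variant of this MDP can be sketched in a few lines of Python. This is a minimal illustration, not any cited paper's implementation: the `Edit` class, the `step` and `rollout` functions, and the toy `reward_fn` are all hypothetical names, and the reward is a stand-in for a real downstream metric such as accuracy or BLEU.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

State = Tuple[str, ...]  # a prompt represented as an ordered tuple of segments


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit-level action: INSERT, DELETE, SUBSTITUTE, or STOP."""
    op: str
    index: int = 0
    text: Optional[str] = None


def step(state: State, edit: Edit) -> Tuple[State, bool]:
    """Deterministic transition: apply one edit and return (next_state, done)."""
    if edit.op == "STOP":
        return state, True
    segments = list(state)
    if edit.op == "INSERT":
        segments.insert(edit.index, edit.text or "")
    elif edit.op == "DELETE":
        del segments[edit.index]
    elif edit.op == "SUBSTITUTE":
        segments[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return tuple(segments), False


def rollout(prompt: List[str], edits: List[Edit],
            reward_fn: Callable[[str], float],
            max_steps: int = 8) -> Tuple[str, float]:
    """Run one episode; the reward is observed only on the final rewrite."""
    state: State = tuple(prompt)
    for edit in edits[:max_steps]:
        state, done = step(state, edit)
        if done:
            break
    final = " ".join(state)
    return final, reward_fn(final)


def toy_reward(prompt: str) -> float:
    # Assumption: stands in for scoring a black-box model's output.
    return 1.0 if "one word" in prompt else 0.0


final_prompt, reward = rollout(
    ["Classify the text."],
    [Edit("INSERT", 1, "Answer with one word."),
     Edit("SUBSTITUTE", 0, "Classify the news article by topic."),
     Edit("STOP")],
    toy_reward,
)
```

In a real system the fixed `edits` list would be replaced by samples from the policy $\pi_\theta$, and `reward_fn` would query the downstream model with the rewritten prompt.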

2. Policy Architectures and Learning Algorithms

Approaches to PRewrite utilize diverse architectures and RL algorithms:

  • Sequence-to-sequence policies: Most direct, as in (Kong et al., 2024, Li et al., 2023); the rewriter model generates a revised prompt autoregressively, with actions as next-token emission.
  • Graph-based or structured policies: The prompt is encoded as a labeled segment graph; actions are edit operations chosen via policy networks operating on graph embeddings (Liu et al., 2024).
  • Plug-and-play and collaborative agents: Co-training settings where a small RL-optimized LLM composes prompts that steer a larger environment model or diffusion model (Liu et al., 2 Nov 2025, Lee et al., 1 Oct 2025).

Optimization typically uses on-policy RL algorithms such as PPO and GRPO.

Supervised pretraining ("warm up") is often used to initialize policies and reduce the RL search space, followed by RL fine-tuning to exploit non-differentiable rewards (Li et al., 2023, Wang et al., 4 Sep 2025).
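To make the RL fine-tuning step concrete, the sketch below applies the basic score-function (REINFORCE) update, $\nabla_\theta J = \mathbb{E}[(R - b)\,\nabla_\theta \log \pi_\theta(a)]$, to a toy softmax policy over three fixed candidate rewrites. It is a deliberately simplified stand-in for PPO or GRPO, not a reproduction of either. The reward values, function names, and hyperparameters are illustrative assumptions, and the baseline $b$ is computed exactly only because the toy rewards are known in advance; practical systems estimate it.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards: List[float], steps: int = 500,
              lr: float = 0.5, seed: int = 0) -> List[float]:
    """REINFORCE with a baseline over a fixed set of candidate rewrites.

    rewards[k] is the (assumed, deterministic) downstream score obtained
    when candidate rewrite k is used as the prompt.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one rewrite from the current policy (on-policy rollout).
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        # Baseline: expected reward under the current policy (toy shortcut).
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[a] - baseline
        # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
        for k in range(len(logits)):
            grad_logp = (1.0 if k == a else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_logp
    return softmax(logits)


CANDIDATE_REWARDS = [0.2, 0.9, 0.5]  # hypothetical downstream scores
final_probs = reinforce(CANDIDATE_REWARDS)
```

After training, `final_probs` places the most mass on the second candidate, the one with the highest reward. A supervised warm-up corresponds to starting from informed logits rather than the uniform initialization used here.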

3. Reward Design and Variance Stabilization

Effective PRewrite relies critically on robust reward design:

  • Downstream task metrics: Primary scalar reward is usually exact match, accuracy, or automated similarity between generated and gold outputs (Kong et al., 2024, Li et al., 2023).
  • Multi-dimensional or composite rewards: For complex tasks (text-to-image, multi-turn reasoning), composite rewards such as $\lambda_\text{format} R_\text{format} + \lambda_\text{gen} R_\text{gen}$ are used, where $R_\text{gen}$ may reflect GenEval, PickScore, EditReward, or other relevant criteria (Wang et al., 1 Feb 2026).
  • Auxiliary shaping: Embedding-based or knowledge-graph-informed shaping terms are applied to smooth or regularize learning (Liu et al., 2024).
  • Hard gating: Task-consistency gates or format constraints prevent update propagation from invalid rewrites to maintain feasible prompt spaces (Wang et al., 11 Feb 2026).
  • Group normalization and retention: Reward normalization within prompt groups, retention of original prompt variants, and selective advantage assignment are essential for variance reduction and sample efficiency (Wang et al., 1 Feb 2026).

Zero-mean, unit-variance normalization, KL penalties, and replay buffers further improve stability, especially in cases of multi-reward or multi-stage training (Wang et al., 1 Feb 2026, Lin et al., 7 Oct 2025).
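The following sketch shows one way the composite reward, hard gating, and group normalization described above could fit together. It is an illustration under stated assumptions rather than the procedure of any cited paper: the weights `lam_format` and `lam_gen`, the rule that a rewrite is invalid when its format reward is zero, and the choice to give gated rewrites zero advantage are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def composite_reward(r_format: float, r_gen: float,
                     lam_format: float = 0.2, lam_gen: float = 0.8) -> float:
    """Weighted sum lam_format * R_format + lam_gen * R_gen (weights assumed)."""
    return lam_format * r_format + lam_gen * r_gen


def group_advantages(rewards: List[float], valid: List[bool],
                     eps: float = 1e-8) -> List[float]:
    """Zero-mean, unit-variance normalization within one prompt group.

    Rewrites that fail the gate receive advantage 0.0, so they contribute
    no gradient signal; statistics are computed over valid rewrites only.
    """
    kept = [r for r, ok in zip(rewards, valid) if ok]
    if not kept:
        return [0.0] * len(rewards)
    mu, sigma = mean(kept), pstdev(kept)
    return [(r - mu) / (sigma + eps) if ok else 0.0
            for r, ok in zip(rewards, valid)]


# Four rewrites of the same source prompt: (R_format, R_gen) pairs.
scores = [(1.0, 0.5), (1.0, 0.9), (0.0, 0.95), (1.0, 0.7)]
rewards = [composite_reward(f, g) for f, g in scores]
valid = [f > 0.0 for f, _ in scores]  # assumed format gate
advantages = group_advantages(rewards, valid)
```

In this example the third rewrite has the highest generation score but fails the format gate, so its advantage is zero and it cannot pull the policy toward malformed prompts. The remaining three are centered on their group mean, which removes the prompt-specific reward offset that would otherwise inflate gradient variance.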

4. Specializations Across Modalities and Tasks

PRewrite is instantiated across a broad spectrum of modalities and application domains:

  • Text-to-image generation: PromptRL (Wang et al., 1 Feb 2026) embeds an LM-based rewriter within the RL optimization loop of a flow-matching generator. The LM policy generates diverse paraphrases and refinements, directly minimizing compositional errors and overfitting. Quantitative gains include GenEval = 0.97, OCR accuracy = 0.98, and PickScore = 24.05. Prompt retention and group-wise normalization enable a >2$\times$ reduction in required rollouts over flow-only RL.
  • Personalized text generation: PRewrite (Li et al., 2023) augments a sequence-to-sequence rewriter with both supervised bootstrapping and PPO-based RL, optimizing BLEU on personalized emails, reviews, and conversations. Gains of +3.6 to +8.1 BLEU over baselines are reported.
  • Dialogue control: RL-based prompt generators can steer black-box chat models with respect to emotion, topic, or intent by treating prompt generation as the policy and using API-accessible outputs as delayed rewards (Su et al., 2022).
  • Long-term planning and multi-turn optimization: Reinforced Prompt Optimization (RPO) employs episodic feedback and experience replay to handle multi-turn SQL, dialogue, and complex reasoning pipelines (Lin et al., 7 Oct 2025).
  • Instruction induction and template editing: Methods such as PACE apply multi-step, actor-critic RL loops to iteratively improve prompt structures for classification and generative tasks (Dong et al., 2023, Liu et al., 2024).
  • Plug-and-play and collaborative RL: Modular agents that iteratively refine prompts at each generation step (e.g., via diffusion latents or LLM-based "feedbackers") generalize prompt rewriting to arbitrary downstream black-box models (Liu et al., 2 Nov 2025, Lee et al., 1 Oct 2025).
  • Medical and domain-specific applications: EMPOWER hybridizes RL and evolutionary search with specialized medical terminology attention, yielding a 24.7% reduction in factual error and 19.6% enhancement in domain specificity (Chen et al., 25 Aug 2025).

5. Quantitative Benchmarks and Empirical Insights

Systematic evaluations reveal robust improvements in downstream metrics:

| Method / Domain | Main Metric | Baseline | PRewrite / RL-rewrite | Absolute Gain |
|---|---|---|---|---|
| AG News (text classification) (Kong et al., 2024) | Accuracy | 76.9% | 85.2% | +8.3% |
| Personalized Email (Li et al., 2023) | BLEU | 9.59 | 13.18 | +3.59 |
| Text-to-Image GenEval (Wang et al., 1 Feb 2026) | GenEval | 0.92 | 0.97 | +0.05 |
| Medical Factual Consistency (Chen et al., 25 Aug 2025) | FCS | 86.1% | 91.4% | +5.3% |

Ablations demonstrate:

In terms of interpretability, the learned prompts are more human-editable and stylistically rich, and they avoid degenerate or overfitted formulations.

6. Challenges, Limitations, and Extensions

Critical open challenges include:

  • Reward hacking and over-optimization: RL on fixed reward signals can encourage pathological solutions; approaches like PromptLoop attempt to mitigate this via latent feedback and stepwise rewrites (Lee et al., 1 Oct 2025).
  • Generalization and catastrophic forgetting: Prompt-centered RL can preserve generalization across domains and reduce forgetting compared to standard SFT, but is sensitive to the diversity-promoting mechanisms and task-alignment (Wang et al., 11 Feb 2026).
  • Variance and stability: Hard filtering, experience replay, and reward shaping are needed to tame high-variance signals intrinsic to prompt-level RL.
  • Domain-specificity: Clinical or safety-critical domains require multi-dimensional assessment, structure preservation, and semantic verification beyond standard text similarity or accuracy (Chen et al., 25 Aug 2025).
  • Computational efficiency: RL training and evaluation (e.g., running large LLMs as black-box rewarders) can be computationally burdensome; various parameter-efficient and plug-and-play adaptations are employed (Kong et al., 2024, Liu et al., 2 Nov 2025).

Extensions include knowledge-graph informed policy networks (Liu et al., 2024), plug-and-play framework design (Liu et al., 2 Nov 2025), experience-replay stabilization (Lin et al., 7 Oct 2025), and hybrid evolutionary-RL integration (Chen et al., 25 Aug 2025).

7. Significance, Outlook, and Comparative Impact

PRewrite paradigms have elevated prompt engineering from incremental, hand-tuned heuristics to scalable, interpretable, and model-agnostic optimization tasks. They have demonstrated:

  • State-of-the-art downstream performance with significantly improved sample efficiency (Wang et al., 1 Feb 2026).
  • Prompt generalization and robustness to diverse inputs and evolving models (Kong et al., 2024, Lin et al., 7 Oct 2025).
  • Broad applicability, from personalized generation to multi-turn, multi-modal, and medical reasoning.
  • The capacity, via reward shaping, to address classic RL trade-offs (e.g., diversity–alignment, precision–recall) in the context of prompt space.

Future directions include joint optimization of prompts and rewarders, finer-grained control via explicit policy networks over template segments, RL-based co-training of generator and rewriter modules, richer domain adaptation, and principled integration of human-in-the-loop evaluation for safety-critical or creative domains.
