PRewrite: Prompt Rewriting with RL

Updated 9 March 2026

The paper introduces a novel PRewrite framework that uses reinforcement learning to optimize prompts, enhancing interpretability and performance.
It casts prompt rewriting as an MDP, employing state, action, and reward structures with methods like PPO and GRPO for stable learning.
Empirical results show significant gains in metrics like BLEU and accuracy across diverse tasks, from personalized text to text-to-image generation.

Prompt Rewriting with Reinforcement Learning (PRewrite) refers to a family of techniques that automate the optimization of input prompts for large language models or vision-language models, using reinforcement learning (RL) as the primary mechanism for search, evaluation, and improvement. This paradigm allows the discovery of more effective, interpretable, and human-editable prompts than those produced by conventional hand-engineering, directly targeting downstream performance metrics and aligning prompt policies with end-task desiderata.

1. Formalization and RL Problem Structure

Prompt rewriting is cast as a Markov Decision Process (MDP), commonly with the following instantiation:

State space $\mathcal{S}$: Each state $s \in \mathcal{S}$ typically encodes the original under-optimized prompt (instruction, template, or segment graph), possibly augmented with context (e.g., input instance, demonstration, or user query) (Kong et al., 2024, Li et al., 2023, Liu et al., 2024).
Action space $\mathcal{A}$: Actions correspond either to discrete token-level generation (for sequence-to-sequence rewriters) or higher-level edit operations such as INSERT, DELETE, SUBSTITUTE on structured prompt templates or in-context example sets (Liu et al., 2024).
Transition dynamics: Either deterministic concatenation (autoregressive rewriting), or application of discrete edit operations; the process may terminate after a fixed number of tokens or when a STOP token/edit is emitted.
Policy $\pi_\theta$: Parameterized by model weights $\theta$, often realized by a (partially or fully) frozen LLM with lightweight trainable heads for RL optimization (Kong et al., 2024).
Reward $R(s, a)$: Defined as an externally measured, possibly non-differentiable metric computed on the downstream model’s output after applying the candidate prompt. Typical choices include task accuracy, EM, ROUGE, BLEU, human preference, or more domain-specific compositional and alignment metrics (Kong et al., 2024, Wang et al., 1 Feb 2026, Lee et al., 1 Oct 2025).
Objective: Learn $\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a full prompt rewrite.
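
Because the reward is generally non-differentiable, an objective of this form is usually optimized with a score-function (REINFORCE-style) estimator. The identity below is the standard policy-gradient result rather than a formula taken from any one cited paper. It assumes $\tau$ consists of actions $a_0, \dots, a_{T-1}$ taken in states $s_0, \dots, s_{T-1}$, and $b$ is an optional baseline that does not depend on the actions (both the indexing and $b$ are notation introduced here for illustration):

$$
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \big(R(\tau) - b\big) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
$$

Only log-probabilities of the rewriter are differentiated, which is why the downstream model can remain a black box.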

This abstraction supports prompt-only optimization for black-box models, downstream RL-based reward maximization, and structured exploration of prompt spaces.
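
The edit-based variant of this MDP can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any cited method: the names `Action`, `PromptEditEnv`, and `toy_reward` are hypothetical, the prompt is assumed to be a flat list of text segments, and the reward is a stand-in keyword check rather than a real downstream metric.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """One edit; all field names are illustrative assumptions."""
    op: str                      # "INSERT", "DELETE", "SUBSTITUTE", or "STOP"
    index: int = 0               # segment position the edit applies to
    text: Optional[str] = None   # new segment for INSERT / SUBSTITUTE


class PromptEditEnv:
    """Deterministic edit-based MDP over a prompt stored as a list of segments."""

    def __init__(self, prompt: List[str], reward_fn: Callable[[str], float],
                 max_steps: int = 8):
        self.initial = list(prompt)
        self.reward_fn = reward_fn   # black-box downstream metric (assumed)
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> List[str]:
        self.segments = list(self.initial)
        self.steps = 0
        return list(self.segments)

    def step(self, action: Action) -> Tuple[List[str], float, bool]:
        self.steps += 1
        in_range = 0 <= action.index < len(self.segments)
        if action.op == "INSERT":
            self.segments.insert(action.index, action.text or "")
        elif action.op == "DELETE":
            if in_range:
                del self.segments[action.index]
        elif action.op == "SUBSTITUTE":
            if in_range:
                self.segments[action.index] = action.text or ""
        elif action.op != "STOP":
            raise ValueError(f"unknown edit operation: {action.op}")
        # The episode ends on STOP or when the step budget is exhausted.
        done = action.op == "STOP" or self.steps >= self.max_steps
        # The reward is delayed: it is only measured on the finished rewrite.
        reward = self.reward_fn(" ".join(self.segments)) if done else 0.0
        return list(self.segments), reward, done


def toy_reward(prompt: str) -> float:
    """Stand-in metric: fraction of two hand-picked phrases present."""
    required = ("step by step", "concise")
    return sum(k in prompt.lower() for k in required) / len(required)


env = PromptEditEnv(["Answer the question."], toy_reward)
env.step(Action("INSERT", 1, "Think step by step."))
env.step(Action("INSERT", 2, "Be concise."))
final_prompt, reward, done = env.step(Action("STOP"))
print(final_prompt, reward, done)
```

In a real setup `reward_fn` would run the downstream model with the candidate prompt and score its output, and the actions would be chosen by the policy $\pi_\theta$ rather than hard-coded.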

2. Policy Architectures and Learning Algorithms

Approaches to PRewrite utilize diverse architectures and RL algorithms:

Sequence-to-sequence policies: The most direct formulation (Kong et al., 2024, Li et al., 2023); the rewriter model generates a revised prompt autoregressively, with each action being the emission of the next token.
Graph-based or structured policies: The prompt is encoded as a labeled segment graph; actions are edit operations chosen via policy networks operating on graph embeddings (Liu et al., 2024).
Plug-and-play and collaborative agents: Co-training settings where a small RL-optimized LLM composes prompts that steer a larger environment model or diffusion model (Liu et al., 2 Nov 2025, Lee et al., 1 Oct 2025).

Optimization typically uses on-policy RL algorithms:

Proximal Policy Optimization (PPO): Commonly adopted for stability and variance reduction (Kong et al., 2024, Li et al., 2023, Lee et al., 1 Oct 2025).
Group Relative Policy Optimization (GRPO): A variant that employs group-wise normalization and advantage calculation for sample-efficient multi-candidate evaluation, central in flow-based text-to-image settings (Wang et al., 1 Feb 2026, Wang et al., 4 Sep 2025, Lee et al., 1 Oct 2025).
Actor-Critic and hybrid schemes: Leveraging both actor ("rewriter") and critic (reward estimator or LLM) to guide iterative prompt editing (Dong et al., 2023).

Supervised pretraining ("warm up") is often used to initialize policies and reduce the RL search space, followed by RL fine-tuning to exploit non-differentiable rewards (Li et al., 2023, Wang et al., 4 Sep 2025).
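
The two on-policy objectives named above can be sketched with the standard library alone. This is a simplified illustration, not the training code of any cited system: the function names are hypothetical, log-probabilities are treated at the whole-rewrite level rather than per token, the population standard deviation is an assumed choice, and the KL regularization term often used alongside these objectives is omitted.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one group of
    candidate rewrites sampled for the same source prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO clipped objective for a single sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four candidate rewrites of one prompt with made-up downstream rewards.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Made-up log-probabilities under the current and the sampling policy.
objective = mean([clipped_surrogate(-1.0, -1.2, a) for a in advantages])
print(advantages, objective)
```

Standardizing within the group removes the need for a learned value function, which is one reason group-relative schemes are convenient when several rewrites of the same prompt are scored together.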

3. Reward Design and Variance Stabilization

Effective PRewrite relies critically on robust reward design:

Downstream task metrics: Primary scalar reward is usually exact match, accuracy, or automated similarity between generated and gold outputs (Kong et al., 2024, Li et al., 2023).
Multi-dimensional or composite rewards: For complex tasks (text-to-image, multi-turn reasoning), composite rewards such as $\lambda_\text{format} R_\text{format} + \lambda_\text{gen} R_\text{gen}$ are used, where $R_\text{gen}$ may reflect GenEval, PickScore, EditReward, or other relevant criteria (Wang et al., 1 Feb 2026).
Auxiliary shaping: Embedding-based or knowledge-graph-informed shaping terms are applied to smooth or regularize learning (Liu et al., 2024).
Hard gating: Task-consistency gates or format constraints prevent update propagation from invalid rewrites to maintain feasible prompt spaces (Wang et al., 11 Feb 2026).
Group normalization and retention: Reward normalization within prompt groups, retention of original prompt variants, and selective advantage assignment are essential for variance reduction and sample efficiency (Wang et al., 1 Feb 2026).

Zero-mean, unit-variance normalization, KL penalties, and replay buffers further improve stability, especially in cases of multi-reward or multi-stage training (Wang et al., 1 Feb 2026, Lin et al., 7 Oct 2025).
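
A composite reward with hard gating and group-wise normalization might be combined as in the sketch below. Everything here is an assumption made for illustration: the weights `lam_format` and `lam_gen`, the convention of marking gated rewrites with `None`, and the candidate scores are invented, and the cited papers may define these components differently.

```python
from statistics import mean, pstdev
from typing import List, Optional


def composite_reward(r_format: float, r_gen: float,
                     lam_format: float = 0.2, lam_gen: float = 0.8) -> float:
    """Weighted sum lam_format * R_format + lam_gen * R_gen (weights assumed)."""
    return lam_format * r_format + lam_gen * r_gen


def gated_group_advantages(scores: List[Optional[float]],
                           eps: float = 1e-8) -> List[float]:
    """Standardize valid scores within a group; gated entries (None) receive
    zero advantage, so they contribute nothing to the policy update."""
    valid = [s for s in scores if s is not None]
    if len(valid) < 2:
        return [0.0] * len(scores)
    mu, sigma = mean(valid), pstdev(valid)
    return [0.0 if s is None else (s - mu) / (sigma + eps) for s in scores]


# One group: the retained original prompt plus three rewrites.
# Each tuple is (R_format, R_gen, passes_gate); all values are made up.
candidates = [
    (1.0, 0.60, True),    # original prompt, kept in the group
    (1.0, 0.90, True),    # rewrite A
    (0.0, 0.95, False),   # rewrite B violates the format constraint
    (1.0, 0.30, True),    # rewrite C
]
scores = [composite_reward(f, g) if ok else None for f, g, ok in candidates]
advantages = gated_group_advantages(scores)
print(scores, advantages)
```

Keeping the original prompt in the group gives the normalization a fixed reference point: a rewrite only earns a positive advantage if it beats the group average that includes the unmodified prompt.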

4. Specializations Across Modalities and Tasks

PRewrite is instantiated across a broad spectrum of modalities and application domains:

Text-to-image generation: PromptRL (Wang et al., 1 Feb 2026) embeds an LM-based rewriter within the RL optimization loop of a flow-matching generator. The LM policy generates diverse paraphrases and refinements, directly minimizing compositional errors and overfitting. Quantitative gains include GenEval=0.97, OCR accuracy=0.98, PickScore=24.05. Prompt retention and group-wise normalization enable a >2$\times$ reduction in required rollouts over flow-only RL.
Personalized text generation: PRewrite (Li et al., 2023) augments a sequence-to-sequence rewriter with both supervised bootstrapping and PPO-based RL, optimizing BLEU on personalized emails, reviews, and conversations. Gains of +3.6 to +8.1 BLEU over baselines are reported.
Dialogue control: RL-based prompt generators can steer black-box chat models with respect to emotion, topic, or intent by treating prompt generation as the policy and using API-accessible outputs as delayed rewards (Su et al., 2022).
Long-term planning and multi-turn optimization: Reinforced Prompt Optimization (RPO) employs episodic feedback and experience replay to handle multi-turn SQL, dialogue, and complex reasoning pipelines (Lin et al., 7 Oct 2025).
Instruction induction and template editing: Methods such as PACE apply multi-step, actor-critic RL loops to iteratively improve prompt structures for classification and generative tasks (Dong et al., 2023, Liu et al., 2024).
Plug-and-play and collaborative RL: Modular agents that iteratively refine prompts at each generation step (e.g., via diffusion latents or LLM-based "feedbackers") generalize prompt rewriting to arbitrary downstream black-box models (Liu et al., 2 Nov 2025, Lee et al., 1 Oct 2025).
Medical and domain-specific applications: EMPOWER hybridizes RL and evolutionary search with specialized medical terminology attention, yielding a 24.7% reduction in factual error and 19.6% enhancement in domain specificity (Chen et al., 25 Aug 2025).
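
The experience replay mentioned for multi-turn optimization can be illustrated with a small fixed-capacity buffer. This is a generic sketch and not the RPO authors' implementation: the `Episode` fields, the `ReplayBuffer` class, uniform sampling, and the oldest-first eviction rule are all assumptions.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple


class Episode(NamedTuple):
    source_prompt: str
    rewritten_prompt: str
    reward: float   # episodic feedback observed after the full multi-turn run


class ReplayBuffer:
    """Fixed-capacity store of past episodes with uniform sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        self._items: Deque[Episode] = deque(maxlen=capacity)  # oldest evicted first
        self._rng = random.Random(seed)

    def add(self, episode: Episode) -> None:
        self._items.append(episode)

    def sample(self, batch_size: int) -> List[Episode]:
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(Episode("original prompt", f"rewrite {i}", reward=i / 10))

batch = buffer.sample(2)
print(len(buffer), [e.rewritten_prompt for e in batch])
```

Reusing stored episodes reduces the number of expensive multi-turn rollouts, at the cost of training on data that came from an older version of the policy.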

5. Quantitative Benchmarks and Empirical Insights

Systematic evaluations reveal robust improvements in downstream metrics:

Method / Domain	Main Metric	Baseline	PRewrite / RL-rewrite	Absolute Gain
AG News (text classification) (Kong et al., 2024)	accuracy	76.9%	85.2%	+8.3%
Personalized Email (Li et al., 2023)	BLEU	9.59	13.18	+3.59
Text-to-Image GenEval (Wang et al., 1 Feb 2026)	GenEval	0.92	0.97	+0.05
Medical Factual Consistency (Chen et al., 25 Aug 2025)	FCS	86.1%	91.4%	+5.3%
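
The gains in the table are absolute differences on metrics with very different scales, so they are not directly comparable across rows. The short script below recomputes the absolute gains from the baseline and improved values listed above and also derives relative gains. The relative figures are computed here for illustration and are not reported in the cited sources.

```python
from typing import Tuple

# (row label, baseline, RL-rewritten) copied from the table above.
rows = [
    ("AG News accuracy (%)", 76.9, 85.2),
    ("Personalized Email BLEU", 9.59, 13.18),
    ("Text-to-Image GenEval", 0.92, 0.97),
    ("Medical FCS (%)", 86.1, 91.4),
]


def gains(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent of the baseline)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


for label, baseline, improved in rows:
    absolute, relative = gains(baseline, improved)
    print(f"{label}: +{absolute} absolute, +{relative}% relative")
```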

Ablations demonstrate:

The necessity of domain-aware reward shaping (e.g., removal of "summary" from personalized rewrite inputs decreases BLEU) (Li et al., 2023).
The impact of prompt retention and group-wise normalization on flow-based image models (Wang et al., 1 Feb 2026).
The joint necessity of diversity shaping and alignment shaping for high in-domain performance and retention on generalization tasks (Wang et al., 11 Feb 2026).

In terms of interpretability, the learned prompts are more human-editable and stylistically rich, and they avoid degenerate or overfitted formulations.

6. Challenges, Limitations, and Extensions

Critical open challenges include:

Reward hacking and over-optimization: RL on fixed reward signals can encourage pathological solutions; approaches like PromptLoop attempt to mitigate this via latent feedback and stepwise rewrites (Lee et al., 1 Oct 2025).
Generalization and catastrophic forgetting: Prompt-centered RL can preserve generalization across domains and reduce forgetting compared to standard SFT, but is sensitive to the diversity-promoting mechanisms and task-alignment (Wang et al., 11 Feb 2026).
Variance and stability: Hard filtering, experience replay, and reward shaping are needed to tame high-variance signals intrinsic to prompt-level RL.
Domain-specificity: Clinical or safety-critical domains require multi-dimensional assessment, structure preservation, and semantic verification beyond standard text similarity or accuracy (Chen et al., 25 Aug 2025).
Computational efficiency: RL training and evaluation (e.g., running large LLMs as black-box rewarders) can be computationally burdensome; various parameter-efficient and plug-and-play adaptations are employed (Kong et al., 2024, Liu et al., 2 Nov 2025).

Extensions include knowledge-graph informed policy networks (Liu et al., 2024), plug-and-play framework design (Liu et al., 2 Nov 2025), experience-replay stabilization (Lin et al., 7 Oct 2025), and hybrid evolutionary-RL integration (Chen et al., 25 Aug 2025).

7. Significance, Outlook, and Comparative Impact

PRewrite paradigms have elevated prompt engineering from incremental, hand-tuned heuristics to scalable, interpretable, and model-agnostic optimization tasks. They have demonstrated:

State-of-the-art downstream performance with significantly improved sample efficiency (Wang et al., 1 Feb 2026).
Prompt generalization and robustness to diverse inputs and evolving models (Kong et al., 2024, Lin et al., 7 Oct 2025).
Broad applicability, from personalized generation to multi-turn, multi-modal, and medical reasoning.
The capacity, via reward shaping, to address classic RL trade-offs (e.g., diversity–alignment, precision–recall) in the context of prompt space.

Future directions include joint optimization of prompts and rewarders, finer-grained control via explicit policy networks over template segments, RL-based co-training of generator and rewriter modules, richer domain adaptation, and principled integration of human-in-the-loop evaluation for safety-critical or creative domains.