RL²F: Reinforcement Learning with Language Feedback
- RL²F is a paradigm that augments classical reinforcement learning with natural language feedback to provide precise guidance and richer exploratory signals.
- It integrates techniques like preference modeling, critique-based updates, and semantic rewards to enhance sample efficiency and policy alignment.
- Applied across domains such as LLM post-training, robotics, and autonomous reasoning, RL²F achieves faster convergence and robust performance in complex tasks.
Reinforcement Learning with Language Feedback (RL²F) is a paradigm in which reinforcement learning agents leverage natural language as an explicit, actionable learning signal. Departing from classical RL formulations that rely exclusively on scalar, manually engineered rewards, RL²F integrates textual feedback—generated by humans or LLMs—to condition policy improvement, shape exploration, impose constraints, deliver dense gradient signals, or facilitate credit assignment. Techniques range from preference modeling and constraint enforcement to span-level update propagation, spanning domains as varied as LLM post-training, robotics, continuous control, embodied agents, and autonomous reasoning. RL²F unifies and generalizes RL from Human Feedback (RLHF), advanced preference learning, and recent efforts on fine-grained, interpretable policy optimization.
1. Problem Setting and Core Principles
RL²F generalizes the conventional Markov Decision Process (MDP) by augmenting each transition or trajectory with associated natural language feedback. The agent observes a state $s_t$, executes action $a_t$, receives environment reward $r_t$, and additionally receives or generates language feedback $\ell_t$ (per step) or $\ell_\tau$ (per trajectory) that encodes evaluative, corrective, or prescriptive information about the behavior or outcome. Formally, the interaction proceeds under an MDP or POMDP tuple extended with a language feedback channel (e.g., $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \mathcal{L})$).
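As a concrete illustration (the class and field names below are our own, not from any cited system), a language-augmented transition can be represented as a plain record that extends the usual `(state, action, reward, next_state)` tuple with an optional feedback string:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageTransition:
    """One step of an MDP extended with a language feedback channel.

    In addition to the standard (state, action, reward, next_state) fields,
    the transition may carry natural-language feedback that is evaluative,
    corrective, or prescriptive.
    """
    state: object
    action: object
    reward: float
    next_state: object
    feedback: Optional[str] = None  # e.g. a human or LLM critique of this step
    done: bool = False


step = LanguageTransition(
    state="s0", action="a0", reward=0.0, next_state="s1",
    feedback="Good progress, but avoid the left wall.",
)
```

Downstream components then decide how to consume `feedback`: as input to a reward model, a constraint discriminator, or a refinement prompt.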
RL²F systems support multiple feedback modalities:
- Preference or ranking signals: Pairwise or multi-way comparisons over rollouts/trajectories, often producing a reward model via inverse RL or Bradley–Terry-style cross-entropy (Jian et al., 21 Apr 2025).
- Free-form critiques or structured corrections: Dense corrections at the span, token, or trajectory level (Wang et al., 28 May 2025, Zhou et al., 2024).
- Task constraints and prohibitions: Language-discriminators detecting violations of specified behaviors, steering the policy within a constrained MDP (Zhang et al., 2020).
- Exploratory scaffolding and refinement: Language-derived suggestions or aggregated critiques directly injected as off-policy scaffolds or prompts (Huang et al., 4 Mar 2026, Li et al., 18 Oct 2025).
- Semantic alignment: Embedding environment states and goals in a language manifold, with reward defined by semantic similarity (Liang et al., 8 Aug 2025).
The agent's policy is thus conditioned or shaped, directly or indirectly, by language. Feedback can be supplied online during RL, retroactively from a dataset, or synthesized/abstracted through self-reflection, group dynamics, or historical summary (Li et al., 18 Oct 2025, Huang et al., 4 Mar 2026).
2. Representative Algorithms and Feedback Integration
RL²F encompasses diverse algorithms that adapt classical RL, preference-based methods, and modern LLM fine-tuning techniques to harness natural language feedback:
2.1 Policy Optimization with Preference and Critique Modeling
RL²F generalizes RLHF by training a reward model $r_\phi$ from preference or ranking data (human or LLM-generated), often using binary cross-entropy objectives over pairwise differences. The policy $\pi_\theta$ is then optimized, typically with PPO or its variants, to maximize expected return under $r_\phi$, usually with a KL penalty against a reference policy:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

(Sun, 2023, Jian et al., 21 Apr 2025).
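The pairwise objective for the reward model is the standard Bradley–Terry cross-entropy, $-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$. A minimal, numerically stable sketch of that loss (function name ours):

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Binary cross-entropy over the pairwise reward difference:
    -log sigmoid(r_chosen - r_rejected).  Minimized when the reward
    model scores the preferred rollout above the rejected one.
    """
    diff = r_chosen - r_rejected
    # Stable form of log(1 + exp(-diff)), valid for either sign of diff.
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)


# The loss shrinks as the margin between preferred and rejected grows:
assert bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.5, 0.0)
```

In practice `r_chosen`/`r_rejected` are batched model outputs and the loss is averaged over comparison pairs before backpropagation.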
Feedback modeling can be extended to sequence-to-sequence reward models, enabling the policy to receive dense, token- or span-level feedback by comparing generated sequences to language-corrected versions, alleviating sparse credit assignment and reward gaming (Zhou et al., 2024, Wang et al., 28 May 2025).
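One simplified way to see how comparison against a language-corrected sequence yields dense credit (a diff-based stand-in for the learned seq2seq reward models above, not the cited method itself):

```python
import difflib


def token_rewards_from_correction(generated: list[str],
                                  corrected: list[str]) -> list[float]:
    """Assign a pseudo-reward to each generated token by diffing against
    a language-corrected version: tokens the correction keeps get +1,
    tokens it replaces or deletes get -1.  This turns a single corrected
    sequence into a token-level training signal.
    """
    rewards = [-1.0] * len(generated)
    matcher = difflib.SequenceMatcher(a=generated, b=corrected)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            rewards[i] = 1.0
    return rewards


gen = "the cat sat on teh mat".split()
fix = "the cat sat on the mat".split()
rewards = token_rewards_from_correction(gen, fix)
# only the misspelled token receives a negative reward
```

A learned seq2seq reward model plays the same role but generalizes beyond literal token matches.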
2.2 Group-Level and Off-Policy Refinement
Frameworks like GOLF leverage group-level language feedback, aggregating external critiques and intra-group alternative attempts for each problem instance (Huang et al., 4 Mar 2026). Aggregated negative feedback is used to condition the policy to generate high-quality refinements, which are then injected as off-policy scaffolds into RL optimization. This mitigates the vanishing gradient problem in sparse-reward regions and yields substantial gains in sample efficiency and exploration.
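The off-policy injection step can be sketched as a batch-mixing rule (an illustrative simplification; the refinement generation and any importance correction are elided, and the function and parameter names are ours):

```python
import random


def build_training_batch(on_policy: list[str], refinements: list[str],
                         mix_ratio: float = 0.25, seed: int = 0) -> list[str]:
    """Inject feedback-conditioned refinements as off-policy scaffolds:
    a fixed fraction of each batch is drawn from refinements generated
    under aggregated group feedback, the remainder from fresh on-policy
    rollouts, so sparse-reward regions still receive useful gradient.
    """
    rng = random.Random(seed)
    n_ref = int(len(on_policy) * mix_ratio)
    batch = rng.sample(on_policy, len(on_policy) - n_ref)
    batch += rng.sample(refinements, min(n_ref, len(refinements)))
    return batch


rollouts = [f"rollout-{i}" for i in range(8)]
refined = [f"refined-{i}" for i in range(4)]
batch = build_training_batch(rollouts, refined)
```

The key design choice is that refinements are high-quality but off-policy, so they scaffold exploration without replacing the on-policy gradient signal.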
2.3 Constraint-Augmented RL via Language Discriminators
Language can function as a constraint signal: natural-language feedback is distilled into a learned discriminator $D$ that predicts violation probabilities, which are penalized through a Lagrangian term in the RL objective (Zhang et al., 2020):

$$\max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t r_t\Big] \;-\; \lambda\, \mathbb{E}_{\tau \sim \pi_\theta}\big[D(\tau)\big]$$
The penalty is updated on a slower timescale, allowing soft, learned constraints that adapt to user preferences.
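The slower-timescale penalty update is a projected dual-ascent step; a minimal sketch, assuming a tolerated violation budget (the `budget` parameter and function name are ours):

```python
def update_lagrange_multiplier(lam: float, violation_rate: float,
                               budget: float, lr: float = 0.01) -> float:
    """Dual ascent on the constraint: raise the penalty weight when the
    discriminator-estimated violation rate exceeds the allowed budget,
    lower it otherwise, projecting back to lam >= 0.  Run at a slower
    timescale (smaller lr) than the policy update so the constraint
    adapts without destabilizing policy optimization.
    """
    return max(0.0, lam + lr * (violation_rate - budget))


lam = 1.0
lam = update_lagrange_multiplier(lam, violation_rate=0.4, budget=0.1)
assert lam > 1.0  # constraint currently violated, so the penalty grows
```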
2.4 Token- and Span-Level Update Propagation
Text2Grad demonstrates RL²F in the context of policy optimization from span-annotated language critiques (Wang et al., 28 May 2025). Free-form text feedback is aligned to output spans; these alignments are converted to token-level pseudo-rewards $\hat{r}_t$ and propagated through a natural-language-derived policy gradient:

$$\nabla_\theta J \;\approx\; \mathbb{E}\Big[\sum_t \hat{r}_t\, \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}, x)\Big]$$
This enables gradient updates targeted at specific problematic output fragments, yielding interpretable and highly localized policy refinement.
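The span-to-token conversion can be sketched as follows (a simplified illustration of the idea, not the Text2Grad implementation; the function name and span encoding are ours):

```python
def span_rewards(num_tokens: int,
                 spans: list[tuple[int, int, float]]) -> list[float]:
    """Convert span-aligned critique scores into token-level
    pseudo-rewards: each (start, end, score) span (end exclusive)
    stamps its score onto the tokens it covers; tokens no critique
    mentions default to a neutral 0.  The result weights per-token
    log-probability gradients, localizing the policy update to the
    fragments the feedback actually flagged.
    """
    rewards = [0.0] * num_tokens
    for start, end, score in spans:
        for i in range(start, min(end, num_tokens)):
            rewards[i] = score
    return rewards


# A critique flags tokens 2-4 as wrong and praises token 0:
per_token = span_rewards(6, [(2, 5, -1.0), (0, 1, 1.0)])
```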
2.5 Semantic Reward and Language Conditioning
In continuous control and robotics, RL²F can fully replace explicit numerical rewards with semantic similarity between a language description of the agent's state and a goal description (Liang et al., 8 Aug 2025). Pretrained sentence encoders (e.g., SBERT) map both to the same latent space, and the cosine similarity functions as the instantaneous reward.
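The reward computation itself reduces to a cosine similarity between the two sentence embeddings; a minimal sketch (embeddings here are plain vectors standing in for encoder outputs such as SBERT's):

```python
import math


def semantic_reward(state_emb: list[float], goal_emb: list[float]) -> float:
    """Cosine similarity between the embedding of a language description
    of the current state and the embedding of the goal description,
    used directly as the instantaneous reward: 1.0 when the state
    description semantically matches the goal, lower otherwise.
    """
    dot = sum(a * b for a, b in zip(state_emb, goal_emb))
    norm = math.hypot(*state_emb) * math.hypot(*goal_emb)
    return dot / norm if norm else 0.0


assert semantic_reward([1.0, 0.0], [1.0, 0.0]) == 1.0  # identical descriptions
```

Swapping the goal description re-targets the policy without re-engineering a numerical reward, which is what enables the goal transfer noted below.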
2.6 Mutual Enhancement and Bi-Directional Loops
Bi-directional RL–LLM feedback, as in AdaRefiner or collaborative planning (Zhang et al., 2023, Gu, 2024), creates an interleaved loop in which the agent supplies feedback to the LLM, which in turn adapts its guidance. The LLM teacher refines its own subgoal proposals based on comprehension scores derived from agent learning progress, supporting both adaptive prompt engineering and downstream policy improvement.
3. Applications and Empirical Outcomes
RL²F has been applied to a diverse range of domains, often establishing new state-of-the-art sample efficiency, final performance, or alignment characteristics.
- LLM Alignment and Post-Training: RL²F frameworks improve LLM alignment over direct preference optimization and scalar-reward RLHF, notably reducing pathological behaviors such as excessive refusals or length bias and enabling dense credit assignment at the token level (Zhou et al., 2024, Wang et al., 28 May 2025).
- Robotics and Embodied Agents: RL²F methods such as LAPP and Lafite-RL enable robots to acquire preference-driven, expressive behaviors and dexterous skills by leveraging LLM-generated trajectory-level preference labels or reward signals (Jian et al., 21 Apr 2025, Chu et al., 2023).
- Continuous and Fluid Control: RL driven entirely by semantic language rewards performs comparably to traditional numerical reward engineering, facilitating goal transfer and human interpretability (Liang et al., 8 Aug 2025).
- Exploration and Sparse Reward Regimes: Group-level refinement and language-guided off-policy scaffolding break exploration bottlenecks in sparse-reward reasoning, instruction following, and mathematical reasoning benchmarks (Huang et al., 4 Mar 2026, Li et al., 18 Oct 2025).
- Interactive Recommendation and Constraint Satisfaction: Natural language can be harnessed not only as a reward but as a soft constraint, steering interactive RL systems away from known violations and undesired behaviors (Zhang et al., 2020).
- Generalization and Adaptation: Diverse and informative language feedback—especially combining hindsight and foresight—improves generalization and rapid adaptation to new tasks in offline embodied RL (Xi et al., 2024, McCallum et al., 2023).
4. Experimental Protocols and Quantitative Findings
Empirical validation of RL²F consistently demonstrates accelerated convergence, higher peak metrics, and robustness to noisy or imperfect feedback across settings.
- Sample Efficiency: GOLF achieves 2.2× faster convergence to baseline performance than solely scalar reward methods in LLM RL tasks (Huang et al., 4 Mar 2026); LAPP attains 2–5× faster learning than PPO or static LLM-rewarded baselines on robot locomotion (Jian et al., 21 Apr 2025).
- Zero-Shot and OOD Performance: Seq2seq reward modeling and text-conditioned policies yield higher average GPT-4 win rates (up to 76.9%) and superior out-of-distribution accuracy compared to scalar-reward RLHF (Zhou et al., 2024).
- Fine-Grained Policy Improvement: Span-annotated language gradients improve task metrics (e.g., ROUGE, BLEU, BERTScore, pass@1) and accelerate convergence over PPO (Wang et al., 28 May 2025).
- Exploration and Pass@k: Group-level feedback boosts pass@k metrics in mathematical reasoning (+8.2% Pass@128), enabling escape from the all-zero-gradient traps of sparse-reward environments (Huang et al., 4 Mar 2026).
- Reduction of Policy Violations: Constraint-augmented RL²F reduces violation rates (e.g., negative sentiment in reviews) from 40% to 10.5% in text generation (Zhang et al., 2020).
- Robustness to Feedback Noise: Confidence-weighted, potential-based shaping from LLM feedback remains stable under heavy label noise, outperforming direct reward baselines even with 40–50% ranking errors (Lin et al., 2024).
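The robustness result above rests on the potential-based shaping guarantee: a shaping term of the form $\gamma\Phi(s') - \Phi(s)$ preserves the optimal policy for any potential $\Phi$, so noisy LLM judgments folded into $\Phi$ can affect learning speed but not the final solution. A minimal sketch with a confidence weight (the function signature is ours):

```python
def shaped_reward(env_reward: float, phi_s: float, phi_next: float,
                  confidence: float, gamma: float = 0.99) -> float:
    """Confidence-weighted, potential-based shaping from LLM feedback.

    The shaping term gamma * Phi(s') - Phi(s) is policy-invariant for
    any potential Phi, so LLM ranking noise folded into Phi (scaled by
    an estimated confidence in [0, 1]) cannot change the optimum.
    """
    shaping = gamma * phi_next - phi_s
    return env_reward + confidence * shaping


# With zero confidence the environment reward passes through unchanged:
assert shaped_reward(1.0, 5.0, -3.0, confidence=0.0) == 1.0
```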
5. Methodological Considerations and Design Tradeoffs
Key axes for method design in RL²F include:
- Feedback Granularity: Coarse (trajectory-level, scalar) versus fine (token/span-level, corrective language). Fine-grained signals afford denser credit assignment but may require higher annotation or computational cost.
- Source and Quality of Feedback: Human vs. automated (LLM) judges, frequency of feedback, diversity, informativeness, and robustness to hallucination or noise. Multi-querying and confidence estimation mitigate unreliable LLM feedback (Lin et al., 2024).
- Incorporation Mode: Whether feedback serves as direct reward shaping, constraint penalty, off-policy scaffolding, auxiliary prediction, or dense advantage estimation.
- Policy and Reward Model Architecture: Classical RL algorithms (PPO, GRPO), Decision Transformers, value models for continuous/discrete domains, or transformer-based token-level policies and critics.
- Online vs. Offline Feedback: Some frameworks (e.g., RLTF) explicitly support feedback only at training, with distilled or auxiliary modeling to enable test-time generalization (Song et al., 2 Feb 2026).
- Prompting and Summarization: Design of prompts for LLM feedback, abstraction/retrieval from experience pools, and structured summarization strongly influence effectiveness, especially for inter-sample generalization (Li et al., 18 Oct 2025).
6. Limitations, Open Problems, and Future Directions
Despite empirical progress, several challenges persist:
- Scalability of Feedback: Acquiring dense, high-quality human or LLM-generated feedback at scale, both for training dense reward models and for real-world closed-loop tasks (Wang et al., 28 May 2025).
- Feedback Robustness: Sensitivity to judge (LLM or human) bias or hallucination is an open risk. Methods leveraging multi-query consensus, confidence weighting, and adversarial LLM feedback filtering mitigate this but are not foolproof (Lin et al., 2024, Huang et al., 4 Mar 2026).
- Cost and Efficiency: Computational overhead from repeated feedback queries or large experience pools, especially in high-dimensional or multi-modal domains (Jian et al., 21 Apr 2025, Li et al., 18 Oct 2025).
- Generalization and Transfer: Adapting RL²F to domains with visual inputs or limited language alignment remains difficult; current systems mainly embed state as text or use separate encoders (Xi et al., 2024).
- Theoretical Foundations: Formal sample-complexity and convergence analyses for RL²F algorithms—especially with non-stationary, subjective, variable-granularity feedback—are ongoing (Song et al., 2 Feb 2026).
Future research directions highlighted include:
- Joint training or co-adaptation of LLMs and RL agents in bi-directional settings, e.g., via continual adaptation of both guidance and policy (Gu, 2024, Zhang et al., 2023).
- Multi-modal and hierarchical feedback, leveraging vision-LLMs and hierarchical RL (Liang et al., 8 Aug 2025, Xi et al., 2024).
- Robust, automated, or semi-supervised feedback collection to reduce cost and improve scalability (Li et al., 18 Oct 2025).
- Formal metrics of feedback informativeness and diversity, and their relationship to learning efficiency and generalization (Xi et al., 2024).
- Extension of RL²F to safety-critical domains, multi-objective settings, and high-risk exploration (Liang et al., 8 Aug 2025, Wang et al., 28 May 2025).
References
- GOLF: (Huang et al., 4 Mar 2026)
- Preference-based RL with LLMs: (Jian et al., 21 Apr 2025; Sun, 2023; Lin et al., 2024)
- Span-level RL with feedback: (Wang et al., 28 May 2025; Zhou et al., 2024)
- Constraint-augmented RL: (Zhang et al., 2020)
- Semantic reward: (Liang et al., 8 Aug 2025)
- Continuous/bidirectional RL–LLM co-adaptation: (Zhang et al., 2023; Gu, 2024; McCallum et al., 2023; Xi et al., 2024)
- Multi-turn feedback, feedback modeling, and RLTF: (Song et al., 2 Feb 2026)
- Language-and-numerical policy optimization: (Li et al., 18 Oct 2025)
- Robotics and manipulation: (Chu et al., 2023)