Reward Feedback Learning (ReFL)

Updated 7 December 2025
  • Reward Feedback Learning (ReFL) is an approach that uses human and proxy reward signals to fine-tune models and control policies for better alignment with human preferences.
  • It integrates a differentiable reward feedback loss with traditional training objectives, and applies to both sequential decision-making and generative modeling tasks.
  • Empirical results in image/video synthesis and RL demonstrate improved performance, and several ReFL variants additionally offer theoretical guarantees and mechanisms for mitigating reward hacking.

Reward Feedback Learning (ReFL) refers to a broad algorithmic paradigm in which reward or quality signals, typically derived from human feedback, learned proxy models, or high-level evaluators, are integrated directly into the training or fine-tuning of agents, generative models, or control policies. The fundamental goal of ReFL is to improve alignment between agent outputs (trajectories, images, or other model completions) and human preferences or task specifications, especially when hand-crafted or environmental rewards are sparse, misspecified, or unavailable. ReFL now encompasses a spectrum of approaches spanning both sequential decision-making (reinforcement learning, imitation learning) and large-scale generative modeling (diffusion models, text-to-image/video synthesis).

1. Core Concepts and Definitions

In canonical RLHF (Reinforcement Learning from Human Feedback), reward signals are modeled from human preference data, such as pairwise trajectory or completion comparisons. ReFL generalizes this by incorporating feedback of various types—pairwise preferences, scalar evaluations, language-based descriptions, or proxy learned rewards—into model training via differentiable or optimization-based losses, often in non-traditional RL settings (Xu et al., 2023, Yang et al., 8 Oct 2024, Metz et al., 28 Feb 2025). The overarching procedure typically involves three stages:

  • Reward Modeling: Fit a scalar reward function (or "reward model") $r_\phi$ mapping agent outputs and contexts to $\mathbb{R}$, using feedback datasets.
  • Feedback-Driven Optimization: Integrate $r_\phi$ into agent or generative model training as an explicit objective, loss component, or policy-shaping term.
  • Iterative or Alternating Training: Regularly retrain $r_\phi$ as new feedback or hard negatives are observed, optionally alternating with policy/model updates to prevent reward hacking (Wu et al., 23 May 2025).

A core distinction of ReFL is that the learned reward models are not merely used for evaluation or post-hoc filtering, but directly supply gradients or constraints within the main learning loop.
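
The interplay of these three stages can be made concrete with a short training-loop sketch. The Python snippet below is a minimal illustration under assumed interfaces: the generator, reward_model, and collect_feedback objects are hypothetical placeholders, not APIs from any of the cited works.

```python
# Minimal sketch of the three-stage ReFL loop (hypothetical interfaces).
# Assumed: reward_model(outputs) returns differentiable scalar rewards,
# generator.training_step() returns (outputs, task_loss), and
# collect_feedback(generator) returns freshly labeled preference data.

def refl_training_loop(generator, reward_model, collect_feedback,
                       num_rounds=10, policy_steps=1000, lam=1e-3):
    for _ in range(num_rounds):
        # Stage 1 -- reward modeling: (re)fit r_phi on the latest feedback,
        # including hard negatives produced by the current generator.
        feedback = collect_feedback(generator)
        reward_model.fit(feedback)

        # Stage 2 -- feedback-driven optimization: the learned reward enters
        # the main objective as a differentiable loss term.
        for _ in range(policy_steps):
            outputs, task_loss = generator.training_step()
            reward_loss = -lam * reward_model(outputs).mean()
            generator.update(task_loss + reward_loss)

        # Stage 3 -- alternation: returning to stage 1 each round lets the
        # reward model track the generator's distribution, limiting hacking.
```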

2. Methodological Families and Algorithms

ReFL approaches can be categorized by application domain and feedback integration mechanism:

A. Generative Models (Diffusion, Image/Video Synthesis)

  • Direct Reward Fine-Tuning: For text-to-image diffusion (e.g., ImageReward), ReFL introduces a "reward feedback loss", a differentiable penalty of the form $-\lambda\, r(y, g_\theta(y))$, which is backpropagated in conjunction with the standard denoising/reconstruction loss. Sampling a random late denoising step $t$ yields stable gradient estimates and prevents overfitting to the reward (Xu et al., 2023); a code sketch of this loss appears after this list.
  • Latent-Space ReFL for Video: In large-scale video generation, Process Reward Feedback Learning (PRFL) avoids pixel-space reward bottlenecks by constructing reward models that operate directly in the latent space at arbitrary denoising timesteps. The loss is $-\lambda R_\phi(\mathbf{x}_s, s, p)$, enabling both memory-efficient and temporally rich supervision, alternated with a supervised flow-matching objective for stability (Mi et al., 26 Nov 2025).
  • Fine-Grained Timestep Schemes: For image super-resolution and face restoration, ReFL components may be deployed piecewise: applying differentiable reward losses only during late denoising steps, and structure-preserving constraints (e.g., DWT LL-band) during early steps. Additional regularizers (e.g., Gram-KL distances, LPIPS, or KL from base parameters) prevent stylization artifacts or reward hacking (Sun et al., 4 Dec 2024, Wu et al., 23 May 2025).
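
The reward feedback loss in the diffusion setting can be sketched as follows, assuming PyTorch-style components. The unet, vae, scheduler, and reward_model objects and their methods (sample_to_step, denoise_step, predict_x0) are hypothetical stand-ins for the corresponding pieces of an ImageReward- or PRFL-style pipeline, not their actual APIs.

```python
import torch

def reward_feedback_loss(unet, vae, scheduler, reward_model,
                         prompt, prompt_emb, lam=1e-3, t_min=1, t_max=10):
    """Reward feedback loss -lambda * r(y, g_theta(y)) for diffusion fine-tuning.

    The returned term is added to the usual denoising/reconstruction loss.
    All component APIs used here are hypothetical placeholders.
    """
    # Sample a random late denoising step t; evaluating the reward here keeps
    # the differentiable path short and stabilizes gradient estimates.
    t = int(torch.randint(t_min, t_max + 1, (1,)))

    # Run the sampler down to step t without tracking gradients ...
    with torch.no_grad():
        latents = scheduler.sample_to_step(unet, prompt_emb, stop_at=t)
    # ... then take a single differentiable denoising step at t.
    latents = scheduler.denoise_step(unet, latents, t, prompt_emb)

    # One-step prediction of the clean latent, decoded for the reward model.
    x0_pred = scheduler.predict_x0(latents, t)
    images = vae.decode(x0_pred)

    # Differentiable reward penalty, backpropagated into the generator.
    return -lam * reward_model(prompt, images).mean()
```

In the latent-space PRFL variant, the decoding step is skipped and the reward model scores $\mathbf{x}_s$ directly at timestep $s$, which is what removes the pixel-space memory bottleneck.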

B. Sequential Decision-Making (RLHF, RL from LLM Feedback)

  • Reward-Weighted Policy Optimization: Parametric reward models $\sigma_\psi(s)$ (trained from pairwise feedback via cross-entropy) supply either direct rewards $r(s) = \sigma_\psi(s)$ or, for robustness, shaped rewards formed as temporal differences, $r'(s_t) = \sigma_\psi(s_t) - \sigma_\psi(s_{t-1})$, to avoid reinforcing noisy LLM predictions (Lin et al., 22 Oct 2024); a minimal implementation sketch follows this list.
  • Active Reward Querying and Sample-Efficient RL: Theory-driven frameworks such as ARL decouple environment exploration from reward querying, focusing human feedback on the most uncertain state-action pairs. This reduces total label complexity to $\widetilde{O}(H \dim_R^2)$, where $\dim_R$ is the reward function-class complexity, rather than scaling with the environment's size or the accuracy parameter $\epsilon$ (Kong et al., 2023).
  • Convex Offline Reward Learning: Linear-programming-based approaches invert the primal–dual optimality conditions of RL to identify the entire set of rewards compatible with demonstrations and pairwise feedback, always yielding a convex polyhedron. Pairwise comparisons introduce linear constraints, and the final policy is guaranteed $\epsilon$-optimal as $N \to \infty$ (Kim et al., 20 May 2024).
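
The shaped-reward variant above admits a compact implementation. The sketch below uses a hypothetical state-based reward network and shows only the temporal-difference shaping step, not the full pipeline of the cited work.

```python
import torch
import torch.nn as nn

class SigmaPsi(nn.Module):
    """Hypothetical state-based reward model sigma_psi(s) with outputs in (0, 1),
    trained separately with Bradley-Terry cross-entropy on pairwise feedback."""

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states).squeeze(-1)

def shaped_rewards(sigma_psi: SigmaPsi, states: torch.Tensor) -> torch.Tensor:
    """Temporal-difference shaping: r'(s_t) = sigma_psi(s_t) - sigma_psi(s_{t-1}).

    states: tensor of shape (T, state_dim) for a single trajectory.
    If sigma_psi is nearly constant (uninformative or noisy feedback), the
    shaped rewards are close to zero, so noisy predictions are not reinforced.
    """
    with torch.no_grad():
        sigma = sigma_psi(states)       # shape: (T,)
    return sigma[1:] - sigma[:-1]       # shape: (T - 1,)
```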

Table 1: Representative Loss Formulations in ReFL

| Approach | Objective / Gradient Signal | Reference |
| --- | --- | --- |
| Diffusion ReFL | $-\lambda r_\phi(\text{output})$ + denoising loss | (Xu et al., 2023) |
| PRFL (latent) | $-\lambda R_\phi(\mathbf{x}_s, s, p)$ + flow-matching loss | (Mi et al., 26 Nov 2025) |
| RLHF (RL) | Direct: $r(s) = \sigma_\psi(s)$; shaped: $r'(s_t) = \sigma_\psi(s_t) - \sigma_\psi(s_{t-1})$ | (Lin et al., 22 Oct 2024) |
| LP-ReFL | Optimize $r$ subject to Bellman, demonstration, and feedback constraints (linear program) | (Kim et al., 20 May 2024) |
| ARL | Active reward queries + optimistic planning | (Kong et al., 2023) |

3. Feedback Modalities and Modeling Techniques

ReFL extends beyond binary preferences to a spectrum of feedback mechanisms (Metz et al., 28 Feb 2025):

  • Scalar Ratings (e.g., evaluations on a $1$-to-$10$ scale)
  • Pairwise/Comparative (trajectory or cluster-level)
  • Demonstrations (full rollouts)
  • Corrections (demonstration as improvement over a given trajectory)
  • Descriptive Attributes or Cluster Rewards (via human or automated labeling)
  • Comparative Language Feedback (natural language descriptions of relative improvement) (Yang et al., 8 Oct 2024)

Reward modeling architectures and losses are matched to the feedback type: mean-squared error for scalar ratings, Bradley–Terry cross-entropy for pairwise preferences, and contrastive or cross-modal embedding spaces for language feedback. Recent works use Masksembles or other ensemble methods for calibrated uncertainty estimates.
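
As a concrete instance of the preference case, the Bradley–Terry objective reduces to a logistic loss on reward differences. The sketch below is a generic PyTorch-style implementation with an optional confidence label; it is illustrative rather than a reproduction of any cited paper's code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, preferred, rejected, confidence=None):
    """Bradley-Terry cross-entropy on pairwise preference data.

    preferred, rejected: batched inputs (trajectories, images, completions).
    confidence: optional soft labels in [0, 1] giving the annotator's
        confidence that `preferred` beats `rejected` (defaults to 1.0).
    """
    r_pos = reward_model(preferred)   # shape: (B,)
    r_neg = reward_model(rejected)    # shape: (B,)
    logits = r_pos - r_neg            # P(preferred > rejected) = sigmoid(logits)
    if confidence is None:
        confidence = torch.ones_like(logits)
    # Targets near 0.5 are uninformative and push the reward difference
    # toward zero (cf. the noisy-feedback tolerance discussed in Section 4).
    return F.binary_cross_entropy_with_logits(logits, confidence)
```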

Environments and RL settings often leverage joint or ensemble reward architectures to exploit the complementarity of feedback types. Adaptive or uncertainty-weighted integration is a promising research direction.

4. Theoretical Guarantees and Robustness

Several studies provide non-asymptotic, high-probability guarantees for ReFL in active, offline, or sample-limited regimes:

  • LP-Based Guarantees: Offline LP-based ReFL recovers a reward within $O(N^{-1/2})$ of the true reward in $\ell_\infty$-norm with high probability, yielding near-optimal policies (Kim et al., 20 May 2024); a toy linear-programming illustration follows this list.
  • Active Query Efficiency: ARL achieves $\epsilon$-optimality with $O(\dim_R^2)$ reward queries, far outpacing naive sampling approaches (Kong et al., 2023).
  • Partial Identifiability: When feedback is under-informative, the space of compatible rewards may be high-dimensional. Chebyshev-center- or minimax-based selection criteria, grounded in the downstream application’s loss geometry, improve robustness to feedback scarcity by minimizing worst-case planning error (Lazzati et al., 10 Jan 2025).
  • Noisy Feedback Tolerance: Confidence-weighted cross-entropy and potential-shaping schemes automatically degrade to zero reward when pairwise label noise causes preference confidence to approach $0.5$, preventing misleading reward shaping (Lin et al., 22 Oct 2024).
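
To make the linear-programming perspective concrete, the toy sketch below recovers a linear, feature-based reward from pairwise trajectory comparisons with scipy.optimize.linprog. The feature representation and max-margin formulation are illustrative simplifications; the full offline LP formulation additionally carries Bellman-flow and demonstration constraints.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_reward(phi_pref: np.ndarray, phi_rej: np.ndarray):
    """Toy LP: find weights w of a linear reward r(tau) = w . phi(tau) that
    maximize the smallest margin by which preferred trajectories beat
    rejected ones, subject to ||w||_inf <= 1.

    phi_pref, phi_rej: arrays of shape (num_pairs, feature_dim).
    """
    num_pairs, d = phi_pref.shape
    # Decision variables: [w_1, ..., w_d, margin]; linprog minimizes,
    # so maximize the margin by negating its coefficient.
    c = np.zeros(d + 1)
    c[-1] = -1.0
    # Each comparison adds one linear constraint:
    #   w . (phi_rej - phi_pref) + margin <= 0
    A_ub = np.hstack([phi_rej - phi_pref, np.ones((num_pairs, 1))])
    b_ub = np.zeros(num_pairs)
    bounds = [(-1.0, 1.0)] * d + [(0.0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(f"LP solver failed: {res.message}")
    return res.x[:d], res.x[-1]   # reward weights, achieved margin
```

A zero achieved margin indicates that no linear reward in the unit box strictly respects every comparison, i.e., the feedback is mutually inconsistent under this reward class.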

5. Applications and Empirical Outcomes

ReFL has been applied across a wide span of domains:

  • Text-to-Image Diffusion: Substantial improvements in human preference metrics, prompt adherence, and visual fidelity via direct reward-guided fine-tuning (Xu et al., 2023).
  • Text/Prompt-Conditioned Video Generation: Latent-space ReFL (PRFL) achieves a 1.42×–1.49× speedup in memory and compute alongside large gains (+46 to +56) in motion quality at 480p–720p resolution, with higher human-preference win rates than pixel-space reward variants (Mi et al., 26 Nov 2025).
  • Image Super-Resolution and Blind Face Restoration: Reward-based fine-tuning yields significant gains in perceptual quality, aesthetic scores, and identity preservation, validated on standard benchmarks and ablative studies (Sun et al., 4 Dec 2024, Wu et al., 23 May 2025).
  • Language-Guided Reward Learning: Comparative language feedback accelerates reward model learning (cross-entropy decreases 30–50% faster), raising subjective user ratings by 23.9% and reducing per-query human time by 11.3% compared to preference-only approaches (Yang et al., 8 Oct 2024).
  • MuJoCo/High-Dimensional RL: Scalar, demonstration, and descriptive feedback types can match or outperform pairwise baselines depending on noise regime and environment, with empirical analyses revealing that the reward-function correlation to ground truth is neither necessary nor sufficient for RL success (Metz et al., 28 Feb 2025).

6. Limitations, Open Challenges, and Future Directions

While ReFL frameworks offer broad improvements and theoretical rigor, multiple open challenges remain:

  • Reward Hacking and Model Drift: Fixed reward models can be exploited by agents; dynamic re-training, additional structural regularization, or adversarial batch selection are active countermeasures (Wu et al., 23 May 2025, Sun et al., 4 Dec 2024).
  • Feedback Modality Selection: No universal superiority exists for any feedback type; adaptive, uncertainty-guided querying and integration are expected to be crucial for future systems (Metz et al., 28 Feb 2025).
  • Scalability and Richness of Reward Models: For video, high-dimensional, multi-aspect reward functions (covering semantics, dynamics, aesthetics) require large, diverse preference datasets and flexible, robust modeling (Mi et al., 26 Nov 2025).
  • Partial Identifiability: With limited or ambiguous feedback, the optimal robust output (policy, reward, ranking) may lie outside the feedback-constrained feasible set; formalizing this via worst-case error geometry is an active area (Lazzati et al., 10 Jan 2025).
  • Human-in-the-Loop Complexity: While active sampling and potential-based shaping reduce resource demands, large-scale real-world data collection, effective interfaces, and accurate noise models remain bottlenecks (Kong et al., 2023, Lin et al., 22 Oct 2024).

ReFL is also usefully contrasted with adjacent paradigms:

  • Conventional RLHF: Focuses predominantly on pairwise preference modeling and policy optimization, and can suffer from reward misspecification and sample inefficiency.
  • Maximum Likelihood IRL: Nonconvex, computationally heavy, and sensitive to model class; in contrast, LP-based ReFL methods enjoy tractable, convex optimization with sample-efficiency guarantees (Kim et al., 20 May 2024).
  • Classifier-Free Guidance/Architectural Conditioning: While sometimes labeled as “reward feedback,” architectural or inference-time modifications such as first-frame conditioning or classifier-free guidance are distinct from true ReFL mechanisms, which must inject explicit reward-driven gradients into training (Chen et al., 2023).

ReFL continues to evolve, bridging the gap between data-driven preference modeling, robust optimization, and the practical reality of aligning powerful models to nuanced human values and task requirements.

