
Pref-GUIDE: Preference-Based RL Framework

Updated 23 November 2025
  • Pref-GUIDE is a reinforcement learning framework that converts real-time human scalar feedback into pairwise preference data for robust policy optimization.
  • It employs a moving window approach to generate O(n²) trajectory comparisons and aggregates multiple reward models via consensus voting to reduce noise.
  • Experimental evaluations show that Pref-GUIDE Voting outperforms baseline methods by achieving expert-level performance in 80–90% of evaluator runs.

Pref-GUIDE is a framework for reinforcement learning (RL) agents that leverages real-time human feedback through preference-based methods to enable continual policy improvement, particularly when explicit dense reward functions are unavailable or difficult to specify. By systematically converting scalar feedback signals into preference data, Pref-GUIDE constructs robust reward models that maintain agent learning efficacy even after the period of direct human oversight, thereby addressing key limitations of prior real-time feedback-based RL approaches (Ji et al., 10 Aug 2025).

1. Motivation and Problem Setting

In many RL domains, direct specification of task objectives via dense, environment-supplied rewards ($r_{\text{env}}$) is impractical or even impossible. Human-in-the-loop RL methods address this challenge by soliciting evaluative feedback from human observers during agent-environment interaction. In real-time setups, this feedback typically takes the form of a scalar signal $f \in [-1, 1]$ provided by a human concurrently with agent action selection for each short trajectory segment $\tau_k$. The dataset $\mathcal{D}_{\text{real}} = \{(\tau_i, f_i)\}_{i=1}^N$ accumulates these temporally localized evaluations.

This setup introduces core challenges:

  • Temporal inconsistency (feedback non-stationarity): Human evaluators shift their implicit criteria over time.
  • Noise and unreliability: Human attention, fatigue, and subjective bias contribute stochasticity, resulting in low signal-to-noise ratio in the scalar feedback.

The objective is to use $\mathcal{D}_{\text{real}}$, gathered in a finite human-in-the-loop phase, to train a reward model $r_\theta(\tau)$ suitable for continued policy optimization ("post-human" RL), where human feedback is unavailable.

2. Framework Architecture and Data Transformation

Scalar Feedback to Preference Data

Pref-GUIDE introduces a multi-stage data transformation pipeline:

  1. Collection: During agent-environment rollout, real-time scalar feedback is collected, yielding $\mathcal{D}_{\text{real}}$.
  2. Pref-GUIDE Individual: Each evaluator's scalar feedback is locally converted into pairwise trajectory preferences. A moving window of length $n$ (e.g., $n = 10$) is employed, generating $O(n^2)$ trajectory pairs per window. Preference labels are assigned as follows:
    • $|f^A - f^B| < \delta$: ambiguous, label $y = 0.5$ (no-preference margin $\delta = 5\%$ of the feedback range)
    • $f^A > f^B$: strict preference for $A$, $y = 1$
    • $f^A < f^B$: strict preference for $B$, $y = 0$
  3. Pref-GUIDE Voting: Reward models $r_\theta^{(j)}$ trained from individual evaluators' preferences are then aggregated: for each pair, each model "votes" for its preferred segment, and votes are averaged to yield a soft consensus label $y_{\text{vote}} \in [0, 1]$, reflecting both the majority direction and the degree of agreement.
  4. Reward Model Training: The final reward model $R_\theta$ is trained on the consensus preference dataset using a pairwise preference loss.
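The scalar-to-preference conversion in step 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the stride-1 overlapping windows, and the deduplication of repeated pairs are assumptions; only the window length, the $O(n^2)$ pairing, and the $\delta$-margin labeling rule come from the text.

```python
import itertools

def scalar_to_preferences(feedback, window=10, delta=0.1):
    """Convert per-segment scalar feedback into pairwise preference labels.

    feedback: list of scalars in [-1, 1], one per trajectory segment.
    window:   moving-window length n; only segments within the same window
              are compared (temporal consistency).
    delta:    no-preference margin (5% of the 2-unit feedback range = 0.1).
    Returns a dict {(i, j): y} with y in {0.0, 0.5, 1.0}: y=1 prefers
    segment i, y=0 prefers segment j, y=0.5 marks an ambiguous pair.
    """
    pairs = {}
    for start in range(max(len(feedback) - window + 1, 1)):
        idx = range(start, min(start + window, len(feedback)))
        # O(n^2) pairs per window; overlapping windows are deduplicated
        for i, j in itertools.combinations(idx, 2):
            if (i, j) in pairs:
                continue
            diff = feedback[i] - feedback[j]
            if abs(diff) < delta:
                pairs[(i, j)] = 0.5   # ambiguous: conservative soft label
            elif diff > 0:
                pairs[(i, j)] = 1.0   # strict preference for segment i
            else:
                pairs[(i, j)] = 0.0   # strict preference for segment j
    return pairs
```

For example, feedback `[0.0, 0.5, 0.52, -0.3]` with `window=4` yields six pairs, with the near-tie `(1, 2)` labeled 0.5 by the margin rule.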

Pipeline Overview Table

| Stage | Input | Output |
| --- | --- | --- |
| Feedback Collection | Trajectory, $f$ | $\mathcal{D}_{\text{real}}$ |
| Pref-GUIDE Individual | $\mathcal{D}_{\text{real}}$ (per user) | $\mathcal{D}_{\text{pref}}^{(j)}$, $r_\theta^{(j)}$ |
| Pref-GUIDE Voting | $\{r_\theta^{(j)}\}$ | $\mathcal{D}_{\text{pref\_vote}}$, $R_\theta$ |
| Post-human RL | $R_\theta$ | Policy optimization |
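The voting stage of the pipeline can be sketched compactly. This is a hedged illustration: the strict `>` vote rule and the absence of explicit tie handling are assumptions, and `reward_models` stands in for the per-evaluator models $r_\theta^{(j)}$; only the vote-then-average scheme producing a soft label in $[0, 1]$ is from the source.

```python
def consensus_label(reward_models, seg_a, seg_b):
    """Soft consensus label for one trajectory pair (Pref-GUIDE Voting).

    reward_models: list of callables, one learned reward model per
                   evaluator, each mapping a segment to a scalar reward.
    Each model votes 1.0 if it scores seg_a above seg_b, else 0.0; the
    mean vote in [0, 1] encodes both the majority direction and the
    degree of agreement across evaluators.
    """
    votes = [1.0 if r(seg_a) > r(seg_b) else 0.0 for r in reward_models]
    return sum(votes) / len(votes)
```

With three toy models of which two prefer `seg_a`, the consensus label is 2/3, i.e., a majority for $A$ with visible disagreement rather than a hard binary label.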

3. Preference-Based Reward Modeling

The reward model accepts a temporally-localized trajectory segment: a stack of three visual observations plus the corresponding actions. This input is processed through a convolutional encoder followed by a multilayer perceptron (MLP), outputting a scalar reward prediction $r_\theta(\tau)$.

Training utilizes the Bradley–Terry pairwise preference loss. For any labeled trajectory pair $(\tau_i^A, \tau_i^B, y_i)$, define $\Delta_i = r_\theta(\tau_i^A) - r_\theta(\tau_i^B)$, with $\sigma(x)$ the logistic sigmoid. The loss per pair is:

$$L_{\text{pref}}\left(\theta; (\tau_i^A, \tau_i^B, y_i)\right) = -\, y_i \log \sigma(\Delta_i) - (1 - y_i) \log\left(1 - \sigma(\Delta_i)\right),$$

aggregated over all $M$ labeled pairs. Optional $L_2$ weight decay regularization is added.
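The per-pair loss above translates directly into code. This is a minimal sketch with a hypothetical function name; the small `eps` floor inside the logarithms is a standard numerical-stability assumption, not something stated in the source.

```python
import math

def pref_loss(r_a, r_b, y):
    """Bradley-Terry pairwise preference loss for one labeled pair.

    r_a, r_b: predicted scalar rewards for segments A and B.
    y:        soft preference label in [0, 1] (0.5 = no preference).
    """
    delta = r_a - r_b
    p = 1.0 / (1.0 + math.exp(-delta))   # sigma(Delta) = P(A preferred)
    eps = 1e-12                          # assumed numerical floor for log
    return -y * math.log(p + eps) - (1 - y) * math.log(1 - p + eps)
```

Sanity checks: equal predicted rewards with a no-preference label give the maximum-entropy loss $\log 2$, while a large reward gap agreeing with a strict label drives the loss toward zero.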

Ambiguity filtering via the $\delta$-margin is critical; it reduces overfitting to noisy, marginal feedback through conservative labeling, and ablation studies show that omitting this margin reduces post-human phase mean return by 20% (Ji et al., 10 Aug 2025).

4. Policy Optimization and Learning Dynamics

Agent policy learning employs the Deep Deterministic Policy Gradient (DDPG) algorithm. The reward signal during the human-in-the-loop phase combines environmental reward and scalar human feedback ($r_{\text{env}} + \alpha f$). In the post-human phase, DDPG receives rewards exclusively from the learned consensus reward model $R_\theta$:

  • $r_{\text{human}}(\tau) = R_\theta(\tau)$
  • $r_{\text{total}} = r_{\text{env}} + \beta r_{\text{human}}$ (typically $\beta = 1$, and only $r_{\text{human}}$ is used when $r_{\text{env}}$ is sparse)

Standard DDPG losses for actor $\pi_\phi$ and critic $Q_w$ are used, with the learned reward providing the optimization signal when human or environment feedback is unavailable.
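The phase-dependent reward composition can be sketched as a single dispatch function. This is an illustrative sketch, not the paper's code: the function name, the `human_in_loop` flag, and the `r_model` callable are assumptions; the two formulas it switches between are the ones given above.

```python
def total_reward(r_env, scalar_feedback, r_model, segment,
                 human_in_loop, alpha=1.0, beta=1.0):
    """Reward fed to DDPG in each training phase.

    Human-in-the-loop phase: environment reward plus scaled real-time
    scalar feedback, r_env + alpha * f.
    Post-human phase: the learned consensus reward model R_theta
    (here the callable `r_model`) replaces the human,
    r_total = r_env + beta * r_model(segment).
    """
    if human_in_loop:
        return r_env + alpha * scalar_feedback
    return r_env + beta * r_model(segment)
```

When $r_{\text{env}}$ is sparse, the same function applies with the environment term effectively zero, so the learned model carries the entire optimization signal.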

5. Experimental Evaluation

Pref-GUIDE is evaluated on three RL environments: Bowling, Find Treasure, and Hide & Seek 1v1. Each features a period of human feedback (5–10 min) followed by a longer autonomous learning phase (15–50 min).

Performance is primarily measured by:

  • Mean episodic return, averaged over periodic evaluation rollouts.
  • Fraction of evaluators for whom the agent reaches specified performance milestones.

Post-human Phase Mean Return (Table 1):

| Method | Bowling (↑) | Find Treasure (↑) | Hide & Seek 1v1 (↑) |
| --- | --- | --- | --- |
| DDPG (sparse env.) | 85 ± 12 | 60 ± 8 | 70 ± 9 |
| DDPG Heuristic (dense) | 180 ± 15 | 190 ± 12 | |
| GUIDE (scalar regression) | 150 ± 18 | 140 ± 20 | 160 ± 17 |
| Pref-GUIDE Individual | 180 ± 15 | 170 ± 18 | 185 ± 16 |
| Pref-GUIDE Voting | 200 ± 10 | 210 ± 12 | 220 ± 11 |

At 50 minutes into the post-human phase, Pref-GUIDE Voting achieves expert-level performance in 80–90% of evaluator runs, compared to ~60% for Pref-GUIDE Individual and ~40% for scalar regression (GUIDE).

Ablation studies indicate that removing the moving window or no-preference margin substantially degrades performance (–25% and –20% return respectively). Consensus voting outperforms naive binary majority or pooled-data reward models by approximately 15% (Ji et al., 10 Aug 2025).

6. Comparative Analysis and Methodological Advantages

Pref-GUIDE's principal methodological advantages over scalar regression (GUIDE) include:

  • Temporal Consistency: The moving window approach ensures comparisons are confined to periods of approximately stable human criteria.
  • Data Efficiency: Each scalar feedback point yields $O(n^2)$ pairwise samples per window, greatly increasing training signal density.
  • Noise Resilience: The $\delta$-margin label reduces sensitivity to stochastic, low-confidence evaluator responses.
  • Population Robustness: Pref-GUIDE Voting aggregates across user populations, mitigating idiosyncratic biases and attenuating evaluator-specific noise.

These elements collectively produce robust reward models that demonstrably generalize beyond both scalar regression baselines and expert-crafted dense reward signals, as evidenced in empirical benchmarks.

7. Limitations and Future Directions

Pref-GUIDE currently employs fixed settings for window length and no-preference margin; research into adaptive or learned parameterization could enhance robustness across diverse feedback regimes. The visual encoder used in $r_\theta$ is fixed during reward model training; joint fine-tuning may enable richer representation learning. Extending Pref-GUIDE to integrate richer human feedback modalities (e.g., verbal annotations) or scale to multi-agent/robotics domains is an open avenue for subsequent exploration (Ji et al., 10 Aug 2025).
