Pref-GUIDE: Preference-Based RL Framework
- Pref-GUIDE is a reinforcement learning framework that converts real-time human scalar feedback into pairwise preference data for robust policy optimization.
- It employs a moving window approach to generate O(n²) trajectory comparisons and aggregates multiple reward models via consensus voting to reduce noise.
- Experimental evaluations show that Pref-GUIDE Voting outperforms baseline methods by achieving expert-level performance in 80–90% of evaluator runs.
Pref-GUIDE is a framework for reinforcement learning (RL) agents that leverages real-time human feedback through preference-based methods to enable continual policy improvement, particularly when explicit dense reward functions are unavailable or difficult to specify. By systematically converting scalar feedback signals into preference data, Pref-GUIDE constructs robust reward models that maintain agent learning efficacy even after the period of direct human oversight, thereby addressing key limitations of prior real-time feedback-based RL approaches (Ji et al., 10 Aug 2025).
1. Motivation and Problem Setting
In many RL domains, direct specification of task objectives via dense, environment-supplied rewards is impractical or even impossible. Human-in-the-loop RL methods address this challenge by soliciting evaluative feedback from human observers during agent-environment interaction. In real-time setups, this feedback typically takes the form of a scalar signal $f_t$ provided by a human concurrently with agent action selection for each short trajectory segment $\sigma_t$. The dataset $\mathcal{D} = \{(\sigma_t, f_t)\}$ accumulates these temporally localized evaluations.
This setup introduces core challenges:
- Temporal inconsistency (feedback non-stationarity): Human evaluators shift their implicit criteria over time.
- Noise and unreliability: Human attention, fatigue, and subjective bias contribute stochasticity, resulting in low signal-to-noise ratio in the scalar feedback.
The objective is to use $\mathcal{D}$, gathered in a finite human-in-the-loop phase, to train a reward model $\hat{r}_\psi$ suitable for continued policy optimization (“post-human” RL), where human feedback is unavailable.
2. Framework Architecture and Data Transformation
Scalar Feedback to Preference Data
Pref-GUIDE introduces a multi-stage data transformation pipeline:
- Collection: During agent-environment rollout, real-time scalar feedback is collected, yielding the dataset $\mathcal{D} = \{(\sigma_t, f_t)\}$.
- Pref-GUIDE Individual: Each evaluator’s scalar feedback is locally converted into pairwise trajectory preferences. A moving window of length $W$ is employed—generating $\binom{W}{2}$ trajectory pairs per window. Preference labels $y_{ij}$ for a pair $(\sigma_i, \sigma_j)$ are assigned from the feedback difference:
  - $|f_i - f_j| < \epsilon$: ambiguous, label $y_{ij} = 0.5$ ($\epsilon$ is a no-preference margin defined relative to the feedback range)
  - $f_i - f_j \geq \epsilon$: strict preference for $\sigma_i$, $y_{ij} = 1$
  - $f_j - f_i \geq \epsilon$: strict preference for $\sigma_j$, $y_{ij} = 0$
- Pref-GUIDE Voting: Reward models trained from individual evaluators’ preferences are then aggregated: for each pair, each model “votes” for its preferred segment, and votes are averaged to yield a soft consensus label $\bar{y}_{ij} \in [0, 1]$, reflecting both the majority direction and degree of agreement.
- Reward Model Training: The final reward model is trained on the consensus preference dataset using a pairwise preference loss.
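The Individual conversion step above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `window=8` and `margin=0.1` are hypothetical placeholder values, and labels follow the convention that 1.0 prefers the first segment, 0.0 the second, and 0.5 marks an ambiguous pair inside the no-preference margin.

```python
from itertools import combinations

def scalar_to_preferences(feedback, window=8, margin=0.1):
    """Convert per-segment scalar feedback into pairwise preference labels.

    `window` and `margin` are illustrative, not the paper's settings.
    Returns (i, j, label) triples over segment indices: 1.0 prefers
    segment i, 0.0 prefers segment j, 0.5 marks an ambiguous pair.
    """
    prefs = []
    # Slide a window over the feedback sequence; compare only segments
    # that fall inside the same window (approximately stable criteria).
    for start in range(len(feedback) - window + 1):
        for i, j in combinations(range(start, start + window), 2):
            diff = feedback[i] - feedback[j]
            if abs(diff) < margin:
                label = 0.5   # inside the no-preference margin: ambiguous
            elif diff > 0:
                label = 1.0   # segment i strictly preferred
            else:
                label = 0.0   # segment j strictly preferred
            prefs.append((i, j, label))
    return prefs
```

Note that restricting pairs to a single window is what enforces temporal consistency: segments rated under different (drifted) evaluator criteria are never compared directly.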
Pipeline Overview Table
| Stage | Input | Output |
|---|---|---|
| Feedback Collection | Trajectory segments $\sigma_t$ | Scalar feedback $f_t$; dataset $\mathcal{D}$ |
| Pref-GUIDE Individual | $\mathcal{D}$ (per user) | Pairwise preferences $(\sigma_i, \sigma_j, y_{ij})$ |
| Pref-GUIDE Voting | Per-user reward models | Soft consensus labels $\bar{y}_{ij}$; consensus model $\hat{r}_\psi$ |
| Post-human RL | $\hat{r}_\psi$ | Policy optimization |
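The Voting stage of the pipeline can be sketched as below. The reward models are stand-ins for the per-evaluator models described in the text (here simply callables scoring a segment); the binary vote and mean aggregation follow the soft-consensus scheme described above.

```python
def consensus_label(reward_models, seg_i, seg_j):
    """Soft consensus label across per-evaluator reward models (a sketch).

    Each model casts a vote of 1.0 if it scores seg_i above seg_j,
    else 0.0; the mean vote is the soft label for training the final
    consensus reward model.
    """
    votes = [1.0 if m(seg_i) > m(seg_j) else 0.0 for m in reward_models]
    return sum(votes) / len(votes)
```

A label near 0.5 thus signals evaluator disagreement, which the pairwise loss naturally treats as a near-tie rather than a hard target.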
3. Preference-Based Reward Modeling
The reward model $\hat{r}_\psi$ accepts a temporally-localized trajectory segment: a stack of three visual observations plus the corresponding actions. This input is processed through a convolutional encoder followed by a multilayer perceptron (MLP), outputting a scalar reward prediction $\hat{r}_\psi(\sigma)$.
Training utilizes the Bradley–Terry pairwise preference loss. For any labeled trajectory pair $(\sigma_i, \sigma_j, y_{ij})$, define the preference probability $p_{ij} = \sigma\big(\hat{r}_\psi(\sigma_i) - \hat{r}_\psi(\sigma_j)\big)$ with the logistic sigmoid $\sigma(x) = 1/(1 + e^{-x})$. The loss per pair is:

$$\mathcal{L}(\psi) = -\big[\, y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \,\big],$$

aggregated over all labeled pairs. Optional weight decay regularization is added.
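A minimal per-pair version of this loss, written for scalar reward predictions rather than network outputs, looks like the following sketch. The soft labels from the voting stage (e.g. 0.5 for ties or split votes) drop in directly.

```python
import math

def bradley_terry_loss(r_i, r_j, label):
    """Pairwise Bradley-Terry loss for one labeled pair (a sketch).

    r_i, r_j: predicted rewards for the two segments;
    label: y in [0, 1] (hard 0/1 labels or soft consensus labels).
    """
    # Preference probability via the logistic sigmoid of the reward gap.
    p = 1.0 / (1.0 + math.exp(-(r_i - r_j)))
    # Binary cross-entropy against the (possibly soft) preference label.
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
```

When the two rewards are equal, $p = 0.5$ and a tie label of 0.5 gives the minimum achievable loss of $\ln 2$; a confidently correct ordering drives the loss toward zero.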
Ambiguity filtering via the $\epsilon$-margin is critical; it reduces overfitting to noisy, marginal feedback through conservative labeling, and ablation studies show that omitting this margin reduces post-human phase mean return by 20% (Ji et al., 10 Aug 2025).
4. Policy Optimization and Learning Dynamics
Agent policy learning employs the Deep Deterministic Policy Gradient (DDPG) algorithm. During the human-in-the-loop phase, the training reward combines the environment reward $r^{\text{env}}_t$ with the scalar human feedback $f_t$. In the post-human phase, DDPG receives rewards exclusively from the learned consensus reward model $\hat{r}_\psi$:
- $r_t = r^{\text{env}}_t + f_t$ during the human phase (only $f_t$ carries signal when $r^{\text{env}}$ is sparse); $r_t = \hat{r}_\psi(\sigma_t)$ afterward.
Standard DDPG losses for actor and critic are used, with the learned reward providing the optimization signal when human or environment feedback is unavailable.
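The phase-dependent reward routing can be sketched as a small helper. The additive combination and the `alpha`/`beta` weights are illustrative assumptions, not values reported in the paper; only the structure (human-phase mix, post-human handoff to the learned model) follows the text.

```python
def training_reward(env_reward, human_feedback, learned_reward,
                    human_phase, alpha=1.0, beta=1.0):
    """Reward fed to DDPG in each phase (illustrative weighting).

    During the human-in-the-loop phase, mix the (possibly sparse)
    environment reward with scalar human feedback; in the post-human
    phase, use only the learned consensus reward model's prediction.
    alpha and beta are hypothetical weights, not the paper's values.
    """
    if human_phase:
        return alpha * env_reward + beta * human_feedback
    return learned_reward
```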
5. Experimental Evaluation
Pref-GUIDE is evaluated on three RL environments: Bowling, Find Treasure, and Hide & Seek 1v1. Each features a period of human feedback (5–10 min) followed by a longer autonomous learning phase (15–50 min).
Performance is primarily measured by:
- Mean episodic return, averaged over periodic evaluation rollouts.
- Fraction of evaluators for whom the agent reaches specified performance milestones.
Post-human Phase Mean Return (Table 1):
| Method | Bowling (↑) | Find Treasure (↑) | Hide & Seek 1v1 (↑) |
|---|---|---|---|
| DDPG (sparse env.) | 85 ± 12 | 60 ± 8 | 70 ± 9 |
| DDPG Heuristic (dense) | — | 180 ± 15 | 190 ± 12 |
| GUIDE (scalar regression) | 150 ± 18 | 140 ± 20 | 160 ± 17 |
| Pref-GUIDE Individual | 180 ± 15 | 170 ± 18 | 185 ± 16 |
| Pref-GUIDE Voting | 200 ± 10 | 210 ± 12 | 220 ± 11 |
At 50 minutes into the post-human phase, Pref-GUIDE Voting achieves expert-level performance in 80–90% of evaluator runs, compared to ~60% for Pref-GUIDE Individual and ~40% for scalar regression (GUIDE).
Ablation studies indicate that removing the moving window or no-preference margin substantially degrades performance (–25% and –20% return respectively). Consensus voting outperforms naive binary majority or pooled-data reward models by approximately 15% (Ji et al., 10 Aug 2025).
6. Comparative Analysis and Methodological Advantages
Pref-GUIDE's principal methodological advantages over scalar regression (GUIDE) include:
- Temporal Consistency: The moving window approach ensures comparisons are confined to periods of approximately stable human criteria.
- Data Efficiency: Each window of $W$ scalar feedback points yields $\binom{W}{2}$ pairwise samples, greatly increasing training signal density.
- Noise Resilience: The $\epsilon$-margin labeling reduces sensitivity to stochastic, low-confidence evaluator responses.
- Population Robustness: Pref-GUIDE Voting aggregates across user populations, mitigating idiosyncratic biases and attenuating evaluator-specific noise.
These elements collectively produce robust reward models that demonstrably generalize beyond both scalar regression baselines and expert-crafted dense reward signals, as evidenced in empirical benchmarks.
7. Limitations and Future Directions
Pref-GUIDE currently employs fixed settings for the window length $W$ and no-preference margin $\epsilon$; research into adaptive or learned parameterization could enhance robustness across diverse feedback regimes. The visual encoder used in $\hat{r}_\psi$ is fixed during reward model training; joint fine-tuning may enable richer representation learning. Extending Pref-GUIDE to integrate richer human feedback modalities (e.g., verbal annotations) or scale to multi-agent/robotics domains is an open avenue for subsequent exploration (Ji et al., 10 Aug 2025).