Pref-GUIDE: Preference-Based RL Framework
- Pref-GUIDE is a reinforcement learning framework that converts real-time human scalar feedback into pairwise preference data for robust policy optimization.
- It employs a moving window approach to generate O(n²) trajectory comparisons and aggregates multiple reward models via consensus voting to reduce noise.
- Experimental evaluations show that Pref-GUIDE Voting outperforms baseline methods by achieving expert-level performance in 80–90% of evaluator runs.
Pref-GUIDE is a framework for reinforcement learning (RL) agents that leverages real-time human feedback through preference-based methods to enable continual policy improvement, particularly when explicit dense reward functions are unavailable or difficult to specify. By systematically converting scalar feedback signals into preference data, Pref-GUIDE constructs robust reward models that maintain agent learning efficacy even after the period of direct human oversight, thereby addressing key limitations of prior real-time feedback-based RL approaches (Ji et al., 10 Aug 2025).
1. Motivation and Problem Setting
In many RL domains, direct specification of task objectives via dense, environment-supplied rewards is impractical or even impossible. Human-in-the-loop RL methods address this challenge by soliciting evaluative feedback from human observers during agent-environment interaction. In real-time setups, this feedback typically takes the form of a scalar signal $f_t$ provided by a human concurrently with agent action selection for each short trajectory segment $\sigma_t$. The dataset $\mathcal{D} = \{(\sigma_t, f_t)\}$ accumulates these temporally localized evaluations.
This setup introduces core challenges:
- Temporal inconsistency (feedback non-stationarity): Human evaluators shift their implicit criteria over time.
- Noise and unreliability: Human attention, fatigue, and subjective bias contribute stochasticity, resulting in low signal-to-noise ratio in the scalar feedback.
The objective is to use $\mathcal{D}$, gathered in a finite human-in-the-loop phase, to train a reward model $\hat{r}_\psi$ suitable for continued policy optimization (“post-human” RL), where human feedback is unavailable.
2. Framework Architecture and Data Transformation
Scalar Feedback to Preference Data
Pref-GUIDE introduces a multi-stage data transformation pipeline:
- Collection: During agent-environment rollout, real-time scalar feedback is collected, yielding the dataset $\mathcal{D} = \{(\sigma_t, f_t)\}$.
- Pref-GUIDE Individual: Each evaluator’s scalar feedback is locally converted into pairwise trajectory preferences. A moving window of length $W$ is employed—generating $\binom{W}{2}$ trajectory pairs per window. Preference labels $y_{ij}$ for a pair $(\sigma_i, \sigma_j)$ are assigned from the feedback difference:
  - $|f_i - f_j| < \epsilon$: ambiguous, label $y_{ij} = 0.5$ ($\epsilon$ is a no-preference margin defined relative to the feedback range)
  - $f_i - f_j \geq \epsilon$: strict preference for $\sigma_i$, $y_{ij} = 1$
  - $f_j - f_i \geq \epsilon$: strict preference for $\sigma_j$, $y_{ij} = 0$
- Pref-GUIDE Voting: Reward models trained from individual evaluators’ preferences are then aggregated: for each pair, each model “votes” for its preferred segment, and votes are averaged to yield a soft consensus label $\bar{y}_{ij} \in [0, 1]$, reflecting both the majority direction and degree of agreement.
- Reward Model Training: The final reward model is trained on the consensus preference dataset using a pairwise preference loss.
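The Individual conversion step above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `window=8` and `margin=0.1` are hypothetical placeholder values, and labels follow the convention that 1.0 prefers the first segment, 0.0 the second, and 0.5 marks an ambiguous pair inside the no-preference margin.

```python
from itertools import combinations

def scalar_to_preferences(feedback, window=8, margin=0.1):
    """Convert per-segment scalar feedback into pairwise preference labels.

    `window` and `margin` are illustrative, not the paper's settings.
    Returns (i, j, label) triples over segment indices: 1.0 prefers
    segment i, 0.0 prefers segment j, 0.5 marks an ambiguous pair.
    """
    prefs = []
    # Slide a window over the feedback sequence; compare only segments
    # that fall inside the same window (approximately stable criteria).
    for start in range(len(feedback) - window + 1):
        for i, j in combinations(range(start, start + window), 2):
            diff = feedback[i] - feedback[j]
            if abs(diff) < margin:
                label = 0.5   # inside the no-preference margin: ambiguous
            elif diff > 0:
                label = 1.0   # segment i strictly preferred
            else:
                label = 0.0   # segment j strictly preferred
            prefs.append((i, j, label))
    return prefs
```

Note that restricting pairs to a single window is what enforces temporal consistency: segments rated under different (drifted) evaluator criteria are never compared directly.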
Pipeline Overview Table
| Stage | Input | Output |
|---|---|---|
| Feedback Collection | Trajectory segments $\sigma_t$ | Scalar feedback $f_t$; dataset $\mathcal{D}$ |
| Pref-GUIDE Individual | $\mathcal{D}$ (per user) | Pairwise preferences $(\sigma_i, \sigma_j, y_{ij})$ |
| Pref-GUIDE Voting | Per-user reward models | Soft consensus labels $\bar{y}_{ij}$; consensus model $\hat{r}_\psi$ |
| Post-human RL | $\hat{r}_\psi$ | Policy optimization |
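The Voting stage of the pipeline can be sketched as below. The reward models are stand-ins for the per-evaluator models described in the text (here simply callables scoring a segment); the binary vote and mean aggregation follow the soft-consensus scheme described above.

```python
def consensus_label(reward_models, seg_i, seg_j):
    """Soft consensus label across per-evaluator reward models (a sketch).

    Each model casts a vote of 1.0 if it scores seg_i above seg_j,
    else 0.0; the mean vote is the soft label for training the final
    consensus reward model.
    """
    votes = [1.0 if m(seg_i) > m(seg_j) else 0.0 for m in reward_models]
    return sum(votes) / len(votes)
```

A label near 0.5 thus signals evaluator disagreement, which the pairwise loss naturally treats as a near-tie rather than a hard target.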
3. Preference-Based Reward Modeling
The reward model $\hat{r}_\psi$ accepts a temporally-localized trajectory segment: a stack of three visual observations plus the corresponding actions. This input is processed through a convolutional encoder followed by a multilayer perceptron (MLP), outputting a scalar reward prediction $\hat{r}_\psi(\sigma)$.
Training utilizes the Bradley–Terry pairwise preference loss. For any labeled trajectory pair $(\sigma_i, \sigma_j, y_{ij})$, define the preference probability $p_{ij} = \sigma\big(\hat{r}_\psi(\sigma_i) - \hat{r}_\psi(\sigma_j)\big)$ with the logistic sigmoid $\sigma(x) = 1/(1 + e^{-x})$. The loss per pair is:

$$\mathcal{L}(\psi) = -\big[\, y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \,\big],$$

aggregated over all labeled pairs. Optional weight decay regularization is added.
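A minimal per-pair version of this loss, written for scalar reward predictions rather than network outputs, looks like the following sketch. The soft labels from the voting stage (e.g. 0.5 for ties or split votes) drop in directly.

```python
import math

def bradley_terry_loss(r_i, r_j, label):
    """Pairwise Bradley-Terry loss for one labeled pair (a sketch).

    r_i, r_j: predicted rewards for the two segments;
    label: y in [0, 1] (hard 0/1 labels or soft consensus labels).
    """
    # Preference probability via the logistic sigmoid of the reward gap.
    p = 1.0 / (1.0 + math.exp(-(r_i - r_j)))
    # Binary cross-entropy against the (possibly soft) preference label.
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
```

When the two rewards are equal, $p = 0.5$ and a tie label of 0.5 gives the minimum achievable loss of $\ln 2$; a confidently correct ordering drives the loss toward zero.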
Ambiguity filtering via the $\epsilon$-margin is critical; it reduces overfitting to noisy, marginal feedback through conservative labeling, and ablation studies show that omitting this margin reduces post-human phase mean return by 20% (Ji et al., 10 Aug 2025).
4. Policy Optimization and Learning Dynamics
Agent policy learning employs the Deep Deterministic Policy Gradient (DDPG) algorithm. During the human-in-the-loop phase, the training reward combines the environment reward $r^{\text{env}}_t$ with the scalar human feedback $f_t$. In the post-human phase, DDPG receives rewards exclusively from the learned consensus reward model $\hat{r}_\psi$:
- $r_t = r^{\text{env}}_t + f_t$ during the human phase (only $f_t$ carries signal when $r^{\text{env}}$ is sparse); $r_t = \hat{r}_\psi(\sigma_t)$ afterward.
Standard DDPG losses for actor and critic are used, with the learned reward providing the optimization signal when human or environment feedback is unavailable.
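The phase-dependent reward routing can be sketched as a small helper. The additive combination and the `alpha`/`beta` weights are illustrative assumptions, not values reported in the paper; only the structure (human-phase mix, post-human handoff to the learned model) follows the text.

```python
def training_reward(env_reward, human_feedback, learned_reward,
                    human_phase, alpha=1.0, beta=1.0):
    """Reward fed to DDPG in each phase (illustrative weighting).

    During the human-in-the-loop phase, mix the (possibly sparse)
    environment reward with scalar human feedback; in the post-human
    phase, use only the learned consensus reward model's prediction.
    alpha and beta are hypothetical weights, not the paper's values.
    """
    if human_phase:
        return alpha * env_reward + beta * human_feedback
    return learned_reward
```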
5. Experimental Evaluation
Pref-GUIDE is evaluated on three RL environments: Bowling, Find Treasure, and Hide & Seek 1v1. Each features a period of human feedback (5–10 min) followed by a longer autonomous learning phase (15–50 min).
Performance is primarily measured by:
- Mean episodic return, averaged over periodic evaluation rollouts.
- Fraction of evaluators for whom the agent reaches specified performance milestones.
Post-human Phase Mean Return (Table 1):
| Method | Bowling (↑) | Find Treasure (↑) | Hide & Seek 1v1 (↑) |
|---|---|---|---|
| DDPG (sparse env.) | 85 ± 12 | 60 ± 8 | 70 ± 9 |
| DDPG Heuristic (dense) | — | 180 ± 15 | 190 ± 12 |
| GUIDE (scalar regression) | 150 ± 18 | 140 ± 20 | 160 ± 17 |
| Pref-GUIDE Individual | 180 ± 15 | 170 ± 18 | 185 ± 16 |
| Pref-GUIDE Voting | 200 ± 10 | 210 ± 12 | 220 ± 11 |
At 50 minutes into the post-human phase, Pref-GUIDE Voting achieves expert-level performance in 80–90% of evaluator runs, compared to ~60% for Pref-GUIDE Individual and ~40% for scalar regression (GUIDE).
Ablation studies indicate that removing the moving window or no-preference margin substantially degrades performance (–25% and –20% return respectively). Consensus voting outperforms naive binary majority or pooled-data reward models by approximately 15% (Ji et al., 10 Aug 2025).
6. Comparative Analysis and Methodological Advantages
Pref-GUIDE's principal methodological advantages over scalar regression (GUIDE) include:
- Temporal Consistency: The moving window approach ensures comparisons are confined to periods of approximately stable human criteria.
- Data Efficiency: Each window of $W$ scalar feedback points yields $\binom{W}{2}$ pairwise samples, greatly increasing training signal density.
- Noise Resilience: The $\epsilon$-margin labeling reduces sensitivity to stochastic, low-confidence evaluator responses.
- Population Robustness: Pref-GUIDE Voting aggregates across user populations, mitigating idiosyncratic biases and attenuating evaluator-specific noise.
These elements collectively produce robust reward models that demonstrably generalize beyond both scalar regression baselines and expert-crafted dense reward signals, as evidenced in empirical benchmarks.
7. Limitations and Future Directions
Pref-GUIDE currently employs fixed settings for the window length $W$ and no-preference margin $\epsilon$; research into adaptive or learned parameterization could enhance robustness across diverse feedback regimes. The visual encoder used in $\hat{r}_\psi$ is fixed during reward model training; joint fine-tuning may enable richer representation learning. Extending Pref-GUIDE to integrate richer human feedback modalities (e.g., verbal annotations) or scale to multi-agent/robotics domains is an open avenue for subsequent exploration (Ji et al., 10 Aug 2025).