Single-Trajectory RLHF: Efficiency & Regularization
- Single-trajectory RLHF is an approach that processes human feedback in a single pass, reducing memory and computational costs compared to traditional methods.
- It leverages one-pass online mirror descent for rapid reward model updates and applies a contextual preference bandit formulation for LLMs and generative models.
- The method incorporates distributional regularization to prevent reward hacking, with empirical results showing competitive performance at lower training costs.
Single-trajectory Reinforcement Learning from Human Feedback (RLHF) refers to algorithmic frameworks in which the learning agent processes human preference feedback in a single, sequential data pass, or reoptimizes reward and/or policy parameters given just the latest interaction, thus dispensing with the need to repeatedly revisit historical data. This paradigm enables constant-time, memory-efficient streaming RLHF, and is distinct from conventional RLHF frameworks that require growing storage and computational costs tied to all past samples. Single-trajectory RLHF methodologies have been developed both for LLMs via contextual preference bandit formulations and for generative models such as diffusion and consistency models. Recent advancements offer theoretical guarantees, algorithmic innovations, and empirical results demonstrating competitiveness or superiority over traditional, multi-pass, or policy-gradient-based RLHF approaches (Li et al., 11 Feb 2025, Shekhar et al., 8 Mar 2025).
1. Formalization and Contextual Preference Bandit Structure
Single-trajectory RLHF for LLMs is formalized as a contextual preference bandit problem. Given a context (prompt) space $\mathcal{X}$ and response (action) space $\mathcal{A}$, the learner at each round $t$ selects a triple $(x_t, a_t, a_t')$ and receives a binary preference $y_t \in \{0, 1\}$, indicating which response was preferred by a human or a preference model. The feedback is modeled by the Bradley–Terry (BT) preference model, assuming a latent reward function $r_{\theta^*}(x, a) = \langle \theta^*, \phi(x, a) \rangle$, with a fixed feature map $\phi$ and $\theta^*$ bounded in $\ell_2$-norm, so that $\mathbb{P}(y_t = 1) = \sigma\big(r_{\theta^*}(x_t, a_t) - r_{\theta^*}(x_t, a_t')\big)$ with $\sigma$ the logistic function.
The central objective is identification of a policy $\pi$ maximizing the expected reward $J(\pi) = \mathbb{E}_{x \sim \rho}\,\mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[r_{\theta^*}(x, a)\big]$, where $\rho$ is the prompt distribution. For evaluation, the suboptimality gap $J(\pi^*) - J(\hat{\pi})$ (passive, active learning) or cumulative regret (online adaptation) is tracked as the primary metric (Li et al., 11 Feb 2025).
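As a concrete, deliberately toy rendering of this bandit formulation (not the paper's code), the following numpy sketch assumes $d$-dimensional prompt/response embeddings, an illustrative elementwise-product feature map, and a linear BT reward; `phi`, `true_theta`, and `policy` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
true_theta = rng.normal(size=d)
true_theta /= np.linalg.norm(true_theta)        # theta* bounded in l2-norm

def phi(x, a):
    """Hypothetical fixed feature map for a (prompt, response) embedding pair."""
    return x * a                                 # elementwise product, stays in R^d

def reward(x, a, theta=true_theta):
    return phi(x, a) @ theta                     # latent BT reward r(x, a)

def sample_preference(x, a, a_prime):
    """Bradley-Terry feedback: y = 1 means `a` is preferred over `a_prime`."""
    margin = reward(x, a) - reward(x, a_prime)
    return int(rng.random() < 1.0 / (1.0 + np.exp(-margin)))

def suboptimality_gap(policy, prompts, candidates):
    """Monte Carlo estimate of J(pi*) - J(pi_hat) over sampled prompts."""
    gaps = []
    for x in prompts:
        values = np.array([reward(x, a) for a in candidates])
        gaps.append(values.max() - values[policy(x)])
    return float(np.mean(gaps))
```

Here `policy` maps a prompt to the index of a candidate response; a fitted reward model would only inform that choice, while the gap itself is always measured under the latent reward.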
In the generative modeling context, as in ROCM, the RLHF objective over a single trajectory (typically the sequence of latents or generation steps) for a consistency model is
$$\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\, D_f\big(\pi \,\|\, \pi_{\mathrm{ref}}\big),$$
where $r(\cdot)$ is the scalar reward from a human-preference model and $D_f$ denotes an $f$-divergence regularizer measured against a reference (base) policy $\pi_{\mathrm{ref}}$ (Shekhar et al., 8 Mar 2025).
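The objective can be evaluated for one sampled trajectory as in the following sketch, which only assumes that per-step policy and reference distributions are objects a supplied `divergence` callable can compare; names such as `reward_model` and `policy_steps` are illustrative, not ROCM's API.

```python
def regularized_objective(x_final, policy_steps, ref_steps, reward_model, divergence, beta):
    """Single-trajectory objective: reward of the final sample minus a
    beta-weighted f-divergence accumulated over the generation steps."""
    reg = sum(divergence(p, q) for p, q in zip(policy_steps, ref_steps))
    return reward_model(x_final) - beta * reg
```

For Gaussian per-step distributions, the closed-form KL shown in the sketch under Section 3 could serve as `divergence`.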
2. One-Pass Online Mirror Descent and Reward Model Optimization
The defining algorithmic feature is the use of one-pass online mirror descent (OMD) for reward modeling and policy/value parameter updates. At each round $t$, the loss is the single-sample logistic loss
$$\ell_t(\theta) = -\,y_t \log \sigma\big(z_t(\theta)\big) - (1 - y_t)\log\big(1 - \sigma(z_t(\theta))\big),$$
with $z_t(\theta) = \langle \theta, \phi(x_t, a_t) - \phi(x_t, a_t') \rangle$. The OMD update is
$$\theta_{t+1} = \arg\min_{\theta \in \Theta}\; \big\langle \nabla \ell_t(\theta_t), \theta \big\rangle + \frac{1}{2\eta}\,\|\theta - \theta_t\|_{H_t}^2,$$
where $\nabla \ell_t(\theta_t)$ is the gradient and $H_t$ is an accumulated (local) Hessian. Crucially, the update admits a closed-form procedure, $\theta_{t+1} = \theta_t - \eta\, H_t^{-1} \nabla \ell_t(\theta_t)$ followed by a projection onto $\Theta$ in the $H_t$-norm, requiring only current sufficient statistics and not historical samples.
Standard MLE refitting would consume compute and memory that grow with the number of accumulated preference samples per round; one-pass OMD keeps both constant in the round index, amortized per iteration, using only the most recent gradient and Hessian information (Li et al., 11 Feb 2025).
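A minimal numpy sketch of such an update under the linear-reward/BT assumptions above; the step size `eta`, the ridge initialization `lam`, and the omission of the projection onto $\Theta$ are illustrative simplifications, not the paper's exact constants.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OnePassOMDRewardModel:
    """Streaming reward-model estimator: retains only theta and the
    accumulated local Hessian H; each observed preference is used once."""

    def __init__(self, dim, eta=0.5, lam=1.0):
        self.theta = np.zeros(dim)
        self.H = lam * np.eye(dim)                 # ridge-initialized Hessian accumulator
        self.eta = eta

    def update(self, feat_a, feat_b, y):
        """One preference (a vs. b, label y in {0,1}) -> one OMD step."""
        x = feat_a - feat_b                        # feature difference
        p = sigmoid(self.theta @ x)                # predicted P(a preferred)
        grad = (p - y) * x                         # gradient of the logistic loss
        self.H += p * (1.0 - p) * np.outer(x, x)   # local curvature of the logistic loss
        self.theta -= self.eta * np.linalg.solve(self.H, grad)  # closed-form OMD step

    def reward(self, feat):
        return self.theta @ feat
```

The per-update cost depends only on the feature dimension, never on the number of past preferences, which is the constant-memory streaming property described above.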
In ROCM, the direct reward maximization framework enables single-trajectory optimization by propagating gradients (by reparameterization) through the entire sequence of generation steps. All terms are first-order and admit low-variance estimation, distinguishing the method from REINFORCE/PPO and other zero-order policy gradient schemes (Shekhar et al., 8 Mar 2025).
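To make the first-order, reparameterized gradient flow concrete, here is a toy PyTorch sketch (an illustrative stand-in, not ROCM's implementation): a single Gaussian "generation step" replaces the consistency model, `reward` is a placeholder for a differentiable preference reward, and `beta` weights a closed-form reverse-KL regularizer against a frozen reference.

```python
import torch

# toy "generator": one reparameterized Gaussian step, x = mu + sigma * eps
mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)
mu_ref, sigma_ref = torch.zeros(4), torch.ones(4)   # frozen reference policy
beta = 0.1                                          # regularization weight

def reward(x):
    # stand-in for a differentiable human-preference reward model
    return -(x - 2.0).pow(2).sum()

opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)
for _ in range(200):
    eps = torch.randn(4)
    sigma = log_sigma.exp()
    x = mu + sigma * eps                            # reparameterization trick
    # closed-form reverse KL( N(mu, sigma) || N(mu_ref, sigma_ref) ), elementwise then summed
    kl = (torch.log(sigma_ref / sigma)
          + (sigma**2 + (mu - mu_ref)**2) / (2 * sigma_ref**2) - 0.5).sum()
    loss = -reward(x) + beta * kl                   # maximize reward minus beta * divergence
    opt.zero_grad()
    loss.backward()                                 # first-order gradients flow through x
    opt.step()
```

Because the reward gradient propagates through the sample itself, no score-function (REINFORCE-style) estimator is needed, which is the low-variance property claimed above.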
3. Distributional Regularization and Prevention of Reward Hacking
Distributional regularization is integral to single-trajectory RLHF in generative models. ROCM incorporates $f$-divergences, such as forward/reverse KL, Jensen–Shannon, squared Hellinger, and Fisher divergence, between the current policy and a fixed reference policy at each step of the generation trajectory. The regularizer is weighted by a hyperparameter $\beta$; cross-validation determines a suitable $\beta$ so that the regularization term is roughly an order of magnitude smaller than the reward signal.
The choice of divergence impacts both stabilization of training and generalization performance. Closed-form formulas are available for several divergences in the Gaussian case, but Monte Carlo estimation is sometimes necessary (e.g., JS for Gaussians). Regularized single-trajectory optimization mitigates reward hacking by discouraging the policy from departing excessively from the baseline, thereby avoiding collapse on specific metrics despite rising proxy reward scores (Shekhar et al., 8 Mar 2025).
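To make the two estimation regimes concrete for diagonal Gaussians (an illustrative setting; function names are assumptions, not taken from the papers): reverse KL admits a closed form, whereas Jensen–Shannon is estimated by Monte Carlo because the mixture density of two Gaussians is not Gaussian.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2), axis=-1)

def kl_gaussian(mu_p, sig_p, mu_q, sig_q):
    """Closed-form KL( N(mu_p, diag sig_p^2) || N(mu_q, diag sig_q^2) )."""
    return np.sum(np.log(sig_q / sig_p) + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2) - 0.5)

def js_gaussian_mc(mu_p, sig_p, mu_q, sig_q, n=10_000, rng=None):
    """Monte Carlo Jensen-Shannon: JS = 0.5*KL(p||m) + 0.5*KL(q||m), with m = (p + q)/2."""
    if rng is None:
        rng = np.random.default_rng(0)
    xp = mu_p + sig_p * rng.standard_normal((n, mu_p.size))
    xq = mu_q + sig_q * rng.standard_normal((n, mu_q.size))

    def log_m(x):
        return np.logaddexp(gaussian_logpdf(x, mu_p, sig_p),
                            gaussian_logpdf(x, mu_q, sig_q)) - np.log(2.0)

    kl_pm = np.mean(gaussian_logpdf(xp, mu_p, sig_p) - log_m(xp))
    kl_qm = np.mean(gaussian_logpdf(xq, mu_q, sig_q) - log_m(xq))
    return 0.5 * (kl_pm + kl_qm)
```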
4. Theoretical Guarantees, Complexities, and Proof Insights
The one-pass OMD-based single-trajectory RLHF admits formal guarantees on statistical and computational efficiency. Confidence sets are established such that, with high probability, the running estimate lies in a data-dependent ellipsoid around $\theta^*$ of the form $\|\theta_t - \theta^*\|_{H_t} \le \gamma_t$, where $H_t$ accumulates Hessian information and $\gamma_t$ is an explicit confidence radius. For passive data collection, the suboptimality gap is bounded with improved dependence on the reward model condition number relative to prior works. Each iteration, including matrix updates and projections, can be implemented using Hessian–vector products and conjugate-gradient methods in time that scales with the feature dimension rather than the number of past samples.
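Schematically, and only as an illustration of the generic shape of such guarantees (not the paper's exact statement), the confidence set and the gap bound for the greedy policy interact as
$$\mathcal{C}_t = \big\{\theta : \|\theta - \theta_t\|_{H_t} \le \gamma_t\big\}, \qquad J(\pi^*) - J(\hat{\pi}) \;\le\; 2\,\gamma_t\, \mathbb{E}_{x \sim \rho}\Big[\max_{a} \|\phi(x, a)\|_{H_t^{-1}}\Big],$$
by Cauchy–Schwarz in the $H_t$-norm whenever $\theta^* \in \mathcal{C}_t$.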
The proof technique fundamentally relies on the reduction to OMD with a tailored local norm, exploiting self-concordance of the logistic loss and bounding estimation error via concentration inequalities, Bregman divergence telescoping, and scenario-specific rounding strategies for the different deployment settings (passive, active, deployment-time adaptation) (Li et al., 11 Feb 2025).
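The Bregman telescoping step follows the standard online mirror descent analysis; schematically (a generic inequality stated for orientation, not the paper's exact lemma),
$$\sum_{t=1}^{T} \big\langle \nabla \ell_t(\theta_t),\, \theta_t - \theta^* \big\rangle \;\le\; \frac{D_\psi(\theta^*, \theta_1)}{\eta} \;+\; \frac{\eta}{2} \sum_{t=1}^{T} \big\|\nabla \ell_t(\theta_t)\big\|_{*,t}^{2},$$
where $D_\psi$ is the Bregman divergence of the mirror map and $\|\cdot\|_{*,t}$ is the dual of the local norm. Convexity of the logistic loss upper-bounds the cumulative loss gap by the left-hand side, the Bregman terms generated by successive mirror steps telescope, and self-concordance keeps the time-varying local norms mutually comparable, which is where the tailored local norm enters.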
5. Deployment Paradigms, Extensions, and Limitations
Single-trajectory RLHF supports several data-generation and deployment scenarios:
- Passive data collection: Data arise from a fixed historical log; single-pass OMD enables efficient assimilation of such logs without revisiting prior interactions.
- Active data collection: The algorithm selectively queries (context, action, action-prime) triples to maximize information gain, e.g., via uncertainty scoring (see the sketch after this list).
- Deployment-time adaptation: Contexts arrive online, and the learner dynamically balances exploitation and exploration per prompt for continual improvement.
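A hedged sketch of one plausible uncertainty score for active querying, reusing the Hessian accumulator from the OMD sketch above; the specific score (the $H_t^{-1}$-weighted norm of the feature difference) is a standard choice assumed here for illustration, not necessarily the paper's exact criterion.

```python
import numpy as np

def uncertainty_score(H, feat_a, feat_b):
    """Width of the confidence interval for the preference margin between a and b:
    the H^{-1}-weighted norm of the feature difference."""
    x = feat_a - feat_b
    return float(np.sqrt(x @ np.linalg.solve(H, x)))

def select_query(H, candidate_triples, featurize):
    """Pick the (context, action, action') triple with maximal uncertainty."""
    scores = [uncertainty_score(H, *featurize(c, a, ap)) for (c, a, ap) in candidate_triples]
    return candidate_triples[int(np.argmax(scores))]
```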
The only retained state is the current parameter vector and Hessian accumulator; past raw data are discarded after a single use. This architecture supports streaming or edge deployment settings.
Limitations include reliance on a fixed and known feature map $\phi$; theoretical extensions to dynamic $\phi$, as encountered in fine-tuning or online representation learning, remain open. The logistic BT preference model is currently assumed; extensions to alternative preference models (e.g., Plackett–Luce) are possible but involve nontrivial modifications.
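For reference, the Plackett–Luce model mentioned above assigns a ranking $a_{(1)} \succ \cdots \succ a_{(K)}$ of $K$ responses the likelihood (standard form, stated here for context rather than taken from the cited works)
$$\mathbb{P}\big(a_{(1)} \succ \cdots \succ a_{(K)} \mid x\big) \;=\; \prod_{k=1}^{K} \frac{\exp\big(r_\theta(x, a_{(k)})\big)}{\sum_{j=k}^{K} \exp\big(r_\theta(x, a_{(j)})\big)},$$
which reduces to the BT model for $K = 2$; the per-round loss is then no longer a single logistic term, which is the source of the nontrivial modifications noted above.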
Practical realizations utilize fast linear-algebra operations for efficiency, and, in active/data-efficient variants, rejection sampling to approximate uncertainty metrics (Li et al., 11 Feb 2025).
6. Empirical Results and Comparative Analysis
Evaluation of single-trajectory RLHF algorithms demonstrates statistical and computational efficiency. Experiments on Llama-3-8B-Instruct and Qwen2.5-7B-Instruct with Ultrafeedback-binarized and Mixture2 datasets validate the method’s statistical guarantees and effectiveness. For consistency models, ROCM demonstrates competitive or superior performance on multiple reward models and metrics (PickScore, HPSv2, CLIPScore, BLIPScore, Aesthetic Score, ImageReward) compared to baselines such as PPO and DDPO. Human evaluation attests to stronger overall preference and visual appeal.
Training-efficiency advantages are also realized: ROCM reaches target quality within approximately 15 GPU-hours, less than policy-gradient methods require. Ablation indicates that regularization is critical: unregularized single-trajectory RLHF achieves high proxy reward but suffers metric collapse, underscoring the necessity of appropriate divergence control to counteract reward hacking (Li et al., 11 Feb 2025, Shekhar et al., 8 Mar 2025).