Direct Reward Optimization (DRO)
- DRO is a framework in reinforcement learning where policies are optimized using cumulative trajectory rewards instead of per-step signals.
- It employs regularized least squares to estimate rewards from occupancy measures, addressing the reduced statistical resolution of episodic feedback.
- The approach supports both known and unknown transition settings, providing regret guarantees and linking trajectory feedback to practical RLHF applications.
Direct Reward Optimization (DRO) in reinforcement learning refers to policy optimization in settings where the agent does not have access to per-step rewards, but only observes a scalar signal representing the cumulative reward (trajectory feedback) over an entire episode. This setup, motivated by practical constraints on reward annotation, fundamentally impacts algorithm design and theoretical guarantees, distinguishing DRO from standard RL protocols that rely on granular, stepwise reward feedback (Efroni et al., 2020).
1. Formalization: Trajectory Feedback and Problem Definition
Direct Reward Optimization within the trajectory feedback paradigm is instantiated on an episodic Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, H)$, where:
- $\mathcal{S}$ and $\mathcal{A}$ denote finite state and action sets, with cardinalities $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$.
- $P(s' \mid s, a)$ is the transition probability, which may be unknown.
- $r(s, a) \in [0, 1]$ is the unknown expected per-step reward function.
- $H$ is the known episodic horizon.
In each episode $k = 1, \dots, K$, the agent selects a policy $\pi_k$, samples a trajectory $(s_1^k, a_1^k, \dots, s_H^k, a_H^k)$ under $\pi_k$, and observes only the total reward $\hat{V}^k = \sum_{h=1}^{H} R_h^k$, in contrast to the standard RL feedback model that reveals the per-step reward $R_h^k$ at each timestep. The regret over $K$ episodes is
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V^*(s_1) - V^{\pi_k}(s_1) \right),$$
where $V^{\pi}(s_1) = \mathbb{E}\!\left[ \sum_{h=1}^{H} r(s_h, a_h) \,\middle|\, \pi, s_1 \right]$ and $V^*(s_1) = \max_{\pi} V^{\pi}(s_1)$.
This framework compels algorithms to reconstruct per-step rewards and optimize policies based on coarse, trajectory-level feedback (Efroni et al., 2020).
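To make the feedback model concrete, the following minimal sketch (hypothetical environment and function names, not from Efroni et al., 2020) rolls out one episode and reveals only the visitation counts and the cumulative return, never the per-step rewards:

```python
import numpy as np

def run_episode_trajectory_feedback(P, r, policy, H, rng):
    """Roll out one episode and reveal only the cumulative reward.

    P:      transition tensor, shape (S, A, S)
    r:      expected per-step reward table, shape (S, A) -- hidden from the agent
    policy: int array of shape (H, S), the action taken in each state at each step
    rng:    numpy random Generator
    """
    S, A, _ = P.shape
    s = 0                              # fixed initial state (illustrative assumption)
    visit_counts = np.zeros((S, A))    # empirical visitation frequencies x_k
    total_reward = 0.0
    for h in range(H):
        a = policy[h, s]
        visit_counts[s, a] += 1
        # The per-step reward is realized internally but NOT revealed to the agent.
        total_reward += r[s, a] + rng.normal(scale=0.1)   # noisy realization (assumption)
        s = rng.choice(S, p=P[s, a])
    # Trajectory feedback: the agent observes only (visit_counts, total_reward).
    return visit_counts, total_reward
```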
2. Reward Estimation: Regularized Least Squares
Given direct reward signals only at the episode level, reward estimation is addressed by exploiting the linear relation between trajectory returns and occupancy measures. After $n$ episodes, for each episode $k \le n$ the empirical visitation frequencies $x_k \in \mathbb{R}^{SA}$ (the number of visits to each state-action pair during episode $k$) and the observed trajectory return $\hat{V}^k$ are collected. These form the data matrix $X_n \in \mathbb{R}^{n \times SA}$ (with rows $x_k^\top$) and the return vector $y_n = (\hat{V}^1, \dots, \hat{V}^n)^\top$.
The regularized least-squares estimator for the unknown reward vector $r \in \mathbb{R}^{SA}$ is
$$\hat{r}_n = \left( X_n^\top X_n + \lambda I \right)^{-1} X_n^\top y_n,$$
where $\lambda > 0$ is the regularization parameter. This estimator benefits from concentration results ensuring that, with high probability, the weighted estimation error $\| \hat{r}_n - r \|_{\bar{A}_n}$, where $\bar{A}_n = X_n^\top X_n + \lambda I$, is bounded by a confidence-radius sequence $\beta_n$ (Efroni et al., 2020).
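A minimal sketch of this regularized least-squares step, assuming the per-episode visitation vectors and returns have already been collected (function and variable names are illustrative):

```python
import numpy as np

def estimate_reward(visitation_vectors, returns, lam=1.0):
    """Ridge-regression estimate of the per-(s, a) reward vector.

    visitation_vectors: array of shape (n, S*A); row k is the flattened visit counts x_k
    returns:            array of shape (n,); entry k is the observed episode return V_hat^k
    lam:                regularization parameter lambda > 0
    """
    n, d = visitation_vectors.shape
    A_bar = visitation_vectors.T @ visitation_vectors + lam * np.eye(d)  # X^T X + lambda I
    u = visitation_vectors.T @ returns                                   # X^T y
    r_hat = np.linalg.solve(A_bar, u)                                    # ridge solution
    return r_hat, A_bar
```

The matrix `A_bar` is also what defines the weighted norm in the concentration bound above, so it is returned alongside the estimate for later use in bonuses or perturbations.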
3. Policy Optimization Algorithms: Known and Unknown Transitions
Known Transitions (OFUL-type Algorithm):
If $P$ is known, policy selection can be framed as an optimism-in-the-face-of-uncertainty linear bandit over the occupancy-measure set $\mathcal{K}(P)$: at episode $k$, the agent selects the occupancy measure $d_k = \arg\max_{d \in \mathcal{K}(P)} \max_{r' \in \mathcal{C}_{k-1}} \langle d, r' \rangle$, where $\mathcal{C}_{k-1}$ is the confidence ellipsoid centered at $\hat{r}_{k-1}$, and executes the policy $\pi_k$ that induces $d_k$. After executing $\pi_k$ and observing $\hat{V}^k$, the data matrices are updated. The resulting regret over $K$ episodes inherits the OFUL-type linear bandit rate, scaling as $\sqrt{K}$ with polynomial dependence on $S$, $A$, and $H$ (Efroni et al., 2020).
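The sketch below (illustrative names, not the paper's exact algorithm) uses a simplified, component-wise optimism bonus $\beta \lVert e_{s,a} \rVert_{\bar{A}^{-1}}$ in place of the full joint maximization over the confidence ellipsoid and the occupancy polytope, and then plans by backward induction with the known transitions:

```python
import numpy as np

def optimistic_policy_known_P(P, r_hat, A_bar, beta, H):
    """Plan with an optimistic reward when the transition tensor P is known.

    Simplification (assumption): each (s, a) reward is inflated by an elliptical
    bonus beta * sqrt([A_bar^{-1}]_{sa,sa}) instead of solving the joint
    ellipsoid-polytope maximization; the MDP is then solved exactly.
    """
    S, A, _ = P.shape
    A_inv = np.linalg.inv(A_bar)
    bonus = beta * np.sqrt(np.diag(A_inv)).reshape(S, A)   # per-(s, a) uncertainty width
    r_opt = r_hat.reshape(S, A) + bonus                    # optimistic reward table

    V = np.zeros(S)                        # value at step H+1
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):           # finite-horizon backward induction
        Q = r_opt + P @ V                  # Q[s, a] = r_opt[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```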
Unknown Transitions (UCBVI-TS Hybrid Algorithm):
When $P$ is unknown, the policy is chosen by constructing a plug-in transition estimate $\hat{P}_k$ from empirical counts, adding a Gaussian (Thompson-sampling-style) perturbation to the reward estimator, as well as an optimistic transition bonus $b_k^{p}$. The perturbed reward takes the form
$$\tilde{r}_k(s, a) = \hat{r}_k(s, a) + \xi_k(s, a) + b_k^{p}(s, a), \qquad \xi_k \sim \mathcal{N}\!\left(0, \, \beta_k^{2}\, \bar{A}_k^{-1}\right).$$
The policy is selected by solving the MDP with reward $\tilde{r}_k$ and empirical transitions $\hat{P}_k$ via dynamic programming. Regret in this setting again scales as $\sqrt{K}$, with polynomial dependence on $S$, $A$, and $H$ and an additional term attributable to transition estimation (Efroni et al., 2020).
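A minimal sketch of this perturbed-reward planning step, assuming a Gaussian perturbation with covariance $\beta^2 \bar{A}_k^{-1}$ and a generic $H/\sqrt{n(s,a)}$ transition bonus (both are illustrative choices, not necessarily the paper's exact constants):

```python
import numpy as np

def perturbed_policy_unknown_P(P_hat, r_hat, A_bar, counts, beta, H, rng):
    """Thompson-sampling-style reward perturbation plus a transition bonus,
    followed by dynamic programming on the empirical model P_hat.
    """
    S, A, _ = P_hat.shape
    A_inv = np.linalg.inv(A_bar)
    xi = rng.multivariate_normal(np.zeros(S * A), (beta ** 2) * A_inv)  # Gaussian perturbation
    bonus = H * np.sqrt(1.0 / np.maximum(counts, 1))                     # optimistic transition bonus (assumption)
    r_tilde = r_hat.reshape(S, A) + xi.reshape(S, A) + bonus             # perturbed/optimistic reward

    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):                                         # backward induction on (P_hat, r_tilde)
        Q = r_tilde + P_hat @ V
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```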
4. Theoretical Guarantees and Comparative Regret Bounds
Direct Reward Optimization under trajectory feedback leads to provable increases in regret compared to standard per-step RL, reflecting the loss of statistical resolution in reward signals:
- Per-step RL (minimax): regret on the order of $\sqrt{SAK}$, up to horizon factors.
- Trajectory feedback with known $P$: regret growing linearly in $SA$ (rather than as $\sqrt{SA}$) and as $\sqrt{K}$, matched by OFUL-type algorithms up to logarithmic factors.
- Trajectory feedback with unknown $P$: the same $\sqrt{K}$ rate with an additional term arising from transition estimation.
A key driver of this degradation is the reduction to a single scalar observation per episode, which inflates the effective noise scale by a factor that grows with the horizon $H$ in the linear bandit analysis and requires estimating the full $SA$-dimensional reward vector from aggregate observations. When transitions are unknown, an extra term arises from transition estimation.
A plausible implication is that DRO via trajectory feedback should be preferred in circumstances where per-step reward annotation is not feasible, but practitioners must accept increased regret scaling, particularly in high-dimensional (large $SA$) regimes (Efroni et al., 2020).
5. Practical Implementation and Guidelines
Instantiating DRO with trajectory feedback requires the following steps (a consolidated sketch of the loop follows the list):
- Initializing statistics: the regularized design matrix $\bar{A}_0 = \lambda I$, the running vector $u_0 = 0$ (the accumulated $X^\top y$), and, if $P$ is unknown, visitation counts $n_0(s, a) = 0$ and $n_0(s, a, s') = 0$.
- For each episode $k = 1, \dots, K$:
  - Estimate the visitation vector $x_k$ empirically from the visited trajectory.
  - Update $\bar{A}_k = \bar{A}_{k-1} + x_k x_k^\top$ and $u_k = u_{k-1} + x_k \hat{V}^k$.
  - Compute $\hat{r}_k = \bar{A}_k^{-1} u_k$.
  - If $P$ is unknown, build $\hat{P}_k(s' \mid s, a)$ from empirical counts.
  - For policy selection: compute the transition bonus $b_k^{p}(s, a)$, draw the perturbation $\xi_k$, define the optimistic/perturbed reward $\tilde{r}_k$, and solve the empirical MDP by dynamic programming to obtain $\pi_{k+1}$.
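Putting the steps above together, here is a consolidated sketch of the learning loop under unknown transitions. It reuses the illustrative helper `perturbed_policy_unknown_P` from the earlier sketch, and `env_step_fn` is a hypothetical rollout routine that returns the visitation counts, the episode return, and empirical transition counts:

```python
import numpy as np

def dro_trajectory_feedback_loop(env_step_fn, S, A, H, K, lam=1.0, beta=1.0, seed=0):
    """Skeleton of the DRO loop with trajectory feedback: ridge statistics,
    plug-in transitions, and perturbed-reward planning."""
    rng = np.random.default_rng(seed)
    d = S * A
    A_bar = lam * np.eye(d)                        # regularized design matrix A_bar_0
    u = np.zeros(d)                                # running X^T y
    r_hat = np.zeros(d)
    n_sa = np.zeros((S, A))                        # visit counts n(s, a)
    n_sas = np.zeros((S, A, S))                    # transition counts n(s, a, s')
    policy = np.zeros((H, S), dtype=int)           # arbitrary initial policy

    for k in range(K):
        visits, total_reward, trans_counts = env_step_fn(policy)
        x_k = visits.reshape(d)
        A_bar += np.outer(x_k, x_k)                # A_bar_k = A_bar_{k-1} + x_k x_k^T
        u += x_k * total_reward                    # u_k = u_{k-1} + x_k * V_hat^k
        r_hat = np.linalg.solve(A_bar, u)          # ridge estimate of the reward vector
        n_sa += visits
        n_sas += trans_counts
        # Plug-in transitions; unvisited pairs default to uniform (assumption).
        P_hat = np.where(n_sa[:, :, None] > 0,
                         n_sas / np.maximum(n_sa, 1)[:, :, None],
                         1.0 / S)
        policy = perturbed_policy_unknown_P(P_hat, r_hat, A_bar, n_sa, beta, H, rng)
    return r_hat, policy
```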
Key assumptions: Known horizon $H$, sub-Gaussian or bounded reward noise, stationary transitions, positive regularization parameter $\lambda > 0$, and access to an exact dynamic programming solver.
Rarely-switching variants, which reduce the frequency of covariance updates, can lower per-episode computational cost with minor regret penalties (Efroni et al., 2020).
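A standard way to realize such a variant is a determinant-doubling trigger, sketched below under the assumption that replanning happens only when $\log\det \bar{A}_k$ has grown by a fixed amount since the last policy switch (the exact rule in the paper may differ):

```python
import numpy as np

def should_replan(A_bar, logdet_at_last_switch, threshold=np.log(2.0)):
    """Determinant-doubling trigger: recompute the policy only when
    log det(A_bar) has grown by `threshold` since the last switch.
    Returns (replan_flag, current_logdet)."""
    _, logdet = np.linalg.slogdet(A_bar)
    return logdet - logdet_at_last_switch >= threshold, logdet
```

Guarding the policy-update step in the main loop with this flag amortizes the matrix inversion and dynamic-programming cost over many episodes.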
6. Contextual Significance and Relationship to RLHF
Direct Reward Optimization using trajectory feedback is especially relevant for settings where granular expert reward annotation is unavailable, such as single-trajectory RL with human feedback. The methodology links the sequential RLHF paradigm to a linear bandit perspective, in which the decision set consists of occupancy measures of dimension $SA$ and learning is based on a single aggregate signal per trajectory.
This suggests that, despite statistical inefficiency, DRO with trajectory feedback is essential for domains where only episodic, scalar annotations are available. The established regret bounds and algorithmic frameworks provide a foundation for RLHF procedures constrained by limited feedback fidelity (Efroni et al., 2020).