Direct Reward Optimization (DRO)
- DRO is a framework in reinforcement learning where policies are optimized using cumulative trajectory rewards instead of per-step signals.
- It employs regularized least squares to estimate rewards from occupancy measures, addressing the reduced statistical resolution of episodic feedback.
- The approach supports both known and unknown transition settings, providing regret guarantees and linking trajectory feedback to practical RLHF applications.
Direct Reward Optimization (DRO) in reinforcement learning refers to policy optimization in settings where the agent does not have access to per-step rewards, but only observes a scalar signal representing the cumulative reward (trajectory feedback) over an entire episode. This setup, motivated by practical constraints on reward annotation, fundamentally impacts algorithm design and theoretical guarantees, distinguishing DRO from standard RL protocols that rely on granular, stepwise reward feedback (Efroni et al., 2020).
1. Formalization: Trajectory Feedback and Problem Definition
Direct Reward Optimization within the trajectory feedback paradigm is instantiated on an episodic Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, H)$, where:
- $\mathcal{S}$ and $\mathcal{A}$ denote finite state and action sets, with cardinalities $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$.
- $P(s' \mid s, a)$ is the transition probability, which may be unknown.
- $r(s, a) \in [0, 1]$ is the unknown expected per-step reward function.
- $H$ is the known episodic horizon.
In each episode $k = 1, \dots, K$, the agent selects a policy $\pi_k$, samples a trajectory $(s_1^k, a_1^k, \dots, s_H^k, a_H^k)$ under $\pi_k$, and observes only the total reward $\hat{V}^k = \sum_{h=1}^{H} R_h^k$, in contrast to the standard RL feedback model that reveals the per-step reward $R_h^k$ at each timestep. The regret over $K$ episodes is
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V^*(s_1) - V^{\pi_k}(s_1) \right),$$
where $V^{\pi}(s_1) = \mathbb{E}\!\left[ \sum_{h=1}^{H} r(s_h, a_h) \,\middle|\, \pi, s_1 \right]$ and $V^*(s_1) = \max_{\pi} V^{\pi}(s_1)$.
This framework compels algorithms to reconstruct per-step rewards and optimize policies based on coarse, trajectory-level feedback (Efroni et al., 2020).
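To make the feedback model concrete, the following minimal sketch (hypothetical environment and function names, not from Efroni et al., 2020) rolls out one episode and reveals only the visitation counts and the cumulative return, never the per-step rewards:

```python
import numpy as np

def run_episode_trajectory_feedback(P, r, policy, H, rng):
    """Roll out one episode and reveal only the cumulative reward.

    P:      transition tensor, shape (S, A, S)
    r:      expected per-step reward table, shape (S, A) -- hidden from the agent
    policy: int array of shape (H, S), the action taken in each state at each step
    rng:    numpy random Generator
    """
    S, A, _ = P.shape
    s = 0                              # fixed initial state (illustrative assumption)
    visit_counts = np.zeros((S, A))    # empirical visitation frequencies x_k
    total_reward = 0.0
    for h in range(H):
        a = policy[h, s]
        visit_counts[s, a] += 1
        # The per-step reward is realized internally but NOT revealed to the agent.
        total_reward += r[s, a] + rng.normal(scale=0.1)   # noisy realization (assumption)
        s = rng.choice(S, p=P[s, a])
    # Trajectory feedback: the agent observes only (visit_counts, total_reward).
    return visit_counts, total_reward
```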
2. Reward Estimation: Regularized Least Squares
Given direct reward signals only at the episode level, reward estimation is addressed by exploiting the linear relation between trajectory returns and occupancy measures. After $n$ episodes, for each episode $k \le n$ the empirical visitation frequencies $x_k \in \mathbb{R}^{SA}$ (the number of visits to each state-action pair during episode $k$) and the observed trajectory return $\hat{V}^k$ are collected. These form the data matrix $X_n \in \mathbb{R}^{n \times SA}$ (with rows $x_k^\top$) and the return vector $y_n = (\hat{V}^1, \dots, \hat{V}^n)^\top$.
The regularized least-squares estimator for the unknown reward vector $r \in \mathbb{R}^{SA}$ is
$$\hat{r}_n = \left( X_n^\top X_n + \lambda I \right)^{-1} X_n^\top y_n,$$
where $\lambda > 0$ is the regularization parameter. This estimator benefits from concentration results ensuring that, with high probability, the weighted estimation error $\| \hat{r}_n - r \|_{\bar{A}_n}$, where $\bar{A}_n = X_n^\top X_n + \lambda I$, is bounded by a confidence-radius sequence $\beta_n$ (Efroni et al., 2020).
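A minimal sketch of this regularized least-squares step, assuming the per-episode visitation vectors and returns have already been collected (function and variable names are illustrative):

```python
import numpy as np

def estimate_reward(visitation_vectors, returns, lam=1.0):
    """Ridge-regression estimate of the per-(s, a) reward vector.

    visitation_vectors: array of shape (n, S*A); row k is the flattened visit counts x_k
    returns:            array of shape (n,); entry k is the observed episode return V_hat^k
    lam:                regularization parameter lambda > 0
    """
    n, d = visitation_vectors.shape
    A_bar = visitation_vectors.T @ visitation_vectors + lam * np.eye(d)  # X^T X + lambda I
    u = visitation_vectors.T @ returns                                   # X^T y
    r_hat = np.linalg.solve(A_bar, u)                                    # ridge solution
    return r_hat, A_bar
```

The matrix `A_bar` is also what defines the weighted norm in the concentration bound above, so it is returned alongside the estimate for later use in bonuses or perturbations.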
3. Policy Optimization Algorithms: Known and Unknown Transitions
Known Transitions (OFUL-type Algorithm):
If $P$ is known, policy selection can be framed as an optimism-in-the-face-of-uncertainty linear bandit over the occupancy-measure set $\mathcal{K}(P)$: at episode $k$, the agent selects the occupancy measure $d_k = \arg\max_{d \in \mathcal{K}(P)} \max_{r' \in \mathcal{C}_{k-1}} \langle d, r' \rangle$, where $\mathcal{C}_{k-1}$ is the confidence ellipsoid centered at $\hat{r}_{k-1}$, and executes the policy $\pi_k$ that induces $d_k$. After executing $\pi_k$ and observing $\hat{V}^k$, the data matrices are updated. The resulting regret over $K$ episodes inherits the OFUL-type linear bandit rate, scaling as $\sqrt{K}$ with polynomial dependence on $S$, $A$, and $H$ (Efroni et al., 2020).
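The sketch below (illustrative names, not the paper's exact algorithm) uses a simplified, component-wise optimism bonus $\beta \lVert e_{s,a} \rVert_{\bar{A}^{-1}}$ in place of the full joint maximization over the confidence ellipsoid and the occupancy polytope, and then plans by backward induction with the known transitions:

```python
import numpy as np

def optimistic_policy_known_P(P, r_hat, A_bar, beta, H):
    """Plan with an optimistic reward when the transition tensor P is known.

    Simplification (assumption): each (s, a) reward is inflated by an elliptical
    bonus beta * sqrt([A_bar^{-1}]_{sa,sa}) instead of solving the joint
    ellipsoid-polytope maximization; the MDP is then solved exactly.
    """
    S, A, _ = P.shape
    A_inv = np.linalg.inv(A_bar)
    bonus = beta * np.sqrt(np.diag(A_inv)).reshape(S, A)   # per-(s, a) uncertainty width
    r_opt = r_hat.reshape(S, A) + bonus                    # optimistic reward table

    V = np.zeros(S)                        # value at step H+1
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):           # finite-horizon backward induction
        Q = r_opt + P @ V                  # Q[s, a] = r_opt[s, a] + sum_s' P[s, a, s'] V[s']
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```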
Unknown Transitions (UCBVI-TS Hybrid Algorithm):
When $P$ is unknown, the policy is chosen by constructing a plug-in transition estimate $\hat{P}_k$ from empirical counts, adding a Gaussian (Thompson-sampling-style) perturbation to the reward estimator, as well as an optimistic transition bonus $b_k^{p}$. The perturbed reward takes the form
$$\tilde{r}_k(s, a) = \hat{r}_k(s, a) + \xi_k(s, a) + b_k^{p}(s, a), \qquad \xi_k \sim \mathcal{N}\!\left(0, \, \beta_k^{2}\, \bar{A}_k^{-1}\right).$$
The policy is selected by solving the MDP with reward $\tilde{r}_k$ and empirical transitions $\hat{P}_k$ via dynamic programming. Regret in this setting again scales as $\sqrt{K}$, with polynomial dependence on $S$, $A$, and $H$ and an additional term attributable to transition estimation (Efroni et al., 2020).
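A minimal sketch of this perturbed-reward planning step, assuming a Gaussian perturbation with covariance $\beta^2 \bar{A}_k^{-1}$ and a generic $H/\sqrt{n(s,a)}$ transition bonus (both are illustrative choices, not necessarily the paper's exact constants):

```python
import numpy as np

def perturbed_policy_unknown_P(P_hat, r_hat, A_bar, counts, beta, H, rng):
    """Thompson-sampling-style reward perturbation plus a transition bonus,
    followed by dynamic programming on the empirical model P_hat.
    """
    S, A, _ = P_hat.shape
    A_inv = np.linalg.inv(A_bar)
    xi = rng.multivariate_normal(np.zeros(S * A), (beta ** 2) * A_inv)  # Gaussian perturbation
    bonus = H * np.sqrt(1.0 / np.maximum(counts, 1))                     # optimistic transition bonus (assumption)
    r_tilde = r_hat.reshape(S, A) + xi.reshape(S, A) + bonus             # perturbed/optimistic reward

    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):                                         # backward induction on (P_hat, r_tilde)
        Q = r_tilde + P_hat @ V
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```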
4. Theoretical Guarantees and Comparative Regret Bounds
Direct Reward Optimization under trajectory feedback leads to provable increases in regret compared to standard per-step RL, reflecting the loss of statistical resolution in reward signals:
- Per-step RL (minimax): regret on the order of $\sqrt{SAK}$, up to horizon factors.
- Trajectory feedback with known $P$: regret growing linearly in $SA$ (rather than as $\sqrt{SA}$) and as $\sqrt{K}$, matched by OFUL-type algorithms up to logarithmic factors.
- Trajectory feedback with unknown $P$: the same $\sqrt{K}$ rate with an additional term arising from transition estimation.
A key driver of this degradation is the reduction to a single scalar observation per episode, which inflates the effective noise scale by a factor that grows with the horizon $H$ in the linear bandit analysis and requires estimating the full $SA$-dimensional reward vector from aggregate observations. When transitions are unknown, an extra term arises from transition estimation.
A plausible implication is that DRO via trajectory feedback should be preferred in circumstances where per-step reward annotation is not feasible, but practitioners must accept increased regret scaling, particularly in high-dimensional (large $SA$) regimes (Efroni et al., 2020).
5. Practical Implementation and Guidelines
Instantiating DRO with trajectory feedback requires the following steps (a consolidated sketch of the loop follows the list):
- Initializing statistics: the regularized design matrix $\bar{A}_0 = \lambda I$, the running vector $u_0 = 0$ (the accumulated $X^\top y$), and, if $P$ is unknown, visitation counts $n_0(s, a) = 0$ and $n_0(s, a, s') = 0$.
- For each episode $k = 1, \dots, K$:
  - Estimate the visitation vector $x_k$ empirically from the visited trajectory.
  - Update $\bar{A}_k = \bar{A}_{k-1} + x_k x_k^\top$ and $u_k = u_{k-1} + x_k \hat{V}^k$.
  - Compute $\hat{r}_k = \bar{A}_k^{-1} u_k$.
  - If $P$ is unknown, build $\hat{P}_k(s' \mid s, a)$ from empirical counts.
  - For policy selection: compute the transition bonus $b_k^{p}(s, a)$, draw the perturbation $\xi_k$, define the optimistic/perturbed reward $\tilde{r}_k$, and solve the empirical MDP by dynamic programming to obtain $\pi_{k+1}$.
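Putting the steps above together, here is a consolidated sketch of the learning loop under unknown transitions. It reuses the illustrative helper `perturbed_policy_unknown_P` from the earlier sketch, and `env_step_fn` is a hypothetical rollout routine that returns the visitation counts, the episode return, and empirical transition counts:

```python
import numpy as np

def dro_trajectory_feedback_loop(env_step_fn, S, A, H, K, lam=1.0, beta=1.0, seed=0):
    """Skeleton of the DRO loop with trajectory feedback: ridge statistics,
    plug-in transitions, and perturbed-reward planning."""
    rng = np.random.default_rng(seed)
    d = S * A
    A_bar = lam * np.eye(d)                        # regularized design matrix A_bar_0
    u = np.zeros(d)                                # running X^T y
    r_hat = np.zeros(d)
    n_sa = np.zeros((S, A))                        # visit counts n(s, a)
    n_sas = np.zeros((S, A, S))                    # transition counts n(s, a, s')
    policy = np.zeros((H, S), dtype=int)           # arbitrary initial policy

    for k in range(K):
        visits, total_reward, trans_counts = env_step_fn(policy)
        x_k = visits.reshape(d)
        A_bar += np.outer(x_k, x_k)                # A_bar_k = A_bar_{k-1} + x_k x_k^T
        u += x_k * total_reward                    # u_k = u_{k-1} + x_k * V_hat^k
        r_hat = np.linalg.solve(A_bar, u)          # ridge estimate of the reward vector
        n_sa += visits
        n_sas += trans_counts
        # Plug-in transitions; unvisited pairs default to uniform (assumption).
        P_hat = np.where(n_sa[:, :, None] > 0,
                         n_sas / np.maximum(n_sa, 1)[:, :, None],
                         1.0 / S)
        policy = perturbed_policy_unknown_P(P_hat, r_hat, A_bar, n_sa, beta, H, rng)
    return r_hat, policy
```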
Key assumptions: Known horizon $H$, sub-Gaussian or bounded reward noise, stationary transitions, positive regularization parameter $\lambda > 0$, and access to an exact dynamic programming solver.
Rarely-switching variants, which reduce the frequency of covariance updates, can lower per-episode computational cost with minor regret penalties (Efroni et al., 2020).
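A standard way to realize such a variant is a determinant-doubling trigger, sketched below under the assumption that replanning happens only when $\log\det \bar{A}_k$ has grown by a fixed amount since the last policy switch (the exact rule in the paper may differ):

```python
import numpy as np

def should_replan(A_bar, logdet_at_last_switch, threshold=np.log(2.0)):
    """Determinant-doubling trigger: recompute the policy only when
    log det(A_bar) has grown by `threshold` since the last switch.
    Returns (replan_flag, current_logdet)."""
    _, logdet = np.linalg.slogdet(A_bar)
    return logdet - logdet_at_last_switch >= threshold, logdet
```

Guarding the policy-update step in the main loop with this flag amortizes the matrix inversion and dynamic-programming cost over many episodes.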
6. Contextual Significance and Relationship to RLHF
Direct Reward Optimization using trajectory feedback is especially relevant for settings where granular expert reward annotation is unavailable, such as single-trajectory RL with human feedback. The methodology links the sequential RLHF paradigm to a linear bandit perspective, in which the decision set consists of occupancy measures of dimension $SA$ and learning is based on a single aggregate signal per trajectory.
This suggests that, despite statistical inefficiency, DRO with trajectory feedback is essential for domains where only episodic, scalar annotations are available. The established regret bounds and algorithmic frameworks provide a foundation for RLHF procedures constrained by limited feedback fidelity (Efroni et al., 2020).