
Direct Reward Optimization (DRO)

Updated 11 January 2026
  • DRO is a framework in reinforcement learning where policies are optimized using cumulative trajectory rewards instead of per-step signals.
  • It employs regularized least squares to estimate rewards from occupancy measures, addressing the reduced statistical resolution of episodic feedback.
  • The approach supports both known and unknown transition settings, providing regret guarantees and linking trajectory feedback to practical RLHF applications.

Direct Reward Optimization (DRO) in reinforcement learning refers to policy optimization in settings where the agent does not have access to per-step rewards, but only observes a scalar signal representing the cumulative reward (trajectory feedback) over an entire episode. This setup, motivated by practical constraints on reward annotation, fundamentally impacts algorithm design and theoretical guarantees, distinguishing DRO from standard RL protocols that rely on granular, stepwise reward feedback (Efroni et al., 2020).

1. Formalization: Trajectory Feedback and Problem Definition

Direct Reward Optimization within the trajectory feedback paradigm is instantiated on an episodic Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, P, r, H)$, where:

  • $\mathcal{S}, \mathcal{A}$ denote finite state and action sets, with cardinalities $S$ and $A$.
  • $P(s'|s,a)$ is the transition probability, which may be unknown.
  • $r(s,a) \in [0,1]$ is the unknown expected per-step reward function.
  • $H$ is the known episodic horizon.

In each episode $k$, the agent selects a policy $\pi_k$, samples a trajectory $\tau_k = \{(s_h^k, a_h^k)\}_{h=1}^H$ under $P$, and observes only the total reward $R_k = \sum_{h=1}^H r(s_h^k, a_h^k)$, in contrast to the standard RL feedback model, which reveals $r(s_h^k, a_h^k)$ at each timestep. The regret over $K$ episodes is

$$\mathcal{R}(K) = \sum_{k=1}^K \left[ V^*_1(s_1^k) - V_1^{\pi_k}(s_1^k) \right],$$

where $V_1^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_1 = s]$.

This framework compels algorithms to reconstruct per-step rewards and optimize policies based on coarse, trajectory-level feedback (Efroni et al., 2020).
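As a concrete illustration of this protocol, here is a minimal tabular sketch in Python; the array conventions `P[s, a, s']`, `r[s, a]`, the deterministic non-stationary `policy[h, s]`, and the function name are illustrative assumptions, not notation from the paper. It rolls out one episode and exposes only the scalar trajectory return to the learner:

```python
import numpy as np

def rollout_trajectory_feedback(P, r, policy, s0, H, rng):
    """Roll out one episode of length H; return the visited (s, a) pairs
    and the trajectory-level reward signal sum_h r(s_h, a_h)."""
    s, visits, total_reward = s0, [], 0.0
    for h in range(H):
        a = policy[h, s]                        # deterministic action pi_k(h, s)
        visits.append((s, a))
        total_reward += r[s, a]                 # accumulated internally only
        s = rng.choice(P.shape[2], p=P[s, a])   # sample s' ~ P(. | s, a)
    return visits, total_reward                 # learner sees only the scalar total
```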

2. Reward Estimation: Regularized Least Squares

Given direct reward signals only at the episode level, reward estimation is addressed by exploiting the linear relation between trajectory returns and occupancy measures. After $k-1$ episodes, the empirical visitation frequencies $\hat{d}_i(s,a)$ and the observed trajectory return $\hat{V}_i$ are collected for each episode $i$. These form the data matrix $D_{k-1}$ (with rows $\hat{d}_i^\top$) and the return vector $Y_{k-1} = (\hat{V}_i)_i$.

The regularized least-squares estimator for the unknown reward vector $r$ is

$$A_{k-1} = D_{k-1}^\top D_{k-1} + \lambda I, \quad \hat{r}_{k-1} = A_{k-1}^{-1} D_{k-1}^\top Y_{k-1},$$

where $\lambda > 0$ is the regularization parameter. This estimator benefits from concentration results ensuring that, with high probability, the estimation error $\lVert r - \hat{r}_k \rVert_{A_k}$ is bounded by a sequence $\beta_k = O\big(\sqrt{SAH\log(\ldots)/\delta}\big)$ (Efroni et al., 2020).
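A minimal NumPy sketch of this estimator, assuming the occupancy measures are flattened into vectors of dimension $SA$ (the function and variable names are illustrative):

```python
import numpy as np

def estimate_reward(D, Y, lam):
    """Regularized least-squares reward estimate from trajectory returns.

    D   : (k-1, S*A) matrix whose i-th row is the empirical visitation vector d_i
    Y   : (k-1,) vector of observed trajectory returns V_i
    lam : regularization parameter lambda > 0
    """
    dim = D.shape[1]
    A = D.T @ D + lam * np.eye(dim)        # A_{k-1} = D^T D + lambda I
    r_hat = np.linalg.solve(A, D.T @ Y)    # r_hat = A_{k-1}^{-1} D^T Y
    return r_hat, A
```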

3. Policy Optimization Algorithms: Known and Unknown Transitions

Known Transitions (OFUL-type Algorithm):

If $P$ is known, policy selection can be framed as an optimism-in-the-face-of-uncertainty linear bandit over the set of occupancy measures $\{d_\pi\}$:

$$\pi_k \leftarrow \arg\max_{\pi} \left[ d_\pi^\top \hat{r}_{k-1} + \beta_{k-1} \lVert d_\pi \rVert_{A_{k-1}^{-1}} \right].$$

After executing $\pi_k$ and observing $(\hat{d}_k, \hat{V}_k)$, the data matrices are updated. The regret over $K$ episodes has the bound

$$\mathcal{R}(K) \leq O\left(SAH\sqrt{K\log(KH/\delta)}\right).$$
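The sketch below evaluates the optimistic objective for explicitly given candidate occupancy measures; the actual algorithm maximizes over the full occupancy-measure polytope (e.g., via convex optimization), so this finite enumeration and the helper names are purely illustrative:

```python
import numpy as np

def optimistic_value(d_pi, r_hat, A, beta):
    """d_pi^T r_hat + beta * ||d_pi||_{A^{-1}} for one occupancy vector d_pi (length S*A)."""
    bonus = np.sqrt(d_pi @ np.linalg.solve(A, d_pi))   # ||d_pi||_{A^{-1}}
    return d_pi @ r_hat + beta * bonus

def select_optimistic_policy(occupancies, r_hat, A, beta):
    """Return the index of the candidate whose occupancy measure maximizes the optimistic value."""
    scores = [optimistic_value(d, r_hat, A, beta) for d in occupancies]
    return int(np.argmax(scores))
```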

Unknown Transitions (UCBVI-TS Hybrid Algorithm):

When $P$ is unknown, the policy is chosen by constructing plug-in transition estimates $\bar{P}_{k-1}$, adding a Gaussian (Thompson-sampling style) perturbation $\xi_k \sim \mathcal{N}(0, v_k^2 A_{k-1}^{-1})$ to the reward estimator, as well as an optimistic transition bonus $b_{k-1}^{pv}(s,a) = O(H/\sqrt{n_{k-1}(s,a)})$. The perturbed reward is

$$\tilde{r}^b_k(s,a) = \hat{r}_{k-1}(s,a) + \xi_k(s,a) + b^{pv}_{k-1}(s,a).$$

The policy is selected by solving the MDP with reward $\tilde{r}^b_k$ and empirical transitions $\bar{P}_{k-1}$ via dynamic programming. Regret in this setting scales as $\tilde{O}(S^2 A^{3/2} H^{3/2} \sqrt{K})$ (Efroni et al., 2020).
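A sketch of the perturbed-reward construction, assuming $\hat{r}_{k-1}$ has been reshaped to an $(S, A)$ array, $A_{k-1}^{-1}$ is available over the flattened $SA$-dimensional space, and the bonus uses an illustrative constant of 1:

```python
import numpy as np

def perturbed_optimistic_reward(r_hat, A_inv, counts, H, v_k, rng):
    """Build r_tilde(s, a) = r_hat(s, a) + xi(s, a) + b(s, a) for planning.

    r_hat  : (S, A) least-squares reward estimate
    A_inv  : (S*A, S*A) inverse design matrix A_{k-1}^{-1}
    counts : (S, A) visit counts n_{k-1}(s, a)
    """
    S, A = r_hat.shape
    # Thompson-sampling style perturbation xi ~ N(0, v_k^2 * A_{k-1}^{-1})
    xi = rng.multivariate_normal(np.zeros(S * A), (v_k ** 2) * A_inv).reshape(S, A)
    # optimistic transition bonus of order H / sqrt(n(s, a))
    bonus = H / np.sqrt(np.maximum(counts, 1.0))
    return r_hat + xi + bonus
```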

4. Theoretical Guarantees and Comparative Regret Bounds

Direct Reward Optimization under trajectory feedback leads to provable increases in regret compared to standard per-step RL, reflecting the loss of statistical resolution in reward signals:

  • Per-step RL (minimax): $\Theta(\sqrt{SAHT})$
  • Trajectory feedback with known $P$: $\Theta(SA\sqrt{HK})$ (matched by OFUL-type algorithms up to logarithmic factors)
  • Trajectory feedback with unknown $P$: $\tilde{O}(S^2 A^{3/2} H^{3/2} \sqrt{K})$

A key driver of this degradation is the reduction to a single scalar feedback per episode, which inflates the effective noise scale by a factor of $\sqrt{H}$ in the linear bandit analysis. When transitions are unknown, an extra $\sqrt{S}$ factor arises from transition estimation.
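A heuristic way to see the $\sqrt{H}$ inflation, under the simplifying assumption that the per-step reward noise terms $\eta_h^k$ are independent and $\sigma$-sub-Gaussian, is that the observed return aggregates $H$ noise terms:

$$\hat{V}_k - \hat{d}_k^\top r = \sum_{h=1}^{H} \eta_h^k, \qquad \operatorname{Var}\Big(\sum_{h=1}^{H} \eta_h^k\Big) = H\sigma^2 \;\Longrightarrow\; \text{effective noise scale} \propto \sigma\sqrt{H}.$$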

A plausible implication is that DRO via trajectory feedback should be preferred in circumstances where per-step reward annotation is not feasible, but practitioners must accept increased regret scaling, particularly in regimes where $S$ and $A$ are large (Efroni et al., 2020).

5. Practical Implementation and Guidelines

Instantiating DRO with trajectory feedback requires:

  • Initializing statistics: $\lambda = \Theta(H)$, counts $n_0(s,a) = 0$, $A_0 = \lambda I$, $Y_0 = 0$.
  • For each episode $k$:

    1. Estimate $\hat{d}_k(s,a)$ empirically from the visited trajectory.
    2. Update $A_k \leftarrow A_{k-1} + \hat{d}_k \hat{d}_k^\top$ and $Y_k \leftarrow Y_{k-1} + \hat{d}_k \hat{V}_k$.
    3. Compute $\hat{r}_k = A_k^{-1} Y_k$.
    4. If $P$ is unknown, build $\bar{P}_k(s'|s,a)$ from empirical counts.
    5. For policy selection: compute the bonus $b^{pv}_k(s,a)$, draw the perturbation $\xi_{k+1}$, define the optimistic/perturbed reward $\tilde{r}^b_{k+1}$, and solve the empirical MDP by dynamic programming (a solver sketch follows this list) to obtain $\pi_{k+1}$.
  • Key assumptions: known horizon $H$, sub-Gaussian or bounded reward noise, stationary transitions, a positive regularization parameter $\lambda$, and access to an exact dynamic programming solver.
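A minimal sketch of the finite-horizon dynamic-programming solver assumed in step 5, operating on empirical transitions and the perturbed reward (the array shapes and function name are illustrative):

```python
import numpy as np

def plan_finite_horizon(P_bar, r_tilde, H):
    """Backward induction on the empirical MDP (P_bar, r_tilde).

    P_bar   : (S, A, S) empirical transition probabilities P_bar(s' | s, a)
    r_tilde : (S, A) perturbed/optimistic reward
    Returns a deterministic non-stationary policy of shape (H, S).
    """
    S, A = r_tilde.shape
    V = np.zeros(S)                      # value-to-go beyond the last step
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r_tilde + P_bar @ V          # Q_h(s, a) = r_tilde(s, a) + E_{s' ~ P_bar}[V_{h+1}(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```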

Rarely-switching variants, which reduce the frequency of covariance updates, can lower per-episode computational cost with minor regret penalties (Efroni et al., 2020).

6. Contextual Significance and Relationship to RLHF

Direct Reward Optimization using trajectory feedback is especially relevant for settings where granular expert reward annotation is unavailable, such as single-trajectory RL with human feedback. The methodology links the sequential RLHF paradigm to a linear bandit perspective, where the action space has dimension $SA$ and learning is based on one aggregate signal per trajectory.

This suggests that, despite statistical inefficiency, DRO with trajectory feedback is essential for domains where only episodic, scalar annotations are available. The established regret bounds and algorithmic frameworks provide a foundation for RLHF procedures constrained by limited feedback fidelity (Efroni et al., 2020).

References

  • Efroni, Y., Merlis, N., and Mannor, S. (2020). Reinforcement Learning with Trajectory Feedback.
