
TD-Error Occupancy Ratios in RL

Updated 7 January 2026
  • The paper derives TD-error-based occupancy ratios from the dual of an f-divergence regularized reinforcement learning objective, linking Bellman residuals to sample prioritization.
  • It employs convex duality to transform the optimization problem and yields closed-form expressions—exponential for KL divergence and linear for Pearson chi-squared divergence.
  • Empirical evaluations with the ROER algorithm in continuous control settings demonstrate improved sample efficiency and robust transfer in challenging environments.

TD-error-based occupancy ratios formalize the connection between temporal-difference (TD) error and sample prioritization within experience replay, grounded in the duality of regularized reinforcement learning (RL) objectives. These ratios arise naturally from considering the optimal transformation of a replay buffer's off-policy data distribution into an on-policy occupancy measure under $f$-divergence regularization. The resulting prioritization schemes are analytically derived, parameterizing sample importance via nonlinear functions of the Bellman TD residual, and provide a principled alternative to the heuristic prioritization found in prioritized experience replay (PER) (Li et al., 2024).

1. The $f$-divergence-regularized RL Objective

The foundation is a max-return objective penalized by an $f$-divergence between the occupancy distribution $d$ and the replay buffer distribution $d^{\mathcal D}$:

$$\max_{d \geq 0,\ \sum_{s,a} d(s,a) = 1} \left\{ \mathbb{E}_{(s,a)\sim d}[r(s,a)] - \beta\, D_f(d \,\|\, d^{\mathcal D}) \right\},$$

where $\beta > 0$ and $D_f(d \,\|\, d^{\mathcal D}) = \mathbb{E}_{(s,a)\sim d^{\mathcal D}}\left[f\!\left(d(s,a)/d^{\mathcal D}(s,a)\right)\right]$. This formulation encourages the learned occupancy measure $d^*$ to remain close to $d^{\mathcal D}$ when $\beta$ is large, and recovers the unregularized RL objective in the $\beta \to 0$ limit.

The $f$-divergence penalty serves as a regularizer controlling the tradeoff between leveraging off-policy data and adhering to the desired on-policy distribution.
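As a concrete illustration, the penalty term can be evaluated numerically. The sketch below uses hypothetical helper names and assumes a finite state-action space, so the expectation over $d^{\mathcal D}$ reduces to a weighted sum:

```python
import numpy as np

# Generators for the two divergences discussed in this article:
f_kl = lambda w: w * np.log(w)            # KL:           f(w) = w log w
f_chi2 = lambda w: 0.5 * (w - 1.0) ** 2   # Pearson chi2: f(w) = (1/2)(w - 1)^2

def f_divergence(d, d_buffer, f):
    """D_f(d || d^D) = E_{(s,a) ~ d^D}[ f(d/d^D) ] over a finite (s,a) set."""
    w = d / d_buffer  # density ratio d(s,a) / d^D(s,a)
    return float(np.sum(d_buffer * f(w)))

# Toy example: two distributions over four state-action pairs.
d = np.array([0.4, 0.3, 0.2, 0.1])
d_buffer = np.array([0.25, 0.25, 0.25, 0.25])

print(f_divergence(d, d_buffer, f_kl))         # > 0 since d != d^D
print(f_divergence(d_buffer, d_buffer, f_kl))  # 0.0: divergence vanishes at d = d^D
```

The penalty vanishes exactly when $d = d^{\mathcal D}$ and grows as the two distributions separate, which is the tradeoff the regularizer controls.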

2. Convex Dual and the TD-based Change of Variables

Employing Fenchel duality, the regularized objective is transformed via the Fenchel–Young identity

$$f(w) = \sup_{x} \{ wx - f_*(x) \},$$

where $f_*$ is the convex conjugate of $f$. The dualized form yields an optimization objective over $Q$-functions rather than over the unknown $d^*$. A critical Bellman change of variables enforces consistency between $x$ and the $Q$-function:

$$x(s,a) = \frac{\mathcal B^* Q(s,a) - Q(s,a)}{\beta} = \frac{\delta_Q(s,a)}{\beta},$$

where $\mathcal B^* Q$ is the Bellman operator applied to $Q$ and $\delta_Q$ is the one-step Bellman residual. This direct parameterization ties sample importance to the deviation from Bellman consistency.
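The Fenchel–Young identity can be checked numerically for the Pearson $\chi^2$ pair tabulated in Section 3. The grid search below is a toy verification, not part of the ROER algorithm; it confirms that $\sup_x \{wx - f_*(x)\}$ recovers $f(w)$:

```python
import numpy as np

# Pearson chi^2 generator and its convex conjugate:
# f(w) = (1/2)(w - 1)^2,  f_*(x) = (1/2)x^2 + x.
f = lambda w: 0.5 * (w - 1.0) ** 2
f_star = lambda x: 0.5 * x ** 2 + x

def sup_over_grid(w, xs):
    """Approximate sup_x { w*x - f_*(x) } on a grid of candidate x values."""
    return float(np.max(w * xs - f_star(xs)))

# The supremum recovers f(w), i.e. f = (f_*)_* for this conjugate pair.
xs = np.linspace(-5.0, 5.0, 100001)
print(sup_over_grid(2.0, xs), f(2.0))  # both approximately 0.5
```

The maximizer sits at $x = w - 1$, which is exactly the point where $f_*'(x) = x + 1$ returns the ratio $w$; this is the mechanism the saddle-point argument in Section 3 exploits.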

3. Closed-form Occupancy Ratios via Fenchel Conjugate

At the saddle point of the dual form, the optimal density (occupancy) ratio satisfies

$$r^*(s,a) = \frac{d^*(s,a)}{d^{\mathcal D}(s,a)} = f_*'\!\left( \frac{\delta_{Q^*}(s,a)}{\beta} \right),$$

where $f_*'$ is the derivative of the convex conjugate.

| Divergence | $f(w)$ | $f_*(y)$ | $f_*'(y)$ | Occupancy ratio |
| --- | --- | --- | --- | --- |
| KL | $w \log w$ | $e^y - 1$ | $e^y$ | $r^*(s,a) = \exp(\delta/\beta)$ |
| Pearson $\chi^2$ | $\frac{1}{2}(w-1)^2$ | $\frac{1}{2}y^2 + y$ | $y + 1$ | $r^*(s,a) = 1 + \delta/\beta$ |

For the KL divergence, the occupancy ratio is an exponential function of the TD error; for Pearson $\chi^2$, the relationship is linear, closely resembling PER-style priorities.
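A minimal sketch of the two closed forms, assuming scalar TD errors and a fixed temperature $\beta$ (function names are illustrative; the zero-clip in the $\chi^2$ case is an added safeguard, not part of the derivation):

```python
import numpy as np

def occupancy_ratio_kl(td_error, beta):
    """KL case: r*(s,a) = exp(delta / beta)."""
    return np.exp(np.asarray(td_error) / beta)

def occupancy_ratio_chi2(td_error, beta):
    """Pearson chi^2 case: r*(s,a) = 1 + delta / beta.
    The clip at zero is an added safeguard (a density ratio cannot be
    negative); it is not part of the closed form itself."""
    return np.maximum(0.0, 1.0 + np.asarray(td_error) / beta)

deltas = np.array([-1.0, 0.0, 1.0, 2.0])
kl_ratios = occupancy_ratio_kl(deltas, beta=1.0)      # exponential in delta
chi2_ratios = occupancy_ratio_chi2(deltas, beta=1.0)  # linear, PER-like
```

Both forms equal 1 at $\delta = 0$ (the sample is neither up- nor down-weighted when Bellman consistency holds), but the KL ratio amplifies large TD errors far more aggressively than the linear $\chi^2$ ratio.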

4. Emergence of TD-error in Sampling Weights

The TD residual $\delta_Q$ appears as the key change of variables that eliminates the direct, infeasible optimization over $d^*$ and enforces the critic's Bellman consistency. At stationarity in the dual,

$$f_*'\bigl( x(s,a) \bigr) = \frac{d^*(s,a)}{d^{\mathcal D}(s,a)}, \qquad x(s,a) = \frac{\delta_Q(s,a)}{\beta},$$

demonstrating that the occupancy ratio is determined by the one-step Bellman residual. This identification analytically grounds the empirical observation that TD-error magnitudes indicate where off-policy distributions diverge most from the optimal on-policy occupancy.

5. Prioritization Schemes and the ROER Algorithm

TD-error-based occupancy ratios induce a principled sample-resampling scheme for experience replay. For the KL divergence, the sample priority update rule is

$$w_{\rm new}(s,a) = (1 - \lambda)\, w_{\rm old}(s,a) + \lambda \exp\!\left( \frac{\delta(s,a)}{\beta} \right),$$

with update rate $\lambda \in (0,1]$. Samples are drawn proportionally to $w(s,a)$ for learning, and priorities are incrementally tracked to reflect changes in $\delta$.
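The update and sampling steps can be sketched as follows (hypothetical helper names; the clip on the exponential target is an added numerical safeguard not specified in the source text):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_priorities(w_old, td_errors, beta=1.0, lam=0.1, clip=100.0):
    """Exponential-smoothing update toward the KL closed-form target:
    w_new = (1 - lam) * w_old + lam * exp(delta / beta).
    The clip on exp(.) guards against numerical blow-up for large TD errors."""
    target = np.minimum(np.exp(np.asarray(td_errors) / beta), clip)
    return (1.0 - lam) * np.asarray(w_old) + lam * target

def sample_batch(priorities, batch_size):
    """Draw transition indices with probability proportional to w(s,a)."""
    probs = priorities / priorities.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)

w = np.ones(5)  # start from uniform priorities
deltas = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
w = update_priorities(w, deltas)  # priorities grow with the TD error
idx = sample_batch(w, batch_size=3)
```

The smoothing rate $\lambda$ trades responsiveness against stability: small $\lambda$ keeps priorities from oscillating as $\delta$ estimates change between critic updates.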

The Regularized Optimal Experience Replay (ROER) pipeline consists of:

  • A value network $V_\phi$ to estimate regularized TD-errors,
  • A critic $Q_\theta$ trained on weighted Bellman error losses,
  • Sample priorities shaped by the closed-form occupancy ratios,
  • A policy $\pi_\psi$ updated as in SAC,
  • A replay buffer storing both transition tuples and their evolving priorities.
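The pipeline's buffer-side bookkeeping might be organized as in the following minimal sketch (a hypothetical, simplified structure; the actual ROER implementation also trains $V_\phi$, $Q_\theta$, and $\pi_\psi$, which are omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)

class ReplayBuffer:
    """Minimal buffer storing transitions alongside evolving priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.transitions = []
        self.priorities = []

    def add(self, transition):
        # New samples enter at a neutral priority of 1.0 (an assumed default).
        self.transitions.append(transition)
        self.priorities.append(1.0)
        if len(self.transitions) > self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size):
        # Draw indices with probability proportional to the priorities.
        p = np.asarray(self.priorities)
        idx = rng.choice(len(self.transitions), size=batch_size, p=p / p.sum())
        return idx, [self.transitions[i] for i in idx]

    def update(self, idx, td_errors, beta=1.0, lam=0.1):
        # KL closed form: smooth each priority toward exp(delta / beta).
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = (1 - lam) * self.priorities[i] + lam * np.exp(delta / beta)

buf = ReplayBuffer(capacity=10)
for t in range(5):
    buf.add(("obs", "action", "reward", "next_obs", t))
idx, batch = buf.sample(batch_size=3)
buf.update([0], [2.0])  # a large TD error raises that sample's priority
```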

6. Empirical Performance and Relevance

ROER, instantiated with KL-divergence weighting, was evaluated with the Soft Actor-Critic (SAC) algorithm in continuous control settings (MuJoCo, DM Control). The scheme outperformed baselines in 6/11 tasks and remained competitive in the others. Notably, during offline-to-online fine-tuning, ROER achieved superior results in the challenging Antmaze environment, where baseline methods failed (Li et al., 2024).

A plausible implication is that TD-error-based occupancy ratios yield analytically grounded prioritizations that adapt more flexibly as the agent’s policy and value networks evolve, improving sample efficiency and facilitating robust transfer in difficult tasks.

7. Summary and Significance

TD-error-based occupancy ratios provide a mathematically principled foundation for sample prioritization in experience replay, deriving the central role of the TD residual from the dual of a regularized RL objective. This approach unifies density-ratio correction, prioritization, and off-policy optimization through convex-analytic duality and provides closed-form weighting schemes directly linked to Bellman error. Such connections extend and clarify the empirical success of PER while furnishing robust, theoretically justified alternatives that improve RL performance across diverse domains (Li et al., 2024).
