TD-Error Occupancy Ratios in RL
- The paper derives TD-error-based occupancy ratios from the dual of an f-divergence regularized reinforcement learning objective, linking Bellman residuals to sample prioritization.
- It employs convex duality to transform the optimization problem and yields closed-form expressions—exponential for KL divergence and linear for Pearson chi-squared divergence.
- Empirical evaluations with the ROER algorithm in continuous control settings showcase improved sample efficiency and robust transfer in challenging environments.
TD-error-based occupancy ratios formalize the connection between temporal-difference (TD) error and sample prioritization within experience replay, grounded in the duality of regularized reinforcement learning (RL) objectives. These ratios arise naturally from considering the optimal transformation of a replay buffer's off-policy data distribution into an on-policy occupancy measure under $f$-divergence regularization. The resulting prioritization schemes are analytically derived, parameterize sample importance via nonlinear functions of the Bellman TD residual, and provide a principled alternative to the heuristic prioritization used in prioritized experience replay (PER) (Li et al., 2024).
1. The $f$-divergence-regularized RL Objective
The foundation is a max-return objective penalized by an $f$-divergence between the occupancy distribution $d$ and the replay buffer distribution $d_O$:

$$\max_{d} \; \mathbb{E}_{(s,a)\sim d}[r(s,a)] - \alpha\, D_f(d \,\|\, d_O),$$

where $D_f(d \,\|\, d_O) = \mathbb{E}_{(s,a)\sim d_O}\big[f\big(\tfrac{d(s,a)}{d_O(s,a)}\big)\big]$ and $\alpha > 0$ is a temperature parameter. This formulation encourages the learned occupancy measure $d$ to remain close to $d_O$ when $\alpha$ is large, and to recover the unconstrained RL dual in the $\alpha \to 0$ regime.
The $f$-divergence penalty serves as a regularizer controlling the tradeoff between leveraging off-policy data and adhering to the desired on-policy distribution.
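As a concrete illustration, the penalty $D_f(d \,\|\, d_O) = \mathbb{E}_{d_O}[f(d/d_O)]$ can be evaluated directly for discrete distributions. The sketch below (with hypothetical toy distributions over four state-action pairs) computes it for the KL and Pearson chi-squared generators:

```python
import numpy as np

def f_divergence(d, d_o, f):
    """D_f(d || d_o) = E_{d_o}[f(d / d_o)] for two discrete
    distributions on the same finite support."""
    ratio = d / d_o
    return float(np.sum(d_o * f(ratio)))

# Generator for the KL divergence: f(x) = x log x
f_kl = lambda x: x * np.log(x)
# Generator for the Pearson chi-squared divergence: f(x) = (x - 1)^2
f_chi2 = lambda x: (x - 1.0) ** 2

# Hypothetical occupancy vs. replay-buffer distribution
d   = np.array([0.4, 0.3, 0.2, 0.1])
d_o = np.array([0.25, 0.25, 0.25, 0.25])

print(f_divergence(d, d_o, f_kl))    # positive, 0 iff d == d_o
print(f_divergence(d_o, d_o, f_kl))  # 0.0
```

Both divergences vanish exactly when the occupancy matches the buffer distribution, which is why a large $\alpha$ keeps the learned occupancy near $d_O$.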
2. Convex Dual and the TD-based Change of Variables
Employing the Fenchel duality principle, the regularized objective is transformed via the Fenchel–Young identity

$$f(x) = \sup_{y}\,\big(xy - f^*(y)\big),$$

where $f^*$ is the convex conjugate of $f$. The dualized form yields an optimization objective over $Q$-functions rather than over the unknown $d$. A critical Bellman change of variables enforces consistency between $d$ and the $Q$-function:

$$\delta(s,a) = \mathcal{T}Q(s,a) - Q(s,a),$$

where $\mathcal{T}$ is the Bellman operator and $\delta(s,a)$ is the one-step Bellman residual. This direct parameterization ties sample importance to the deviation from Bellman consistency.
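The residual $\delta(s,a)$ itself is cheap to compute from a single transition. The sketch below uses a tabular $Q$ and a greedy Bellman operator purely for illustration (ROER's actual operator is the soft, SAC-style one):

```python
import numpy as np

def td_residual(Q, s, a, r, s_next, gamma, done):
    """One-step Bellman residual delta = (T Q)(s, a) - Q(s, a),
    illustrated with a greedy Bellman operator on a tabular Q."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    return target - Q[s, a]

# Tiny tabular example: 2 states, 2 actions (hypothetical values)
Q = np.array([[1.0, 0.5],
              [0.2, 0.8]])
delta = td_residual(Q, s=0, a=0, r=1.0, s_next=1, gamma=0.9, done=False)
print(delta)  # r + gamma * max_a' Q(s', a') - Q(s, a), approximately 0.72
```

A positive residual indicates the current estimate $Q(s,a)$ undershoots its Bellman target; the magnitude of this deviation is exactly what the occupancy ratios below transform into sample weights.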
3. Closed-form Occupancy Ratios via Fenchel Conjugate
At the saddle point of the dual form, the optimal density (occupancy) ratio satisfies

$$\frac{d^*(s,a)}{d_O(s,a)} = (f^*)'\!\left(\frac{\delta(s,a)}{\alpha}\right),$$

where $(f^*)'$ is the derivative of the convex conjugate.
| Divergence | Generator $f(x)$ | Occupancy ratio $d^*/d_O$ |
|---|---|---|
| KL | $x \log x$ | $\propto \exp\big(\delta(s,a)/\alpha\big)$ |
| Pearson $\chi^2$ | $(x-1)^2$ | $1 + \delta(s,a)/(2\alpha)$ |
For the KL divergence, the occupancy ratio is an exponential function of the TD error; for the Pearson $\chi^2$ divergence, a linear relationship is recovered, similar to PER-style priorities.
4. Emergence of TD-error in Sampling Weights
The TD residual appears as the key change of variables that eliminates the direct, infeasible optimization over $d$ and enforces the critic's Bellman consistency. At stationarity in the dual,

$$\frac{d^*(s,a)}{d_O(s,a)} = (f^*)'\!\left(\frac{\mathcal{T}Q(s,a) - Q(s,a)}{\alpha}\right),$$

demonstrating that the occupancy ratio is determined by the one-step Bellman residual. This identification analytically grounds the empirical observation that TD-error magnitudes indicate where off-policy distributions diverge most from the optimal on-policy occupancy.
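The correspondence between the conjugates and the closed-form ratios can be checked numerically. The snippet below verifies by central finite differences that $(f^*)'$ matches the stated derivatives for the standard conjugates of $x\log x$ and $(x-1)^2$:

```python
import numpy as np

# Standard convex conjugates of the two generators:
#   f(x) = x log x     ->  f*(y) = exp(y - 1)
#   f(x) = (x - 1)^2   ->  f*(y) = y + y^2 / 4
conj = {
    "kl":   lambda y: np.exp(y - 1.0),
    "chi2": lambda y: y + y ** 2 / 4.0,
}
# Closed-form derivatives (f*)' corresponding to the occupancy ratios
grad = {
    "kl":   lambda y: np.exp(y - 1.0),   # exponential weighting
    "chi2": lambda y: 1.0 + y / 2.0,     # linear weighting
}

h = 1e-6
for name in conj:
    for y in [-0.5, 0.0, 1.3]:
        # central difference approximation of d f*(y) / dy
        numeric = (conj[name](y + h) - conj[name](y - h)) / (2 * h)
        assert abs(numeric - grad[name](y)) < 1e-5, (name, y)
print("central-difference check passed")
```

This confirms the mechanism: evaluating $(f^*)'$ at the scaled residual $\delta/\alpha$ reproduces the exponential and linear weightings of the previous section.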
5. Prioritization Schemes and the ROER Algorithm
TD-error-based occupancy ratios induce a principled sample-resampling scheme for experience replay. For the KL divergence, the priority of sample $i$ is moved toward the exponential target with update rate $\eta$:

$$p_i \leftarrow (1-\eta)\,p_i + \eta \exp\!\left(\frac{\delta_i}{\alpha}\right).$$

Samples are drawn with probability proportional to $p_i$ for learning, and priorities are incrementally tracked to reflect changes in $\delta_i$.
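A minimal sketch of this update, assuming a simple exponential-smoothing scheme toward the KL target (the exact smoothing and any clipping used by ROER may differ):

```python
import numpy as np

def update_priorities(priorities, deltas, alpha, eta):
    """Move stored priorities toward the KL target exp(delta / alpha)
    with update rate eta (eta = 1 overwrites; small eta smooths).
    The smoothing scheme here is an illustrative assumption."""
    target = np.exp(deltas / alpha)
    return (1.0 - eta) * priorities + eta * target

def sampling_probs(priorities):
    """Samples are drawn proportionally to their priority."""
    return priorities / priorities.sum()

p = np.ones(4)                             # fresh buffer: uniform priorities
deltas = np.array([0.5, -0.5, 0.0, 2.0])   # hypothetical TD residuals
p = update_priorities(p, deltas, alpha=1.0, eta=0.1)
print(sampling_probs(p))  # largest-residual sample gets the most mass
```

A zero residual leaves its priority at the neutral value 1, so only samples whose Bellman consistency changes get re-weighted.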
The Regularized Optimal Experience Replay (ROER) pipeline consists of:
- A value network that estimates regularized TD-errors,
- A critic trained on weighted Bellman error losses,
- Sample priorities shaped by the closed-form occupancy ratios,
- A policy updated as in SAC,
- A replay buffer storing both transition tuples and their evolving priorities.
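Putting the buffer-side pieces together, a minimal (hypothetical) prioritized buffer in the spirit of this pipeline might look like the following; the networks, clipping, and optimizer details of the real ROER implementation are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

class ROERBuffer:
    """Sketch of a replay buffer with occupancy-ratio priorities."""

    def __init__(self, capacity, alpha=1.0, eta=0.1):
        self.capacity, self.alpha, self.eta = capacity, alpha, eta
        self.transitions, self.priorities = [], []

    def store(self, transition):
        self.transitions.append(transition)
        self.priorities.append(1.0)          # new samples start at ratio 1
        if len(self.transitions) > self.capacity:
            self.transitions.pop(0)          # drop oldest when full
            self.priorities.pop(0)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                  # draw proportionally to priority
        idx = rng.choice(len(p), size=batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update(self, idx, deltas):
        """Track the KL targets exp(delta / alpha) with update rate eta."""
        for i, d in zip(idx, deltas):
            target = np.exp(d / self.alpha)
            self.priorities[i] = (1 - self.eta) * self.priorities[i] + self.eta * target

buf = ROERBuffer(capacity=100)
for t in range(10):                          # placeholder transitions
    buf.store(("s%d" % t, "a", 0.0, "s%d" % (t + 1)))
idx, batch = buf.sample(4)
buf.update(idx, deltas=rng.normal(size=4))
```

In a full agent, the critic's weighted Bellman loss would supply the `deltas`, and the sampled batch would feed the SAC-style policy and value updates.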
6. Empirical Performance and Relevance
ROER, instantiated with KL-divergence weighting, was evaluated with the Soft Actor-Critic (SAC) algorithm in continuous control settings (MuJoCo, DM Control). The scheme outperformed baselines in 6/11 tasks and remained competitive in the others. Notably, during offline-to-online fine-tuning, ROER achieved superior results in the challenging Antmaze environment, where baseline methods failed (Li et al., 2024).
A plausible implication is that TD-error-based occupancy ratios yield analytically grounded prioritizations that adapt more flexibly as the agent’s policy and value networks evolve, improving sample efficiency and facilitating robust transfer in difficult tasks.
7. Summary and Significance
TD-error-based occupancy ratios provide a mathematically principled foundation for sample prioritization in experience replay, deriving the central role of the TD residual from the dual of a regularized RL objective. This approach unifies density-ratio correction, prioritization, and off-policy optimization through convex-analytic duality and provides closed-form weighting schemes directly linked to Bellman error. Such connections extend and clarify the empirical success of PER while furnishing robust, theoretically justified alternatives that improve RL performance across diverse domains (Li et al., 2024).