TD-Error Occupancy Ratios in RL
- The paper derives TD-error-based occupancy ratios from the dual of an f-divergence regularized reinforcement learning objective, linking Bellman residuals to sample prioritization.
- It employs convex duality to transform the optimization problem and yields closed-form expressions—exponential for KL divergence and linear for Pearson chi-squared divergence.
- Empirical evaluations with the ROER algorithm in continuous control settings showcase improved sample efficiency and robust transfer in challenging environments.
TD-error-based occupancy ratios formalize the connection between temporal-difference (TD) error and sample prioritization within experience replay, grounded in the duality of regularized reinforcement learning (RL) objectives. These ratios arise naturally from considering the optimal transformation of a replay buffer's off-policy data distribution into an on-policy occupancy measure under $f$-divergence regularization. The resulting prioritization schemes are analytically derived, parameterize sample importance via nonlinear functions of the Bellman TD residual, and provide a principled alternative to the heuristic prioritization used in prioritized experience replay (PER) (Li et al., 2024).
1. The $f$-divergence-regularized RL Objective
The foundation is a max-return objective penalized by an $f$-divergence between the occupancy distribution $d$ and the replay buffer distribution $d_O$:

$$\max_{d} \; \mathbb{E}_{(s,a)\sim d}[r(s,a)] - \alpha\, D_f(d \,\|\, d_O),$$

where $D_f(d \,\|\, d_O) = \mathbb{E}_{(s,a)\sim d_O}\big[f\big(\tfrac{d(s,a)}{d_O(s,a)}\big)\big]$ and $\alpha > 0$ is a temperature parameter. This formulation encourages the learned occupancy measure $d$ to remain close to $d_O$ when $\alpha$ is large, and to recover the unconstrained RL dual in the $\alpha \to 0$ regime.
The $f$-divergence penalty serves as a regularizer controlling the tradeoff between leveraging off-policy data and adhering to the desired on-policy distribution.
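As a concrete illustration, the penalty $D_f(d \,\|\, d_O) = \mathbb{E}_{d_O}[f(d/d_O)]$ can be evaluated directly for discrete distributions. The sketch below (with hypothetical toy distributions over four state-action pairs) computes it for the KL and Pearson chi-squared generators:

```python
import numpy as np

def f_divergence(d, d_o, f):
    """D_f(d || d_o) = E_{d_o}[f(d / d_o)] for two discrete
    distributions on the same finite support."""
    ratio = d / d_o
    return float(np.sum(d_o * f(ratio)))

# Generator for the KL divergence: f(x) = x log x
f_kl = lambda x: x * np.log(x)
# Generator for the Pearson chi-squared divergence: f(x) = (x - 1)^2
f_chi2 = lambda x: (x - 1.0) ** 2

# Hypothetical occupancy vs. replay-buffer distribution
d   = np.array([0.4, 0.3, 0.2, 0.1])
d_o = np.array([0.25, 0.25, 0.25, 0.25])

print(f_divergence(d, d_o, f_kl))    # positive, 0 iff d == d_o
print(f_divergence(d_o, d_o, f_kl))  # 0.0
```

Both divergences vanish exactly when the occupancy matches the buffer distribution, which is why a large $\alpha$ keeps the learned occupancy near $d_O$.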
2. Convex Dual and the TD-based Change of Variables
Employing the Fenchel duality principle, the regularized objective is transformed via the Fenchel–Young identity

$$f(x) = \sup_{y}\,\big(xy - f^*(y)\big),$$

where $f^*$ is the convex conjugate of $f$. The dualized form yields an optimization objective over $Q$-functions rather than over the unknown $d$. A critical Bellman change of variables enforces consistency between $d$ and the $Q$-function:

$$\delta(s,a) = \mathcal{T}Q(s,a) - Q(s,a),$$

where $\mathcal{T}$ is the Bellman operator and $\delta(s,a)$ is the one-step Bellman residual. This direct parameterization ties sample importance to the deviation from Bellman consistency.
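The residual $\delta(s,a)$ itself is cheap to compute from a single transition. The sketch below uses a tabular $Q$ and a greedy Bellman operator purely for illustration (ROER's actual operator is the soft, SAC-style one):

```python
import numpy as np

def td_residual(Q, s, a, r, s_next, gamma, done):
    """One-step Bellman residual delta = (T Q)(s, a) - Q(s, a),
    illustrated with a greedy Bellman operator on a tabular Q."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    return target - Q[s, a]

# Tiny tabular example: 2 states, 2 actions (hypothetical values)
Q = np.array([[1.0, 0.5],
              [0.2, 0.8]])
delta = td_residual(Q, s=0, a=0, r=1.0, s_next=1, gamma=0.9, done=False)
print(delta)  # r + gamma * max_a' Q(s', a') - Q(s, a), approximately 0.72
```

A positive residual indicates the current estimate $Q(s,a)$ undershoots its Bellman target; the magnitude of this deviation is exactly what the occupancy ratios below transform into sample weights.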
3. Closed-form Occupancy Ratios via Fenchel Conjugate
At the saddle point of the dual form, the optimal density (occupancy) ratio satisfies

$$\frac{d^*(s,a)}{d_O(s,a)} = (f^*)'\!\left(\frac{\delta(s,a)}{\alpha}\right),$$

where $(f^*)'$ is the derivative of the convex conjugate.
| Divergence | Generator $f(x)$ | Occupancy ratio $d^*/d_O$ |
|---|---|---|
| KL | $x \log x$ | $\propto \exp\big(\delta(s,a)/\alpha\big)$ |
| Pearson $\chi^2$ | $(x-1)^2$ | $1 + \delta(s,a)/(2\alpha)$ |
For the KL divergence, the occupancy ratio is an exponential function of the TD error; for the Pearson $\chi^2$ divergence, a linear relationship is recovered, similar to PER-style priorities.
4. Emergence of TD-error in Sampling Weights
The TD residual appears as the key change of variables that eliminates the direct, infeasible optimization over $d$ and enforces the critic's Bellman consistency. At stationarity in the dual,

$$\frac{d^*(s,a)}{d_O(s,a)} = (f^*)'\!\left(\frac{\mathcal{T}Q(s,a) - Q(s,a)}{\alpha}\right),$$

demonstrating that the occupancy ratio is determined by the one-step Bellman residual. This identification analytically grounds the empirical observation that TD-error magnitudes indicate where off-policy distributions diverge most from the optimal on-policy occupancy.
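The correspondence between the conjugates and the closed-form ratios can be checked numerically. The snippet below verifies by central finite differences that $(f^*)'$ matches the stated derivatives for the standard conjugates of $x\log x$ and $(x-1)^2$:

```python
import numpy as np

# Standard convex conjugates of the two generators:
#   f(x) = x log x     ->  f*(y) = exp(y - 1)
#   f(x) = (x - 1)^2   ->  f*(y) = y + y^2 / 4
conj = {
    "kl":   lambda y: np.exp(y - 1.0),
    "chi2": lambda y: y + y ** 2 / 4.0,
}
# Closed-form derivatives (f*)' corresponding to the occupancy ratios
grad = {
    "kl":   lambda y: np.exp(y - 1.0),   # exponential weighting
    "chi2": lambda y: 1.0 + y / 2.0,     # linear weighting
}

h = 1e-6
for name in conj:
    for y in [-0.5, 0.0, 1.3]:
        # central difference approximation of d f*(y) / dy
        numeric = (conj[name](y + h) - conj[name](y - h)) / (2 * h)
        assert abs(numeric - grad[name](y)) < 1e-5, (name, y)
print("central-difference check passed")
```

This confirms the mechanism: evaluating $(f^*)'$ at the scaled residual $\delta/\alpha$ reproduces the exponential and linear weightings of the previous section.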
5. Prioritization Schemes and the ROER Algorithm
TD-error-based occupancy ratios induce a principled sample-resampling scheme for experience replay. For the KL divergence, the priority of sample $i$ is moved toward the exponential target with update rate $\eta$:

$$p_i \leftarrow (1-\eta)\,p_i + \eta \exp\!\left(\frac{\delta_i}{\alpha}\right).$$

Samples are drawn with probability proportional to $p_i$ for learning, and priorities are incrementally tracked to reflect changes in $\delta_i$.
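A minimal sketch of this update, assuming a simple exponential-smoothing scheme toward the KL target (the exact smoothing and any clipping used by ROER may differ):

```python
import numpy as np

def update_priorities(priorities, deltas, alpha, eta):
    """Move stored priorities toward the KL target exp(delta / alpha)
    with update rate eta (eta = 1 overwrites; small eta smooths).
    The smoothing scheme here is an illustrative assumption."""
    target = np.exp(deltas / alpha)
    return (1.0 - eta) * priorities + eta * target

def sampling_probs(priorities):
    """Samples are drawn proportionally to their priority."""
    return priorities / priorities.sum()

p = np.ones(4)                             # fresh buffer: uniform priorities
deltas = np.array([0.5, -0.5, 0.0, 2.0])   # hypothetical TD residuals
p = update_priorities(p, deltas, alpha=1.0, eta=0.1)
print(sampling_probs(p))  # largest-residual sample gets the most mass
```

A zero residual leaves its priority at the neutral value 1, so only samples whose Bellman consistency changes get re-weighted.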
The Regularized Optimal Experience Replay (ROER) pipeline consists of:
- A value network that estimates regularized TD-errors,
- A critic trained on weighted Bellman error losses,
- Sample priorities shaped by the closed-form occupancy ratios,
- A policy updated as in SAC,
- A replay buffer storing both transition tuples and their evolving priorities.
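Putting the buffer-side pieces together, a minimal (hypothetical) prioritized buffer in the spirit of this pipeline might look like the following; the networks, clipping, and optimizer details of the real ROER implementation are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

class ROERBuffer:
    """Sketch of a replay buffer with occupancy-ratio priorities."""

    def __init__(self, capacity, alpha=1.0, eta=0.1):
        self.capacity, self.alpha, self.eta = capacity, alpha, eta
        self.transitions, self.priorities = [], []

    def store(self, transition):
        self.transitions.append(transition)
        self.priorities.append(1.0)          # new samples start at ratio 1
        if len(self.transitions) > self.capacity:
            self.transitions.pop(0)          # drop oldest when full
            self.priorities.pop(0)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                  # draw proportionally to priority
        idx = rng.choice(len(p), size=batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update(self, idx, deltas):
        """Track the KL targets exp(delta / alpha) with update rate eta."""
        for i, d in zip(idx, deltas):
            target = np.exp(d / self.alpha)
            self.priorities[i] = (1 - self.eta) * self.priorities[i] + self.eta * target

buf = ROERBuffer(capacity=100)
for t in range(10):                          # placeholder transitions
    buf.store(("s%d" % t, "a", 0.0, "s%d" % (t + 1)))
idx, batch = buf.sample(4)
buf.update(idx, deltas=rng.normal(size=4))
```

In a full agent, the critic's weighted Bellman loss would supply the `deltas`, and the sampled batch would feed the SAC-style policy and value updates.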
6. Empirical Performance and Relevance
ROER, instantiated with KL-divergence weighting, was evaluated with the Soft Actor-Critic (SAC) algorithm in continuous control settings (MuJoCo, DM Control). The scheme outperformed baselines in 6/11 tasks and remained competitive in the others. Notably, during offline-to-online fine-tuning, ROER achieved superior results in the challenging Antmaze environment, where baseline methods failed (Li et al., 2024).
A plausible implication is that TD-error-based occupancy ratios yield analytically grounded prioritizations that adapt more flexibly as the agent’s policy and value networks evolve, improving sample efficiency and facilitating robust transfer in difficult tasks.
7. Summary and Significance
TD-error-based occupancy ratios provide a mathematically principled foundation for sample prioritization in experience replay, deriving the central role of the TD residual from the dual of a regularized RL objective. This approach unifies density-ratio correction, prioritization, and off-policy optimization through convex-analytic duality and provides closed-form weighting schemes directly linked to Bellman error. Such connections extend and clarify the empirical success of PER while furnishing robust, theoretically justified alternatives that improve RL performance across diverse domains (Li et al., 2024).