Design of the TDClip mean estimator

Investigate and identify better design choices for the estimator of the average absolute temporal-difference (TD) error over the replay memory used by TDClip in Predictive Prioritized Experience Replay (PPER). This includes assessing alternatives to ordinary importance sampling and determining which estimator properties yield better clipping thresholds and greater stability under non-stationary training dynamics.

Background

TDClip is introduced as a statistical clipping mechanism for TD errors in Predictive Prioritized Experience Replay (PPER). It relies on an online estimate of the mean absolute TD error across the replay memory to set adaptive lower and upper clipping thresholds. The paper presents an update rule for this estimator and discusses its role in stabilizing priority distributions.
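The paper's exact update rule is not reproduced in this excerpt, so the following is only a minimal sketch of the mechanism described above: an online estimate of the mean absolute TD error (here an exponential moving average, an assumed stand-in for the paper's estimator) sets adaptive lower and upper clipping thresholds. The multipliers `c_lo` and `c_hi` are hypothetical parameters introduced for illustration, not values from the paper.

```python
import numpy as np

def update_mean(mu, abs_td, alpha=0.01):
    """EMA update of the mean absolute TD error over the replay memory.
    A stand-in for the paper's estimator update rule (not reproduced here)."""
    return (1.0 - alpha) * mu + alpha * abs_td

def td_clip(delta, mu, c_lo=0.1, c_hi=10.0):
    """Clip a TD error's magnitude to [c_lo * mu, c_hi * mu], keeping its sign.
    c_lo and c_hi are hypothetical threshold multipliers, not the paper's values."""
    lo, hi = c_lo * mu, c_hi * mu
    return np.sign(delta) * np.clip(abs(delta), lo, hi)

# Usage: maintain mu online as TD errors arrive from sampled transitions.
mu = 1.0
for delta in [0.5, 3.0, 0.01, 40.0]:
    clipped = td_clip(delta, mu)   # priority computed from clipped error
    mu = update_mean(mu, abs(delta))
```

Because both thresholds scale with the current mean estimate, the clipping range adapts as the TD-error distribution drifts during training, which is why the quality of the mean estimator matters.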

In the appendix, the authors note that the estimator currently uses ordinary importance sampling (OIS) but could instead use weighted importance sampling (WIS). While simple tests on stationary distributions suggested that OIS performed better than WIS without increasing variance, the authors explicitly state that the choice of estimator remains open and warrants further investigation.
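To make the OIS/WIS distinction concrete, the sketch below contrasts the two estimators on a generic importance-sampling problem. This is illustrative only, not the paper's estimator: the distributions, sample size, and weight construction are assumptions; in PER the weights would come from the inverse sampling priorities.

```python
import numpy as np

def ois_mean(values, weights):
    """Ordinary importance sampling: unbiased, but variance grows with the weights."""
    return np.mean(weights * values)

def wis_mean(values, weights):
    """Weighted importance sampling: normalizes by the weight sum; biased for
    finite samples, but typically lower variance."""
    return np.sum(weights * values) / np.sum(weights)

# Illustrative check: estimate E_p[|x|] for p = N(0, 1) from samples
# drawn under a wider proposal q = N(0, 2).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=10_000)
p = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)        # density of N(0, 1)
q = np.exp(-x**2 / 8.0) / (2.0 * np.sqrt(2.0 * np.pi))  # density of N(0, 2)
w = p / q
print(ois_mean(np.abs(x), w), wis_mean(np.abs(x), w))
# both estimates approach E_p[|x|] = sqrt(2/pi) ≈ 0.798
```

The open question raised by the authors is which trade-off (OIS's unbiasedness versus WIS's variance reduction) better serves TDClip's thresholds when the TD-error distribution is non-stationary.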

References

Note that each ${\hat \mu}_n$ is obtained with ordinary importance sampling (OIS) technique \citep{IS2014}, which can be replaced by weighted importance sampling (WIS). However, our simple tests with the estimator (\Eqref{eq:appendix:AdaClip update rule}) on stationary distributions suggested using OIS rather than WIS since the former performed better than the latter in those tests, without increasing the variance. Of course, it is open to investigate more on better design choices of the estimate ${\hat \mu}_n$ for TDClip.

Predictive PER: Balancing Priority and Diversity towards Stable Deep Reinforcement Learning  (2011.13093 - Lee et al., 2020) in Appendix, Section "Details on TDClip", Subsection "Estimating the Average Absolute TD-error in the replay memory"