Design of the TDClip mean estimator
Investigate better design choices for the estimator of the average absolute temporal-difference error over the replay memory, which TDClip uses in Predictive Prioritized Experience Replay (PPER). This includes assessing alternatives to ordinary importance sampling and determining which estimator properties improve clipping thresholds and stability under non-stationary training dynamics.
References
Note that each ${\hat \mu}_n$ is obtained with the ordinary importance sampling (OIS) technique \citep{IS2014}, which can be replaced by weighted importance sampling (WIS). However, our simple tests of the estimator (\Eqref{eq:appendix:AdaClip update rule}) on stationary distributions favored OIS: it performed better than WIS in those tests, without increasing the variance. Better design choices for the estimate ${\hat \mu}_n$ used by TDClip remain open for further investigation.
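
For concreteness, the two generic estimators being compared can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the update rule itself: a replay memory of size $N$, a mini-batch of $k$ indices $i_1, \dots, i_k$ drawn independently with sampling probabilities $P(i) > 0$, absolute TD errors $|\delta_i|$, and a uniform target distribution over the memory.
\begin{align*}
w_j &= \frac{1}{N \, P(i_j)}, \\
{\hat \mu}^{\mathrm{OIS}} &= \frac{1}{k} \sum_{j=1}^{k} w_j \, |\delta_{i_j}|, \\
{\hat \mu}^{\mathrm{WIS}} &= \frac{\sum_{j=1}^{k} w_j \, |\delta_{i_j}|}{\sum_{j=1}^{k} w_j}.
\end{align*}
Under these assumptions, with $P$ and $\delta$ held fixed, the OIS form is unbiased for $\frac{1}{N}\sum_{i=1}^{N} |\delta_i|$, since $\mathbb{E}\left[ w_j \, |\delta_{i_j}| \right] = \sum_{i=1}^{N} P(i) \, \frac{|\delta_i|}{N \, P(i)}$. The WIS form is biased for finite $k$ but consistent, and it often, though not always, has lower variance.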