Relative error comparison: Monte Carlo estimation of behavioral temporal distances vs. compounding TD errors in Q-learning
Determine whether, in goal-conditioned reinforcement learning and temporal distance estimation, the estimation error that arises from learning behavioral successor distances under the dataset policy with Monte Carlo contrastive methods is larger or smaller than the error introduced by the compounding bootstrap errors of temporal-difference (TD) learning when Q-learning is used to estimate optimal successor distances, particularly in offline settings.
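To make the comparison concrete, one hedged formalization of the two estimation targets is sketched below; the notation (gamma, beta, d^beta, d^*) is introduced here for illustration and is not taken from the original statement, and the exact definitions used in the literature may differ. The Monte Carlo contrastive approach estimates a distance derived from the discounted state-occupancy measure of the dataset policy beta directly from sampled future states, whereas Q-learning targets the optimal distance as the fixed point of a bootstrapped backup.

\[
p^{\beta}_{\gamma}(g \mid s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr\nolimits^{\beta}\!\left(s_{t} = g \mid s_{0} = s\right),
\qquad
d^{\beta}(s, g) \propto -\log p^{\beta}_{\gamma}(g \mid s),
\]

\[
d^{*}(s, g) =
\begin{cases}
0 & \text{if } s = g,\\[2pt]
\min_{a}\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[\, 1 + d^{*}(s', g) \,\right] & \text{otherwise.}
\end{cases}
\]

Under this sketch, a Monte Carlo contrastive estimator regresses toward \(p^{\beta}_{\gamma}\) using sampled future states, so its error is largely statistical and does not bootstrap, while a Q-learning estimator regresses toward its own current estimate of \(d^{*}\), so approximation errors can propagate through the recursion; the question is which source of error dominates in practice, especially offline.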
References
Although Monte Carlo methods do not estimate optimal temporal distances, they often outperform their Q-learning counterparts, suggesting that it is, at the very least, unclear whether the errors from learning the behavioral (rather than optimal) temporal distance are larger or smaller than those introduced by TD learning's compounding errors.