Tradeoff in offline RL: do Q-learning’s compounding errors outweigh the benefits of learning behavioral values?
Ascertain whether, in offline reinforcement learning, the compounding errors inherent to temporal-difference Q-learning outweigh the benefits of learning the behavioral value function Q^β with 1-step (Monte Carlo) methods rather than the optimal value function Q^*. The tradeoff: Q-learning targets Q^* and can therefore support policy improvement beyond the data-collection policy, but it propagates approximation error through repeated bootstrapped backups; 1-step methods regress directly onto observed returns, avoiding that error accumulation, but estimate only the value of the behavior policy β.
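For concreteness, the contrast can be written in terms of the two regression targets. The notation below is a standard formulation assumed for illustration; it is not taken from the cited paper.

Temporal-difference Q-learning (bootstrapped target for Q^*):

    $$\mathcal{L}_{\mathrm{TD}}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(Q_\theta(s,a) - r - \gamma \max_{a'} Q_{\bar\theta}(s',a')\big)^2\Big]$$

1-step (Monte Carlo) evaluation (unbootstrapped target for Q^β):

    $$\mathcal{L}_{\mathrm{MC}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}\Big[\big(Q_\theta(s_t,a_t) - \textstyle\sum_{k \ge t} \gamma^{k-t} r_k\big)^2\Big]$$

Here $\mathcal{D}$ is the offline dataset and $Q_{\bar\theta}$ a (possibly delayed) target copy of the learned Q-function. The TD target reuses the estimate $Q_{\bar\theta}(s',a')$, so approximation error, especially at actions $a'$ poorly covered by the dataset, is copied into $Q_\theta(s,a)$ and can accumulate over repeated backups (the compounding referred to above). The Monte Carlo target uses only returns observed under the behavior policy β, so errors do not compound, but the learned function is Q^β rather than Q^*.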
References
However, their [1-step methods'] strong performance over the years continues to suggest that it remains an open question whether the compounding errors of Q-learning outweigh the benefits from learning the behavioral value function, rather than the value function of the optimal policy.
                — Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations
                
                (arXiv:2509.20478, Myers et al., 24 Sep 2025), Section 2.2 (Offline Reinforcement Learning)