On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation (2402.14703v2)
Abstract: We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantees avoid exponential dependence on the horizon. While such estimators exist for MDPs, and POMDPs can be converted to history-based MDPs, the resulting estimation error depends on the state-density ratio, which after conversion becomes a ratio of history densities, an exponentially large object. Recently, Uehara et al. [2022a] proposed future-dependent value functions as a promising framework to address this issue, where the guarantee for memoryless policies depends on the density ratio over the latent state space. However, the guarantee also depends on the boundedness of the future-dependent value function and other related quantities, which we show can be exponential in length and thus erase the advantage of the method. In this paper, we identify novel coverage assumptions tailored to the structure of POMDPs, such as outcome coverage and belief coverage, which enable polynomial bounds on the aforementioned quantities. As a byproduct, our analyses also lead to the discovery of new algorithms with complementary properties.
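For concreteness, here is a minimal sketch of the central object, based on the framework of Uehara et al. [2022a]; the symbols $F$, $S$, and $V^{\pi}$ (a segment of future observations, the latent state, and the latent-state value function of the memoryless evaluation policy $\pi$) are assumed notation rather than anything fixed by the abstract. A future-dependent value function $g_V$ is any function of the future whose conditional expectation given the latent state, under a suitable data distribution, recovers the latent value function:

$$\mathbb{E}\big[\, g_V(F) \mid S = s \,\big] \;=\; V^{\pi}(s) \qquad \text{for every latent state } s.$$

Guarantees can then be stated in terms of the density ratio over the latent state space rather than a ratio of history densities; the curses studied in this paper arise because $\lVert g_V \rVert_{\infty}$ and related quantities need not be polynomially bounded without additional coverage conditions such as those introduced here.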
- Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.
- Augmented balancing weights as linear regression. arXiv preprint arXiv:2304.14545, 2023.
- Lower bounds for learning in revealing POMDPs. arXiv preprint arXiv:2302.01333, 2023.
- Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 1042–1051, 2019.
- Chen, Y.-C. A short note on the median-of-means estimator, 2020.
- SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1133–1142, 2018.
- Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pp. 2701–2709. PMLR, 2020.
- Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
- Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning. MIT Press, 2009.
- Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 652–661, 2016.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137–2143. PMLR, 2020.
- Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
- Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712, 2019.
- Lerasle, M. Lecture notes: Selected topics on robust statistical learning theory. arXiv preprint arXiv:1908.10761, 2019.
- Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366, 2018.
- When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pp. 5175–5220. PMLR, 2022.
- Learning hidden Markov models using conditional samples. In The Thirty-Sixth Annual Conference on Learning Theory, pp. 2014–2066. PMLR, 2023.
- Munos, R. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.
- Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.
- DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019.
- A complete characterization of linear estimators for offline policy evaluation. Journal of Machine Learning Research, 24(284):1–50, 2023.
- Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 759–766, 2000.
- Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2022.
- Minimax weight and Q-function learning for off-policy evaluation. In Proceedings of the 37th International Conference on Machine Learning, pp. 1023–1032, 2020.
- Uehara, M., et al. Future-dependent value-based off-policy evaluation in POMDPs. arXiv preprint arXiv:2207.13081, 2022.
- Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pp. 550–559. PMLR, 2020.
- Batch value-function approximation with only realizability. In International Conference on Machine Learning, pp. 11404–11413. PMLR, 2021.
- Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.
- Bellman-consistent pessimism for offline reinforcement learning. arXiv preprint arXiv:2106.06926, 2021.
- Towards instance-optimal offline reinforcement learning with pessimism. Advances in Neural Information Processing Systems, 34:4065–4078, 2021.
- Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
- Analysis of kernel mean matching under covariate shift. In Proceedings of the 29th International Conference on Machine Learning, pp. 1147–1154, 2012.
- Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pp. 2730–2775. PMLR, 2022.