Multiple-policy Evaluation via Density Estimation (2404.00195v2)
Abstract: We study the multiple-policy evaluation problem, in which we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named $\mathrm{CAESAR}$ for this problem. Our approach is based on computing an approximately optimal offline sampling distribution and using the data sampled from it to perform simultaneous estimation of the policy values. $\mathrm{CAESAR}$ has two phases. In the first, we produce coarse estimates of the visitation distributions of the target policies, at a low-order sample complexity rate that scales as $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal offline sampling distribution and compute the importance-weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE \cite{nachum2019dualdice} objective. Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.
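To make the second phase concrete, here is a minimal tabular sketch (not the paper's implementation) of one step of a DualDICE-style recursion for a single target policy $\pi$. If the previous-step ratio $w_{h-1} \approx d_{h-1}^{\pi}/\mu_{h-1}$ is accurate, the population minimizer of the step-wise quadratic loss $L_h(w) = \mathbb{E}_{(s,a)\sim\mu_h}[w(s,a)^2/2] - \mathbb{E}_{(s,a)\sim\mu_{h-1},\,s'\sim P(\cdot|s,a),\,a'\sim\pi(\cdot|s')}[w_{h-1}(s,a)\,w(s',a')]$ is exactly $d_h^{\pi}/\mu_h$, so stochastic gradient descent on its empirical version recovers the importance-weighting ratios step by step. All identifiers below (`estimate_ratio_step`, `transitions`, `lr`, `n_iters`) are illustrative assumptions, not names from the paper.

```python
import numpy as np

def estimate_ratio_step(mu_h_samples, transitions, w_prev, pi,
                        n_states, n_actions, lr=0.1, n_iters=2000, seed=0):
    """Fit w_h(s, a) ~= d_h^pi(s, a) / mu_h(s, a) given w_{h-1}.

    mu_h_samples: list of (s, a) pairs drawn from mu_h.
    transitions:  list of (s, a, s') triples with (s, a) ~ mu_{h-1}
                  and s' ~ P(.|s, a).
    w_prev:       (n_states, n_actions) array, ratio estimate at step h-1.
    pi:           (n_states, n_actions) array, target policy pi(a|s).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Stochastic gradient of the quadratic term E_{mu_h}[w^2 / 2].
        s, a = mu_h_samples[rng.integers(len(mu_h_samples))]
        w[s, a] -= lr * w[s, a]
        # Stochastic gradient of the linear term: the next action a' is
        # drawn from the target policy at the sampled next state s'.
        s0, a0, s1 = transitions[rng.integers(len(transitions))]
        a1 = rng.choice(n_actions, p=pi[s1])
        w[s1, a1] += lr * w_prev[s0, a0]
    return w
```

In this sketch the recursion would be seeded at $h=1$, where the visitation distribution is determined by the initial state distribution and $\pi$, and the fitted ratios then feed importance-weighted estimates of each policy's value from data sampled off the single distribution $\mu$.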
- Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
- Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
- CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398–9411, 2020.
- Multiple-policy high-confidence policy evaluation. In International Conference on Artificial Intelligence and Statistics, pages 9470–9487. PMLR, 2023.
- Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516. PMLR, 2019.
- Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
- More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
- Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. arXiv preprint arXiv:2103.05741, 2021.
- Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208:383–416, 2013.
- David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.
- Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 538–546, 2017.
- Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436. JMLR Workshop and Conference Proceedings, 2011.
- Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747–2758, 2020.
- Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
- Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
- Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
- Stanislav Minsker. Efficient median of means estimator. In The Thirty Sixth Annual Conference on Learning Theory, pages 5925–5933. PMLR, 2023.
- DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019.
- Reinforcement learning: An introduction. MIT Press, 2018.
- Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- Agent based decision support system using reinforcement learning under emergency circumstances. In Advances in Natural Computation: First International Conference, ICNC 2005, Changsha, China, August 27-29, 2005, Proceedings, Part I, pages 888–892. Springer, 2005.
- Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
- Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.
- The role of coverage in online reinforcement learning. arXiv preprint arXiv:2210.04157, 2022. URL https://api.semanticscholar.org/CorpusID:252780137.
- Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
- Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567–1575. PMLR, 2021.
- Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.