
Multiple-policy Evaluation via Density Estimation (2404.00195v2)

Published 29 Mar 2024 in cs.LG and cs.AI

Abstract: We study the multiple-policy evaluation problem where we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named $\mathrm{CAESAR}$ for this problem. Our approach is based on computing an approximate optimal offline sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. $\mathrm{CAESAR}$ has two phases. In the first we produce coarse estimates of the visitation distributions of the target policies at a low order sample complexity rate that scales with $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal offline sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE \cite{nachum2019dualdice} objective. Up to low order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.

References (27)
  1. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
  2. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26. JMLR Workshop and Conference Proceedings, 2011.
  3. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398–9411, 2020.
  4. Multiple-policy high-confidence policy evaluation. In International Conference on Artificial Intelligence and Statistics, pages 9470–9487. PMLR, 2023.
  5. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516. PMLR, 2019.
  6. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
  7. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1447–1456. PMLR, 2018.
  8. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. arXiv preprint arXiv:2103.05741, 2021.
  9. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208:383–416, 2013.
  10. David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
  11. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 538–546, 2017.
  12. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436. JMLR Workshop and Conference Proceedings, 2011.
  13. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747–2758, 2020.
  14. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
  15. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616. PMLR, 2015.
  16. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018.
  17. Stanislav Minsker. Efficient median of means estimator. In The Thirty-Sixth Annual Conference on Learning Theory, pages 5925–5933. PMLR, 2023.
  18. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019.
  19. Reinforcement learning: An introduction. MIT Press, 2018.
  20. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
  21. Agent based decision support system using reinforcement learning under emergency circumstances. In Advances in Natural Computation: First International Conference, ICNC 2005, Changsha, China, August 27-29, 2005, Proceedings, Part I 1, pages 888–892. Springer, 2005.
  22. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv preprint arXiv:2005.00527, 2020.
  23. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Advances in Neural Information Processing Systems, 32, 2019.
  24. The role of coverage in online reinforcement learning. arXiv preprint arXiv:2210.04157, 2022. URL https://api.semanticscholar.org/CorpusID:252780137.
  25. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
  26. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567–1575. PMLR, 2021.
  27. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.

Summary

  • The paper introduces CAESAR, a novel algorithm that evaluates multiple RL policies simultaneously using optimal sampling and density estimation techniques.
  • It efficiently computes an optimal offline sampling distribution by leveraging coarse visitation estimates to reduce sample complexity compared to naive methods.
  • Theoretical results guarantee non-asymptotic sample efficiency, paving the way for faster development and refinement of RL policies in practical applications.

Multiple-policy Evaluation via Density Estimation

Introduction to Multiple-policy Evaluation

Policy evaluation, a central problem within Reinforcement Learning (RL), seeks to estimate the expected total reward from following a given policy. This is essential both for assessing the performance of existing policies and for guiding the development of new ones. The multiple-policy evaluation scenario, where the objective is to evaluate the performance of not just one, but a set of $K$ target policies, presents an interesting challenge. The naive approach of applying single-policy evaluation methods $K$ times does not leverage the potential overlap in the policies' behavior, leading to inefficiencies. This paper introduces a novel algorithm, CAESAR, which targets this gap by proposing an efficient means of evaluating multiple policies simultaneously.
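
For concreteness, here is a minimal sketch (not from the paper) of that naive baseline: each of the $K$ policies is evaluated independently via Monte Carlo rollouts in a tabular finite-horizon MDP, so the episode budget grows linearly in $K$ no matter how much the policies overlap. The `env` interface and the `(H, S, A)`-shaped policy arrays are assumed for illustration.

```python
import numpy as np

def mc_evaluate(env, policy, horizon, n_episodes, rng):
    """Naive Monte Carlo estimate of one policy's expected total reward.

    Assumes a hypothetical tabular env with reset() -> state, step(a) -> (state, reward),
    and an attribute n_actions; policy is an array of shape (H, S, A) of action distributions.
    """
    returns = np.zeros(n_episodes)
    for i in range(n_episodes):
        s = env.reset()
        total = 0.0
        for h in range(horizon):
            a = rng.choice(env.n_actions, p=policy[h, s])  # sample a ~ policy[h, s]
            s, r = env.step(a)
            total += r
        returns[i] = total
    return returns.mean()

def naive_multi_eval(env, policies, horizon, n_episodes, seed=0):
    """Evaluate K policies independently: costs K * n_episodes episodes,
    ignoring any overlap in the policies' state-action visitations."""
    rng = np.random.default_rng(seed)
    return [mc_evaluate(env, pi, horizon, n_episodes, rng) for pi in policies]
```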

Key Innovations of CAESAR

Efficient Computation of Optimal Sampling Distribution

CAESAR operates in two main phases, the first of which involves generating coarse estimates of the visitation distributions of each target policy using a sample complexity that scales with $\tilde{O}(\frac{1}{\epsilon})$. These estimations then inform the computation of an optimal offline sampling distribution. Notably, this distribution is approximated to ensure that it lies within the convex hull of the target policies' visitation distributions, facilitating feasible sample generation for the estimation process. Through this, CAESAR leverages similarities among target policies to ensure sample efficiency.
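
As a rough illustration of this step, the sketch below (a toy under stated assumptions, not the paper's actual procedure) takes coarse visitation estimates $\hat{d}_k$ for a single timestep and searches for a convex combination of them that minimizes the worst-case term $\max_k \sum_{s,a} \hat{d}_k(s,a)^2 / \mu(s,a)$, using a simple exponentiated-gradient loop on the mixture weights; the paper's algorithm handles all timesteps jointly and comes with formal guarantees.

```python
import numpy as np

def mixture_sampling_dist(d_hat, n_iters=500, lr=0.1):
    """Approximate an offline sampling distribution mu as a convex combination of the
    coarse visitation estimates d_hat[k], chosen to (roughly) minimize
    max_k sum_{s,a} d_hat[k]^2 / mu, for one fixed timestep h.

    d_hat: array of shape (K, S*A), each row a probability vector.
    Returns (mu, alpha) with mu = alpha @ d_hat.
    """
    K = d_hat.shape[0]
    alpha = np.full(K, 1.0 / K)                 # mixture weights on the simplex
    eps = 1e-12
    for _ in range(n_iters):
        mu = alpha @ d_hat + eps
        obj = np.sum(d_hat ** 2 / mu, axis=1)   # one term per target policy
        k_star = int(np.argmax(obj))            # policy attaining the max
        # subgradient of the max objective with respect to alpha
        grad = -(d_hat[k_star] ** 2 / mu ** 2) @ d_hat.T
        grad = grad / (np.abs(grad).max() + eps)  # normalize step for stability
        alpha = alpha * np.exp(-lr * grad)        # exponentiated-gradient update
        alpha /= alpha.sum()                      # stay on the simplex
    return alpha @ d_hat, alpha
```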

Importance Weighting for Policy Evaluation

Building on the established sampling distribution, CAESAR employs a novel application of importance weighting for multi-policy evaluation. Inspired by DualDICE, the algorithm minimizes a step-wise quadratic loss function to estimate importance weighting ratios accurately. Essentially, it tailors the density estimation technique for the finite-horizon, tabular MDP settings, enabling the accurate estimation of policy values with non-asymptotic sample complexity guarantees.
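
To make the role of these ratios concrete, here is a hedged sketch of the downstream estimator: assuming the step-wise ratios $w_h^k(s,a) \approx d_h^{\pi^k}(s,a)/\mu_h(s,a)$ have already been obtained (e.g., via the DualDICE-inspired loss), every target policy's value can be estimated from the same offline dataset by reweighting the observed rewards. The data layout and variable names here are illustrative, not the paper's.

```python
import numpy as np

def estimate_values(data, weights):
    """Importance-weighted value estimates for K target policies from one offline dataset.

    data:    list over timesteps h of arrays of shape (n_h, 3) with rows (s, a, r),
             where (s, a) was sampled from the shared distribution mu_h.
    weights: weights[k][h] is a 2D array indexed [s, a] holding the estimated ratio
             w_h^k(s, a) ~= d_h^{pi_k}(s, a) / mu_h(s, a).
    Returns an array of K estimated policy values.
    """
    K, H = len(weights), len(data)
    v_hat = np.zeros(K)
    for k in range(K):
        for h in range(H):
            s = data[h][:, 0].astype(int)
            a = data[h][:, 1].astype(int)
            r = data[h][:, 2]
            # reweight rewards sampled under mu_h toward policy pi_k's visitation
            v_hat[k] += np.mean(weights[k][h][s, a] * r)
    return v_hat
```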

Theoretical Contributions

The paper establishes a finite sample complexity result for the problem of multi-policy evaluation, showcasing how CAESAR significantly outperforms naive uniform sampling over target policies. Specifically, under certain conditions, it achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$. Importantly, it also demonstrates that the estimated sampling distribution, derived from the coarse estimates of the visitation distributions, approaches the efficiency of the optimal distribution. This establishes a theoretical foundation for sample-efficient multi-policy evaluation without the additional complexity of an extensive search across all deterministic policies.
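
The following toy computation (not from the paper) illustrates how the central quantity $\sum_{s,a}(d_h^{\pi^k}(s,a))^2/\mu_h(s,a)$ behaves at a single timestep when $\mu_h$ is simply taken to be the uniform mixture of the target visitations (the optimal $\mu^*_h$ can only be at least as good): for nearly identical policies the term stays near 1, while for policies with disjoint supports it grows like $K$, recovering the cost of $K$ separate evaluations.

```python
import numpy as np

def complexity_term(d, mu):
    """max_k sum_{s,a} d_k(s,a)^2 / mu(s,a) for one timestep."""
    return np.max(np.sum(d ** 2 / mu, axis=1))

rng = np.random.default_rng(0)
K, D = 5, 20                        # 5 target policies, 20 (s, a) pairs

# Case 1: identical policies -> one shared sampling distribution suffices
base = rng.dirichlet(np.ones(D))
d_overlap = np.tile(base, (K, 1))
print(complexity_term(d_overlap, d_overlap.mean(axis=0)))    # ~1: cost of a single evaluation

# Case 2: disjoint supports -> the term grows like K, matching K separate evaluations
d_disjoint = np.zeros((K, D))
for k in range(K):
    d_disjoint[k, 4 * k: 4 * (k + 1)] = 0.25
print(complexity_term(d_disjoint, d_disjoint.mean(axis=0)))  # ~K
```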

Practical Implications and Future Directions

CAESAR paves the way for more effective and efficient multi-policy evaluations, crucial for scenarios where multiple policies (e.g., resulting from different configurations or hyperparameters) must be assessed concurrently. This can significantly speed up the iterative process of RL algorithm development and policy refinement, especially in domains where data collection is expensive or time-consuming.

However, the developed methodology introduces new questions and potential research directions. For instance, the approach's $H^4$ dependence on the horizon suggests that further refinements could yield more scalable solutions, especially for problems with long horizons. Additionally, exploring how reward-dependent sample complexities can further optimize the evaluation process remains an open area, particularly in sparse reward environments where focusing on significant state-action pairs could lead to efficiency gains.

Conclusion

The CAESAR algorithm presents a significant step forward in the domain of multiple-policy evaluation, providing a methodologically sound and theoretically backed approach to efficiently estimate the performance of multiple policies. Its development not only addresses an existing gap in the literature but also opens avenues for further research into more efficient and effective policy evaluation methods within RL.
