Decision Making in Non-Stationary Environments with Policy-Augmented Search (2401.03197v2)

Published 6 Jan 2024 in cs.AI and cs.LG

Abstract: Sequential decision-making under uncertainty is present in many important problems. Two popular approaches for tackling such problems are reinforcement learning and online search (e.g., Monte Carlo tree search). While the former learns a policy by interacting with the environment (typically done before execution), the latter uses a generative model of the environment to sample promising action trajectories at decision time. Decision-making is particularly challenging in non-stationary environments, where the environment in which an agent operates can change over time. Both approaches have shortcomings in such settings: on the one hand, policies learned before execution become stale when the environment changes, and relearning takes both time and computational effort; on the other hand, online search can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce Policy-Augmented Monte Carlo tree search (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action, and we bound the error accrued while following PA-MCTS as a policy. We compare and contrast our approach with AlphaZero, another hybrid planning approach, and with Deep Q-Learning on several OpenAI Gym environments. Through extensive experiments, we show that under non-stationary settings with limited time constraints, PA-MCTS outperforms these baselines.
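The core idea described in the abstract lends itself to a short sketch. The snippet below is a minimal illustration of how a stale policy's action-value estimates could be blended with online search returns at decision time; the convex-combination weighting, the parameter alpha, and the helper callables q_policy and mcts_value are assumptions made for illustration, not the paper's exact formulation.

```python
def policy_augmented_action(state, actions, q_policy, mcts_value, alpha=0.5):
    """Select the action maximizing a blend of two value estimates (illustrative sketch).

    q_policy(state, a):   action-value from a policy learned before execution,
                          possibly stale after the environment has changed.
    mcts_value(state, a): return estimate from an online search (e.g., MCTS)
                          run with an up-to-date model of the environment.
    alpha:                weight on the online-search estimate; 0 trusts only
                          the stale policy, 1 trusts only the search.
    """
    def blended_value(a):
        # Hypothetical convex combination of the two estimates.
        return alpha * mcts_value(state, a) + (1.0 - alpha) * q_policy(state, a)

    return max(actions, key=blended_value)
```

Intuitively, a small alpha leans on the pre-trained policy when the search budget is too tight for reliable rollouts, while a large alpha leans on the up-to-date model when the environment has drifted far from the one the policy was trained on.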
