
Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes (2401.01841v3)

Published 3 Jan 2024 in cs.AI and cs.LG

Abstract: A fundamental (and largely open) challenge in sequential decision-making is dealing with non-stationary environments, where exogenous environmental conditions change over time. Such problems are traditionally modeled as non-stationary Markov decision processes (NSMDPs). However, existing approaches for decision-making in NSMDPs have two major shortcomings: first, they assume that the updated environmental dynamics at the current time are known (although future dynamics can change); and second, planning is largely pessimistic, i.e., the agent acts "safely" to account for the non-stationary evolution of the environment. We argue that both these assumptions are invalid in practice -- updated environmental conditions are rarely known, and as the agent interacts with the environment, it can learn about the updated dynamics and avoid being pessimistic, at least in states whose dynamics it is confident about. We present a heuristic search algorithm called Adaptive Monte Carlo Tree Search (ADA-MCTS) that addresses these challenges. We show that the agent can learn the updated dynamics of the environment over time and then act as it learns, i.e., if the agent is in a region of the state space about which it has updated knowledge, it can avoid being pessimistic. To quantify "updated knowledge," we disentangle the aleatoric and epistemic uncertainty in the agent's updated belief and show how the agent can use these estimates for decision-making. We compare the proposed approach with multiple state-of-the-art approaches in decision-making across multiple well-established open-source problems and empirically show that our approach is faster and highly adaptive without sacrificing safety.


Summary

  • The paper introduces ADA-MCTS, a heuristic algorithm that effectively learns and adapts to changing dynamics in non-stationary MDPs.
  • It distinguishes between epistemic and aleatoric uncertainty, enabling the method to dynamically balance risk-averse and reward-seeking strategies.
  • Extensive experiments in open-source environments show ADA-MCTS outperforming state-of-the-art baselines even with limited environmental information.

Overview

In sequential decision-making, agents must often navigate complex environments that change unpredictably over time. A central challenge in such settings is not only how an agent can learn about these changes but also how it can make decisions while it is still learning. This paper addresses that challenge within the framework of non-stationary Markov decision processes (NSMDPs), overcoming the limitations of existing methods that either assume the current environmental dynamics are known or adopt a uniformly conservative stance toward uncertainty.
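To make the NSMDP framing concrete, the sketch below shows a toy environment whose transition probabilities drift with time. The two-state chain, the drift schedule, and the class name ToyNSMDP are invented purely for illustration and are not taken from the paper.

```python
import numpy as np

class ToyNSMDP:
    """Minimal illustrative NSMDP: a two-state chain whose transition
    probabilities drift over time (the exogenous, non-stationary change)."""

    def __init__(self, drift_rate=0.01, seed=0):
        self.drift_rate = drift_rate
        self.rng = np.random.default_rng(seed)

    def transition_probs(self, state, action, t):
        # The chance that an action "succeeds" (moves to the other state)
        # decays with time, so a policy that was optimal at t = 0 can become
        # unsafe later on. The action itself is ignored for brevity.
        p_success = max(0.1, 0.9 - self.drift_rate * t)
        return {1 - state: p_success, state: 1.0 - p_success}

    def step(self, state, action, t):
        probs = self.transition_probs(state, action, t)
        next_state = self.rng.choice(list(probs), p=list(probs.values()))
        reward = 1.0 if next_state == 1 else 0.0
        return int(next_state), reward
```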

Ada-MCTS: A Heuristic Approach

The authors introduce Adaptive Monte Carlo Tree Search (ADA-MCTS), which extends an agent's ability to learn and adapt to updated environmental dynamics. Unlike traditional approaches, ADA-MCTS neither assumes that the current dynamics are fully known nor maintains a uniformly risk-averse stance. As the agent gains knowledge about parts of the state space through interaction, ADA-MCTS lets it make informed decisions tailored to the level of uncertainty in each region.
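One way to picture this idea, though not the authors' exact rule, is a UCT-style selection step whose exploitation term blends a pessimistic value with the value under the learned updated dynamics, weighted by confidence. The Node fields (learned_value, pessimistic_value, epistemic_uncertainty) and the blending rule below are illustrative assumptions, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical statistics an ADA-MCTS-style planner might keep per node.
    visits: int = 0
    learned_value: float = 0.0        # value estimate under the learned, updated dynamics
    pessimistic_value: float = 0.0    # worst-case ("act safely") value estimate
    epistemic_uncertainty: float = 1.0  # 1.0 = no updated knowledge, 0.0 = fully known
    children: dict = field(default_factory=dict)  # action -> child Node

def select_action(node: Node, c_uct: float = 1.4):
    """UCT-style selection whose exploitation term interpolates between the
    pessimistic and learned-model values, weighted by confidence."""
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        confidence = 1.0 - child.epistemic_uncertainty
        blended = confidence * child.learned_value + (1.0 - confidence) * child.pessimistic_value
        exploration = c_uct * math.sqrt(math.log(node.visits + 1) / (child.visits + 1))
        score = blended + exploration
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

In regions the agent has relearned well, confidence approaches 1 and the planner trusts the updated model; in unfamiliar regions it falls back toward the pessimistic estimate.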

The algorithm distinguishes between uncertainty attributable to lack of data (epistemic) and uncertainty inherent to the stochasticity of the environment (aleatoric), and leverages both in planning. This allows the agent to adjust its decision-making strategy dynamically, switching between risk-averse and reward-seeking behavior as appropriate.
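A common way to realize such a decomposition is to maintain an ensemble of learned transition models: the entropy of the averaged prediction gives the total uncertainty, the average entropy of individual members approximates the aleatoric part, and their difference (the disagreement) approximates the epistemic part. The paper's belief representation may differ; the ensemble-based sketch below is only an assumed illustration.

```python
import numpy as np

def decompose_uncertainty(ensemble_probs):
    """Split total predictive uncertainty over next states into aleatoric
    and epistemic parts, given an ensemble of learned transition models.

    ensemble_probs: array of shape (n_models, n_next_states); each row is
    one model's predictive distribution over next states."""
    eps = 1e-12
    mean_probs = ensemble_probs.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + eps))           # entropy of the mean
    aleatoric = -np.mean(np.sum(ensemble_probs * np.log(ensemble_probs + eps), axis=1))
    epistemic = total - aleatoric                                    # disagreement between members
    return aleatoric, epistemic

# Example: three members that agree -> the epistemic term is (near) zero.
probs = np.array([[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]])
print(decompose_uncertainty(probs))
```

When ensemble members agree, the epistemic term vanishes and the agent can rely on the learned dynamics; disagreement inflates it, signaling regions where pessimism is still warranted.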

Experimental Validation

ADA-MCTS was evaluated across several well-established open-source environments and outperformed state-of-the-art baselines. This is especially notable because some baselines were given more information than ADA-MCTS, specifically access to the true updated dynamics of the environment.

Can the success of ADA-MCTS be attributed to particular components of the algorithm? An ablation study confirms this, revealing the pivotal role of the interplay between risk-averse exploration and knowledge transferred from preceding models.

Conclusion

The empirical findings establish ADA-MCTS as a robust algorithm for sequential decision-making in uncertain environments. Its adaptability makes it valuable not only for theoretical study but also for practical applications ranging from autonomous driving to resource management. By adjusting its strategy according to its level of knowledge about the environment, ADA-MCTS pushes the envelope for artificial intelligence systems operating in real-world scenarios where change is the only constant.
