Metareasoning in uncertain environments: a meta-BAMDP framework (2408.01253v2)
Abstract: \textit{Reasoning} may be viewed as an algorithm $P$ that chooses an action $a^* \in \mathcal{A}$ so as to optimize some outcome. However, executing $P$ itself incurs costs (time, energy, limited capacity, etc.), which must be weighed against the explicit utility obtained by making the choice in the underlying decision problem. Finding the right $P$ can itself be framed as an optimization problem over the space of reasoning processes, generally referred to as \textit{metareasoning}. Conventional models of human metareasoning assume that the agent knows the transition and reward distributions of the underlying MDP. This paper generalizes such models by proposing a meta Bayes-Adaptive MDP (meta-BAMDP) framework that handles metareasoning in environments with unknown reward/transition distributions, encompassing a far larger and more realistic class of planning problems that humans and AI systems face. As a first step, we apply the framework to Bernoulli bandit tasks. Owing to the meta-problem's complexity, our solutions are necessarily approximate; however, we introduce two novel theorems that significantly enhance the tractability of the problem, enabling stronger approximations that are robust within a range of assumptions grounded in realistic human decision-making scenarios. These results offer a resource-rational perspective and a normative framework for understanding human exploration under cognitive constraints, and they yield experimentally testable predictions about human behavior in Bernoulli bandit tasks.
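To make the Bayes-adaptive setting concrete, here is a minimal Python sketch (not taken from the paper) of the belief states a BAMDP ranges over in a two-armed Bernoulli bandit: the agent tracks a Beta posterior over each arm's unknown reward probability, and observing a reward transitions the belief state itself. The names `BetaBelief` and `greedy_arm` are illustrative assumptions, and the greedy policy shown is only a zero-computation baseline, not the paper's metareasoning solution.

```python
from dataclasses import dataclass

@dataclass
class BetaBelief:
    """Beta(alpha, beta) posterior over one arm's Bernoulli reward probability."""
    alpha: float = 1.0  # 1 + observed successes (uniform prior)
    beta: float = 1.0   # 1 + observed failures

    def mean(self) -> float:
        # Posterior mean of the arm's reward probability.
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: int) -> "BetaBelief":
        # Bayes-adaptive transition: the observation moves the belief state.
        return BetaBelief(self.alpha + reward, self.beta + (1 - reward))

def greedy_arm(beliefs: list[BetaBelief]) -> int:
    """Exploit current posterior means; a metareasoner would first decide
    whether deeper lookahead over future belief states is worth its cost."""
    return max(range(len(beliefs)), key=lambda i: beliefs[i].mean())

# Usage: the belief vector is part of the (augmented) state of the BAMDP.
beliefs = [BetaBelief(), BetaBelief()]
beliefs[0] = beliefs[0].update(reward=1)  # arm 0 paid off once
print(greedy_arm(beliefs))                # -> 0
```

In the meta-BAMDP view, a reasoning process $P$ operates over exactly these belief states, and the metareasoning problem is whether further computation on them (e.g., lookahead beyond the greedy choice) justifies its time and energy cost.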