A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health (2402.14807v4)
Abstract: Restless multi-armed bandits (RMAB) have demonstrated success in optimizing resource allocation for large beneficiary populations in public health settings. Unfortunately, RMAB models lack the flexibility to adapt to evolving public health policy priorities. Concurrently, large language models (LLMs) have emerged as adept automated planners across domains of robotic control and navigation. In this paper, we propose a Decision-Language Model (DLM) for RMABs, enabling dynamic fine-tuning of RMAB policies in public health settings using human-language commands. We propose using LLMs as automated planners to (1) interpret human policy preference prompts, (2) propose reward functions as code for a multi-agent RMAB environment, and (3) iterate on the generated reward functions using feedback from grounded RMAB simulations. We illustrate the application of DLM in collaboration with ARMMAN, an India-based non-profit promoting preventative care for pregnant mothers, which currently relies on RMAB policies to optimally allocate health worker calls to low-resource populations. We conduct a technology demonstration in simulation using the Gemini Pro model, showing that DLM can dynamically shape policy outcomes using only human prompts as input.
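The three-step loop described in the abstract (interpret a prompt, propose reward functions as code, refine them using simulation feedback) can be illustrated with a minimal Python sketch. This is not the paper's implementation. Every name here is hypothetical: `propose_rewards` is a stub standing in for an LLM call, the feature `age_over_35` and the transition probabilities are made up, the simulator is a toy with state-independent transitions and a myopic policy, and the reflection step is replaced by a hard-coded metric.

```python
import random
from typing import Callable, Dict, List, Tuple

# A candidate reward maps (arm state, arm features) to a scalar.
Reward = Callable[[int, Dict[str, float]], float]

def propose_rewards(prompt: str, feedback: str) -> Dict[str, Reward]:
    """Hypothetical stand-in for an LLM call: returns fixed candidates
    instead of generating reward code from the prompt and feedback."""
    return {
        "default": lambda s, f: float(s),
        "age_weighted": lambda s, f: s * (1.0 + f["age_over_35"]),
    }

def simulate(reward: Reward, arms: List[Dict[str, float]], budget: int,
             horizon: int, seed: int) -> Dict[str, int]:
    """Toy RMAB rollout (assumption: transitions do not depend on state).
    A myopic policy acts on the `budget` arms with the largest expected
    one-step reward gain from acting."""
    rng = random.Random(seed)
    engaged = {"older": 0, "all": 0}
    for _ in range(horizon):
        gain = [(a["p_act"] - a["p_passive"]) * (reward(1, a) - reward(0, a))
                for a in arms]
        chosen = set(sorted(range(len(arms)), key=lambda i: -gain[i])[:budget])
        for i, a in enumerate(arms):
            p = a["p_act"] if i in chosen else a["p_passive"]
            state = 1 if rng.random() < p else 0
            engaged["all"] += state
            engaged["older"] += state * int(a["age_over_35"])
    return engaged

def dlm_loop(prompt: str, arms: List[Dict[str, float]], budget: int = 2,
             horizon: int = 50, rounds: int = 2) -> Tuple[str, str]:
    """Propose candidates, simulate each, and feed a summary back."""
    feedback, best_name = "", ""
    for r in range(rounds):
        candidates = propose_rewards(prompt, feedback)
        outcomes = {name: simulate(fn, arms, budget, horizon, seed=r)
                    for name, fn in candidates.items()}
        # Stand-in for LLM reflection: select by a hard-coded metric.
        best_name = max(outcomes, key=lambda n: outcomes[n]["older"])
        feedback = f"round {r}: {outcomes}"
    return best_name, feedback

# Hypothetical population: three younger and three older beneficiaries.
young = {"age_over_35": 0.0, "p_act": 0.9, "p_passive": 0.3}
older = {"age_over_35": 1.0, "p_act": 0.9, "p_passive": 0.3}
arms = [dict(young) for _ in range(3)] + [dict(older) for _ in range(3)]

best, log = dlm_loop("Prioritize older mothers", arms)
print(best)
print(log)
```

In this toy setup the age-weighted candidate is expected to be selected, because it directs the limited budget toward the arms flagged as older. In the approach the abstract describes, both the proposal and the selection would be driven by the LLM rather than by fixed stubs.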