Sequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity (2404.07266v2)
Abstract: We study the problem of online sequential decision-making given auxiliary demonstrations from experts who made their decisions based on unobserved contextual information. These demonstrations can be viewed as solving related but slightly different problems from the one the learner faces. This setting arises in many application domains, such as self-driving cars, healthcare, and finance, where expert demonstrations are made using contextual information that is not recorded in the data available to the learning agent. We model the problem as zero-shot meta-reinforcement learning with an unknown distribution over the unobserved contextual variables and a Bayesian regret minimization objective, where the unobserved variables are encoded as parameters with an unknown prior. We propose the Experts-as-Priors algorithm (ExPerior), an empirical Bayes approach that utilizes expert data to establish an informative prior distribution over the learner's decision-making problem. This prior distribution enables the application of any Bayesian approach to online decision-making, such as posterior sampling. We demonstrate that our strategy surpasses existing behaviour cloning, online, and online-offline baselines for multi-armed bandits, Markov decision processes (MDPs), and partially observable MDPs, showcasing the broad reach and utility of ExPerior in using expert demonstrations across different decision-making setups.
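As a concrete illustration of the posterior-sampling setup the abstract describes, the sketch below runs Thompson sampling in a Bernoulli multi-armed bandit where the Beta prior of each arm is tilted toward arms that expert demonstrations favour. This is a minimal, hedged example of the general idea (an expert-informed prior plugged into a Bayesian online algorithm), not the paper's actual ExPerior method: the `expert_informed_priors` helper, its `strength` hyperparameter, and the demonstration data are all hypothetical choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_informed_priors(expert_actions, n_arms, strength=5.0):
    """Map expert action frequencies to Beta(alpha, beta) priors per arm.

    Arms the experts chose often receive a more optimistic prior;
    `strength` (a hypothetical hyperparameter) controls how informative
    the prior is relative to a uniform Beta(1, 1).
    """
    counts = np.bincount(expert_actions, minlength=n_arms)
    freqs = counts / max(counts.sum(), 1)
    alpha = 1.0 + strength * freqs            # pseudo-successes
    beta = 1.0 + strength * (1.0 - freqs)     # pseudo-failures
    return alpha, beta

def thompson_sampling(true_means, expert_actions, horizon=1000):
    """Posterior sampling in a Bernoulli bandit with an expert-informed prior."""
    n_arms = len(true_means)
    alpha, beta = expert_informed_priors(expert_actions, n_arms)
    regret = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)          # sample mean rewards from the posterior
        arm = int(np.argmax(theta))            # act greedily w.r.t. the sampled model
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward                   # conjugate Beta-Bernoulli update
        beta[arm] += 1 - reward
        regret += max(true_means) - true_means[arm]
    return regret

# Experts (who saw extra context) mostly pulled arm 2, which here happens to be optimal.
demo = np.array([2, 2, 1, 2, 2, 0, 2, 2])
print(thompson_sampling(np.array([0.3, 0.5, 0.7]), demo))
```

With an informative prior, the learner concentrates exploration on expert-preferred arms early on; setting `strength=0` recovers standard Thompson sampling with a uniform prior.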
Authors:
- Vahid Balazadeh
- Keertana Chidambaram
- Viet Nguyen
- Rahul G. Krishnan
- Vasilis Syrgkanis