Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach (2310.11531v2)
Abstract: In this paper, we study the problem of efficient online reinforcement learning in the infinite-horizon setting when there is an offline dataset to start with. We assume that the offline dataset was generated by an expert with an unknown level of competence, i.e., it is imperfect and not necessarily generated under the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can achieve substantially lower cumulative regret than if it ignores this structure. We establish an upper bound on the regret of the exact informed PSRL (iPSRL) algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite-horizon setting. We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
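To make the idea concrete, the following is a minimal, illustrative sketch (not the paper's exact iPSRL or Informed RLSVI algorithm) of an "informed" posterior-sampling loop on a small tabular MDP. The assumptions are mine: rewards are known, the expert is Boltzmann-rational with a known inverse temperature `beta` standing in for the competence parameter, and the informed posterior is approximated by reweighting a few Dirichlet posterior samples by the likelihood of the expert's offline actions. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, beta = 5, 3, 0.9, 2.0    # beta: assumed expert "competence"

P_true = rng.dirichlet(np.ones(S), size=(S, A))   # unknown dynamics, S x A x S
R = rng.uniform(size=(S, A))                      # rewards assumed known here

def q_values(P, tol=1e-8):
    """Q* of a discounted tabular MDP via value iteration."""
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V                     # (S, A)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return Q
        V = V_new

def expert_loglik(P, offline_sa):
    """Log-likelihood of the expert's (s, a) pairs under softmax(beta * Q*_P)."""
    Q = q_values(P)
    logp = beta * Q - np.log(np.exp(beta * Q).sum(axis=1, keepdims=True))
    return sum(logp[s, a] for s, a in offline_sa)

# ---- Offline dataset generated by a noisy (Boltzmann-rational) expert ----
Q_star = q_values(P_true)
counts = np.ones((S, A, S))            # Dirichlet(1) prior over transitions
offline_sa, s = [], 0
for _ in range(200):
    p = np.exp(beta * Q_star[s]); p /= p.sum()
    a = rng.choice(A, p=p)
    s_next = rng.choice(S, p=P_true[s, a])
    counts[s, a, s_next] += 1          # transitions inform the dynamics posterior
    offline_sa.append((s, a))          # actions carry information via the competence model
    s = s_next

# ---- Online loop: sample from an (approximate) informed posterior, act greedily ----
def sample_dirichlet_mdp():
    return np.array([[rng.dirichlet(counts[s_, a_]) for a_ in range(A)]
                     for s_ in range(S)])

s = 0
for t in range(300):
    cands = [sample_dirichlet_mdp() for _ in range(5)]
    w = np.array([expert_loglik(P, offline_sa) for P in cands])
    w = np.exp(w - w.max()); w /= w.sum()
    P_hat = cands[rng.choice(len(cands), p=w)]    # reweighted posterior sample
    a = q_values(P_hat).argmax(axis=1)[s]
    s_next = rng.choice(S, p=P_true[s, a])
    counts[s, a, s_next] += 1                     # online data updates the posterior
    s = s_next
```

The point of the reweighting step is that the offline actions themselves, not just the observed transitions, shift the posterior toward MDPs whose optimal policies the expert appears to be (noisily) following; an uninformed PSRL baseline would simply use the first Dirichlet sample and discard `offline_sa`.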