On the Model-Misspecification in Reinforcement Learning (2306.10694v2)
Abstract: The success of reinforcement learning (RL) crucially depends on effective function approximation when dealing with complex ground-truth models. Existing sample-efficient RL algorithms primarily employ three approaches to function approximation: policy-based, value-based, and model-based methods. However, in the face of model misspecification (a discrepancy between the ground truth and the best available function approximator), it has been shown that policy-based approaches can remain robust even when the policy function approximation suffers a large locally-bounded misspecification error, under which the function class may exhibit an $\Omega(1)$ approximation error at certain states and actions while the error remains small on average under a policy-induced state distribution. Yet it remains an open question whether similar robustness can be achieved with value-based and model-based approaches, especially with general function approximation. To bridge this gap, in this paper we present a unified theoretical framework for addressing model misspecification in RL. We demonstrate that, through careful algorithm design and refined analysis, value-based and model-based methods employing general function approximation can achieve robustness under local misspecification error bounds. In particular, they attain a regret bound of $\widetilde{O}\left(\mathrm{poly}(dH)(\sqrt{K} + K\zeta) \right)$, where $d$ represents the complexity of the function class, $H$ is the episode length, $K$ is the total number of episodes, and $\zeta$ denotes the local bound on the misspecification error. Furthermore, we propose an algorithmic framework that achieves the same order of regret bound without prior knowledge of $\zeta$, thereby enhancing its practical applicability.
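To make the local-versus-global distinction in the abstract concrete, the following display is a minimal sketch, not the paper's formal definition: the notation (a value-function class $\mathcal{F}$, the optimal action-value function $Q^{\star}$ as the approximation target, and the state-action distribution $d^{\pi}$ induced by a policy $\pi$) is assumed here for illustration only, and the paper may instead measure the error through a Bellman-operator or model-based criterion. The point is that the worst-case (pointwise) error may be large while the policy-averaged error is still bounded by $\zeta$:

$$\sup_{s,a}\ \inf_{f\in\mathcal{F}}\bigl|f(s,a)-Q^{\star}(s,a)\bigr| \;=\; \Omega(1)\ \text{is permitted,}\qquad \text{while}\qquad \max_{\pi}\ \mathbb{E}_{(s,a)\sim d^{\pi}}\Bigl[\inf_{f\in\mathcal{F}}\bigl|f(s,a)-Q^{\star}(s,a)\bigr|\Bigr] \;\le\; \zeta.$$

Under this reading, the $K\zeta$ term in the regret bound $\widetilde{O}\left(\mathrm{poly}(dH)(\sqrt{K} + K\zeta)\right)$ reflects an unavoidable per-episode cost of the (locally bounded) misspecification, while the $\sqrt{K}$ term matches the usual well-specified rate.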