On the Model-Misspecification in Reinforcement Learning (2306.10694v2)

Published 19 Jun 2023 in cs.LG

Abstract: The success of reinforcement learning (RL) crucially depends on effective function approximation when dealing with complex ground-truth models. Existing sample-efficient RL algorithms primarily employ three approaches to function approximation: policy-based, value-based, and model-based methods. However, in the face of model misspecification (a disparity between the ground-truth model and the best approximator in the function class), policy-based approaches have been shown to remain robust even under a large locally-bounded misspecification error: the function class may exhibit an $\Omega(1)$ approximation error at specific states and actions, as long as the error stays small on average under a policy-induced state distribution. It remains an open question whether similar robustness can be achieved with value-based and model-based approaches, especially with general function approximation. To bridge this gap, this paper presents a unified theoretical framework for addressing model misspecification in RL. We demonstrate that, through careful algorithm design and refined analysis, value-based and model-based methods employing general function approximation can achieve robustness under local misspecification error bounds. In particular, they attain a regret bound of $\widetilde{O}\left(\mathrm{poly}(dH)\left(\sqrt{K} + K\zeta\right)\right)$, where $d$ represents the complexity of the function class, $H$ is the episode length, $K$ is the total number of episodes, and $\zeta$ denotes the local bound on the misspecification error. Furthermore, we propose an algorithmic framework that achieves the same order of regret bound without prior knowledge of $\zeta$, thereby enhancing its practical applicability.
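To make the scaling of this regret bound concrete, here is a minimal numerical sketch. It is not from the paper: the `regret_scaling` helper and the sample values of $K$ and $\zeta$ are illustrative assumptions, and the $\mathrm{poly}(dH)$ prefactor is dropped. It only evaluates the $K$-dependent term $\sqrt{K} + K\zeta$.

```python
import math

# Illustrative only: evaluate the K-dependent part of the regret bound
# O(poly(dH) * (sqrt(K) + K*zeta)) from the abstract, dropping the
# poly(dH) factor. The values of K and zeta below are hypothetical.
def regret_scaling(K: int, zeta: float) -> float:
    """Dominant K-dependence of the regret bound: sqrt(K) + K * zeta."""
    return math.sqrt(K) + K * zeta

for K in (10_000, 100_000, 1_000_000):
    for zeta in (0.0, 1e-3, 1e-2):
        r = regret_scaling(K, zeta)
        # Average per-episode regret r/K = 1/sqrt(K) + zeta: the sqrt(K)
        # term vanishes on average, while K*zeta contributes a flat zeta.
        print(f"K={K:>9,}  zeta={zeta:<6}  bound~{r:>12,.1f}  per-episode~{r/K:.4f}")
```

Since $(\sqrt{K} + K\zeta)/K = 1/\sqrt{K} + \zeta$, the average per-episode regret converges to $\zeta$ as $K$ grows: the bound is non-trivial exactly when the local misspecification error is small, which is the regime the paper's robustness guarantees target.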

Authors (2)
  1. Yunfan Li
  2. Lin Yang