Sample Complexity Characterization for Linear Contextual MDPs (2402.02700v1)
Abstract: Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with the different MDPs indexed by a context variable. While CMDPs serve as an important framework for modeling many real-world applications with time-varying environments, they remain largely unexplored from a theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I, with context-varying representations and common linear weights for all contexts; and Model II, with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they attain a guaranteed $\epsilon$-suboptimality gap with the desired polynomial sample complexity. In particular, instantiating our result for the first model in the tabular CMDP setting improves the existing result by removing the reachability assumption. Our result for the second model is the first known result for this type of function approximation model. A comparison between our results for the two models further indicates that, under linear CMDPs, context-varying features lead to much better sample efficiency than common representations shared across all contexts.
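To make the distinction between the two models concrete, the following is a minimal sketch of how the two linear parameterizations might be written, assuming the standard low-rank factorization of the transition kernel and a linear reward; the symbols ($\phi$ for features, $\mu$ and $\theta$ for weights, $\omega$ for the context, $h$ for the step) are illustrative notation and not taken verbatim from the paper.

Model I (context-varying features, common weights across contexts):
$$P_h^{\omega}(s' \mid s, a) = \big\langle \phi_h^{\omega}(s, a),\, \mu_h(s') \big\rangle, \qquad r_h^{\omega}(s, a) = \big\langle \phi_h^{\omega}(s, a),\, \theta_h \big\rangle.$$

Model II (common features across contexts, context-varying weights):
$$P_h^{\omega}(s' \mid s, a) = \big\langle \phi_h(s, a),\, \mu_h^{\omega}(s') \big\rangle, \qquad r_h^{\omega}(s, a) = \big\langle \phi_h(s, a),\, \theta_h^{\omega} \big\rangle.$$

Under this reading, Model I lets each context carry its own representation $\phi^{\omega}$ while sharing the weights, whereas Model II shares a single representation $\phi$ and lets the weights $(\mu^{\omega}, \theta^{\omega})$ vary with the context.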