Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition (2403.04568v1)
Abstract: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mappings, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$ result of Zhao et al. (2023a), since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least-squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality, originally proposed in the dynamic assortment area, that is tailored to non-independent noises and applied in reinforcement learning for the first time to handle correlations between different states.
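For intuition on point (i), transition-parameter estimation in linear mixture MDPs is typically cast as a ridge regression against value-function targets (value-targeted regression). The display below is a minimal sketch of that standard form, with illustrative notation ($\phi$, $\theta^*$, $V_{j,h+1}$, regularizer $\lambda$) assumed here for exposition rather than taken from the paper; the paper's estimator differs by aggregating the visit information of all states rather than a single state:

$$\widehat{\theta}_k = \Big(\lambda I + \sum_{j<k}\sum_{h=1}^{H} \phi_{V_{j,h+1}}(s_{j,h}, a_{j,h})\,\phi_{V_{j,h+1}}(s_{j,h}, a_{j,h})^{\top}\Big)^{-1} \sum_{j<k}\sum_{h=1}^{H} \phi_{V_{j,h+1}}(s_{j,h}, a_{j,h})\, V_{j,h+1}(s_{j,h+1}),$$

where $\phi_{V}(s,a) = \sum_{s'} \phi(s' \mid s, a)\, V(s')$ and the linear mixture model posits $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$. A self-normalized concentration bound on $\widehat{\theta}_k - \theta^*$ in the induced norm then drives the optimistic construction; point (ii) replaces the standard bound with one that tolerates the non-independent noises arising from correlations between different states.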
- Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24 (NIPS), pages 2312–2320.
- Model-based reinforcement learning with value-targeted regression. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 463–474.
- Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 263–272.
- Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26.
- Provably efficient exploration in policy optimization. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 1283–1294.
- Refined regret for adversarial MDPs with linear function approximation. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 6726–6759.
- Provably efficient RL with rich observations via latent state decoding. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1665–1674.
- Online Markov decision processes. Mathematics of Operations Research, pages 726–736.
- Improved optimistic algorithms for logistic bandits. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 3052–3060.
- Dynamic regret of policy optimization in non-stationary environments. In Advances in Neural Information Processing Systems 33 (NeurIPS), pages 6743–6754.
- Near-optimal policy optimization algorithms for learning adversarial linear mixture MDPs. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4259–4280.
- Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, pages 1563–1600.
- Horizon-free reinforcement learning in adversarial linear mixture MDPs. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
- Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1704–1713.
- Is Q-learning provably efficient? In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 4868–4878.
- Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 4860–4869.
- Provably efficient reinforcement learning with linear function approximation. In Proceedings of the 33rd Conference on Learning Theory (COLT), pages 2137–2143.
- Improved regret bounds for linear adversarial MDPs via linear optimization. Transactions on Machine Learning Research.
- Bandit Algorithms. Cambridge University Press.
- Dynamic regret of adversarial linear mixture MDPs. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 60685–60711.
- Dynamic regret of adversarial MDPs with unknown transition and linear function approximation. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), to appear.
- Towards optimal regret in adversarial linear MDPs with bandit feedback. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
- Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses. In Advances in Neural Information Processing Systems 34 (NeurIPS), pages 22931–22942.
- Playing Atari with deep reinforcement learning. ArXiv preprint, 1312.5602.
- Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems 28 (NIPS), pages 3168–3176.
- Online learning in MDPs with linear function approximation and bandit feedback. In Advances in Neural Information Processing Systems 34 (NeurIPS), pages 10407–10417.
- OpenAI (2023). GPT-4 technical report. ArXiv preprint, 2303.08774.
- Orabona, F. (2019). A modern introduction to online learning. ArXiv preprint, 1912.13213.
- Dynamic pricing and assortment under a contextual MNL demand. In Advances in Neural Information Processing Systems 35 (NeurIPS), pages 3461–3474.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
- Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5478–5486.
- Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 2209–2218.
- Proximal policy optimization algorithms. ArXiv preprint, 1707.06347.
- Optimistic policy optimization with bandit feedback. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 8604–8613.
- Rate-optimal policy optimization for linear Markov decision processes. ArXiv preprint, 2308.14642.
- Improved regret for efficient online reinforcement learning with linear function approximation. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 31117–31150.
- More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 1263–1291.
- Sample-optimal parametric Q-learning using linearly additive features. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 6995–7004.
- Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, pages 737–757.
- Learning adversarial linear mixture Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
- Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. In The 36th Annual Conference on Learning Theory (COLT), pages 4977–5020.
- Dynamic regret of online Markov decision processes. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26865–26894.
- Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. ArXiv preprint, 2112.14368.
- A theoretical analysis of optimistic proximal policy optimization in linear Markov decision processes. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 73666–73690.
- Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 4532–4576.
- Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26 (NIPS), pages 1583–1591.
Authors: Long-Fei Li, Peng Zhao, Zhi-Hua Zhou