Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition (2403.04568v1)

Published 7 Mar 2024 in cs.LG and stat.ML

Abstract: We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of the feature mappings, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least squares estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration inequality, originally proposed in the dynamic assortment area and applied for the first time in reinforcement learning, tailored to handle non-independent noise and the correlations between different states.
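As an illustration of contribution (i), assume the standard linear mixture parameterization $\mathbb{P}(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$ with a known feature map $\phi$ and an unknown parameter $\theta^* \in \mathbb{R}^d$. One plausible form of a least-squares estimator that uses the visit information of all states, sketched here from the abstract's description rather than taken from the paper's exact construction, is the ridge regression

$$\widehat{\theta}_k \;=\; \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d}\; \lambda \lVert \theta \rVert_2^2 \;+\; \sum_{j=1}^{k-1} \sum_{h=1}^{H} \sum_{s' \in \mathcal{S}} \Bigl( \bigl\langle \phi(s' \mid s_h^j, a_h^j),\, \theta \bigr\rangle - \mathbf{1}\{ s_{h+1}^j = s' \} \Bigr)^2,$$

where the inner sum over $s' \in \mathcal{S}$ is what distinguishes an all-states estimator from prior work that regresses on a single state per transition; the indicator targets for different $s'$ within one transition are not independent, which is why the self-normalized concentration for non-independent noise in contribution (ii) is needed. The claimed improvement over Zhao et al. (2023a) follows from the layered structure: since $H \leq S$, we have $d\sqrt{HS^3K} = dS\sqrt{HSK} \leq dS\sqrt{S \cdot SK} = dS^2\sqrt{K}$, so the new bound is never worse than the previous $\widetilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$.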

References (46)
  1. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24 (NIPS), pages 2312–2320.
  2. Model-based reinforcement learning with value-targeted regression. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 463–474.
  3. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 263–272.
  4. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 19–26.
  5. Provably efficient exploration in policy optimization. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 1283–1294.
  6. Refined regret for adversarial MDPs with linear function approximation. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 6726–6759.
  7. Provably efficient RL with rich observations via latent state decoding. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1665–1674.
  8. Online Markov decision processes. Mathematics of Operations Research, pages 726–736.
  9. Improved optimistic algorithms for logistic bandits. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 3052–3060.
  10. Dynamic regret of policy optimization in non-stationary environments. In Advances in Neural Information Processing Systems 33 (NeurIPS), pages 6743–6754.
  11. Near-optimal policy optimization algorithms for learning adversarial linear mixture MDPs. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4259–4280.
  12. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, pages 1563–1600.
  13. Horizon-free reinforcement learning in adversarial linear mixture MDPs. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
  14. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1704–1713.
  15. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 4868–4878.
  16. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 4860–4869.
  17. Provably efficient reinforcement learning with linear function approximation. In Proceedings of the 33rd Conference on Learning Theory (COLT), pages 2137–2143.
  18. Improved regret bounds for linear adversarial MDPs via linear optimization. Transactions on Machine Learning Research.
  19. Bandit Algorithms. Cambridge University Press.
  20. Dynamic regret of adversarial linear mixture MDPs. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 60685–60711.
  21. Dynamic regret of adversarial MDPs with unknown transition and linear function approximation. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), to appear.
  22. Towards optimal regret in adversarial linear MDPs with bandit feedback. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
  23. Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses. In Advances in Neural Information Processing Systems 34 (NeurIPS), pages 22931–22942.
  24. Playing Atari with deep reinforcement learning. ArXiv preprint, 1312.5602.
  25. Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems 28 (NeurIPS), pages 3168–3176.
  26. Online learning in MDPs with linear function approximation and bandit feedback. In Advances in Neural Information Processing Systems 34 (NeurIPS), pages 10407–10417.
  27. OpenAI (2023). GPT-4 technical report. ArXiv preprint, 2303.08774.
  28. Orabona, F. (2019). A modern introduction to online learning. ArXiv preprint, 1912.13213.
  29. Dynamic pricing and assortment under a contextual MNL demand. In Advances in Neural Information Processing Systems 35 (NeurIPS), pages 3461–3474.
  30. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
  31. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5478–5486.
  32. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 2209–2218.
  33. Proximal policy optimization algorithms. ArXiv preprint, 1707.06347.
  34. Optimistic policy optimization with bandit feedback. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 8604–8613.
  35. Rate-optimal policy optimization for linear Markov decision processes. ArXiv preprint, 2308.14642.
  36. Improved regret for efficient online reinforcement learning with linear function approximation. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 31117–31150.
  37. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference on Learning Theory (COLT), pages 1263–1291.
  38. Sample-optimal parametric Q-learning using linearly additive features. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 6995–7004.
  39. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, pages 737–757.
  40. Learning adversarial linear mixture Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
  41. Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. In Proceedings of the 36th Conference on Learning Theory (COLT), pages 4977–5020.
  42. Dynamic regret of online Markov decision processes. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26865–26894.
  43. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. ArXiv preprint, 2112.14368.
  44. A theoretical analysis of optimistic proximal policy optimization in linear Markov decision processes. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 73666–73690.
  45. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 4532–4576.
  46. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26 (NIPS), pages 1583–1591.
Authors (3)
  1. Long-Fei Li (5 papers)
  2. Peng Zhao (162 papers)
  3. Zhi-Hua Zhou (127 papers)
Citations (3)
