Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints (2402.01111v1)

Published 2 Feb 2024 in cs.LG, cs.AI, cs.MA, and stat.ML

Abstract: We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity.
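The $O(H+\log\log K)$ batch complexity echoes a standard device from the batched-bandit literature: commit in advance to a sparse grid of episode counts and only switch policies at those grid points. The sketch below is an illustration of that device only, not the paper's algorithm; the function name, grid choice, and parameters are hypothetical. It builds the familiar doubly-exponential grid $t_i \approx K^{1-2^{-i}}$ and shows that roughly $\log_2\log_2 K$ batches suffice to cover the full horizon of $K$ episodes.

```python
import math

# Illustrative (hypothetical) sketch of a low-adaptivity batch schedule.
# Elimination-style algorithms keep the exploration policy fixed within a
# batch and only update it at a pre-committed grid of episode counts.
# A standard grid from the batched-bandit literature, t_i ~ K^(1 - 2^{-i}),
# covers K episodes with only about log2(log2 K) batches.

def batch_grid(K: int) -> list[int]:
    """Episode counts at which each batch ends (the last batch ends at K)."""
    M = max(1, math.ceil(math.log2(math.log2(K)))) + 1  # number of batches
    grid = [min(K, max(i, int(K ** (1.0 - 2.0 ** (-i))))) for i in range(1, M + 1)]
    grid[-1] = K  # force the final batch to finish the horizon
    return grid

if __name__ == "__main__":
    for K in (10**3, 10**6, 10**9):
        g = batch_grid(K)
        print(f"K = {K:>13,d}: {len(g)} batches, grid = {g}")
```

The point of the sketch is only that such a grid approaches $K$ after $\Theta(\log\log K)$ steps, which is the source of the $\log\log K$ term appearing in both the upper and lower bounds on batch complexity.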
