Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints (2402.01111v1)
Abstract: We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints, a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination-based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A$ and $B$ are the numbers of actions for the two players respectively, $H$ is the horizon, and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound of $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with a $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL with near-optimal batch complexity. To the best of our knowledge, this is the first line of results towards understanding MARL with low adaptivity.
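To get a feel for how the stated bounds scale, the following minimal sketch evaluates the leading terms of the regret bound and of the batch-complexity upper and lower bounds. It is only an illustration: the function names are hypothetical, all hidden constants and polylogarithmic factors are dropped, and the natural logarithm is assumed for $\log\log K$, so the numbers indicate growth rates rather than actual guarantees.

```python
import math


def regret_term(H: int, S: int, A: int, B: int, K: int) -> float:
    """Leading term sqrt(H^3 S^2 A B K) of the regret bound (constants and logs dropped)."""
    return math.sqrt(H**3 * S**2 * A * B * K)


def batch_upper_term(H: int, K: int) -> float:
    """Leading term H + log log K of the batch-complexity upper bound.

    Assumption: natural logarithm; the abstract does not fix the base.
    """
    return H + math.log(math.log(K))


def batch_lower_term(H: int, A: int, K: int) -> float:
    """Leading term H / log_A(K) + log log K of the batch-complexity lower bound."""
    return H / math.log(K, A) + math.log(math.log(K))


if __name__ == "__main__":
    H, S, A, B = 10, 5, 4, 4  # illustrative problem sizes, not from the paper
    for K in (10**4, 10**6, 10**8):
        print(
            f"K={K:>9}: regret ~ {regret_term(H, S, A, B, K):.3e}, "
            f"batches between ~{batch_lower_term(H, A, K):.2f} "
            f"and ~{batch_upper_term(H, K):.2f}"
        )
```

Under these assumptions, the two batch terms differ only by the $\log_A K$ factor dividing $H$, which is the logarithmic gap mentioned in the abstract, and quadrupling $K$ doubles the regret term.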