Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity (2310.01616v2)

Published 2 Oct 2023 in cs.LG and cs.AI

Abstract: We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries $n$ to the environment that is polynomial in the dimension $d$ of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in $K$ batches, with feedback being processed and queries updated after each batch. This model encompasses the whole adaptivity spectrum, ranging from non-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under $d$-dimensional linear function approximation, we establish $\Omega(\log \log d)$ lower bounds on the number of batches $K$ required for sample-efficient algorithms with $n = O(\mathrm{poly}(d))$ queries. Our results show that merely having adaptivity ($K>1$) does not by itself guarantee sample-efficiency. Notably, the adaptivity boundary for sample-efficiency does not lie between offline reinforcement learning ($K=1$), where sample-efficiency is known to be impossible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.
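The batch interaction model described in the abstract can be made concrete with a short sketch. The following Python loop is an illustrative assumption rather than code from the paper: env.query, initial_strategy, and update_strategy are hypothetical placeholder names. It shows how $n$ queries split into $K$ batches restrict adaptivity: all queries within a batch are chosen before any of that batch's feedback is seen, and the querying strategy may be revised only at batch boundaries. Setting $K=1$ recovers the non-adaptive 'offline' setting, while $K=n$ is fully adaptive.

def run_multi_batch(env, n, K, initial_strategy, update_strategy):
    # Hypothetical K-batch interaction loop; `env.query`, `initial_strategy`,
    # and `update_strategy` are placeholder names used only for illustration.
    strategy = initial_strategy
    feedback = []
    batch_size = n // K
    for _ in range(K):
        # Queries in this batch depend only on feedback from earlier batches
        # (no within-batch adaptivity).
        batch_queries = [strategy(feedback) for _ in range(batch_size)]
        batch_feedback = [env.query(q) for q in batch_queries]
        feedback.extend(batch_feedback)
        # The querying strategy is updated only here, at a batch boundary.
        strategy = update_strategy(strategy, feedback)
    return feedback

The paper's lower bounds concern which pairs $(n, K)$ admit sample-efficient algorithms within this protocol, not any particular implementation of the strategy-update rule.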
