Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control (2405.16158v3)
Abstract: Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
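The abstract's recipe — a scaled-up critic kept stable by strong regularization, paired with an optimistic value estimate for exploration — can be sketched in code. The exact BRO architecture and hyperparameters are defined in the paper, not here; the layer-normalized residual critic, the block width/depth, and the `beta` optimism coefficient below are illustrative assumptions, written in plain NumPy rather than the authors' implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize activations across the feature dimension (the kind of
    # regularization the paper credits with making critic scaling work).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dense(x, w, b):
    return x @ w + b

def residual_block(x, params):
    # Dense -> LayerNorm -> ReLU with a skip connection; one assumed
    # building block of a "bigger, regularized" critic.
    h = layer_norm(dense(x, *params))
    return x + np.maximum(h, 0.0)

def critic_forward(obs_action, params):
    # A stack of regularized residual blocks ending in a scalar Q-value head.
    x = layer_norm(dense(obs_action, *params["embed"]))
    for blk in params["blocks"]:
        x = residual_block(x, blk)
    return dense(x, *params["head"])

def init_params(in_dim, width=512, depth=2, rng=None):
    # Width and depth are placeholders; the paper's scaling study picks these.
    rng = rng or np.random.default_rng(0)
    def lin(n_in, n_out):
        return (rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)),
                np.zeros(n_out))
    return {
        "embed": lin(in_dim, width),
        "blocks": [lin(width, width) for _ in range(depth)],
        "head": lin(width, 1),
    }

def optimistic_value(qs, beta=0.5):
    # Upper-confidence-style estimate over an ensemble of critic outputs:
    # mean plus beta times the ensemble spread, a common way to implement
    # optimistic exploration.
    qs = np.stack(qs)
    return qs.mean(axis=0) + beta * qs.std(axis=0)
```

Acting greedily with respect to `optimistic_value` rather than the mean Q-value biases the policy toward state-action regions the critics disagree on, which is the exploration side of the abstract's "regularized + optimistic" pairing.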
- Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320, 2021.
- What matters in on-policy reinforcement learning? A large-scale empirical study. In Ninth International Conference on Learning Representations (ICLR), 2021.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- A distributional perspective on reinforcement learning. In International conference on machine learning, pp. 449–458. PMLR, 2017.
- CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. In The Twelfth International Conference on Learning Representations, 2023.
- Towards deeper deep reinforcement learning with spectral normalization. Advances in neural information processing systems, 34:8242–8255, 2021.
- OpenAI Gym, 2016.
- MyoSuite: A contact-rich simulation suite for musculoskeletal motor control. arXiv preprint arXiv:2205.13600, 2022.
- Learning pessimism for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 6971–6979, 2023.
- UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
- Expected policy gradients for reinforcement learning. Journal of Machine Learning Research, 21, 2020.
- Better exploration with optimistic actor critic. Advances in Neural Information Processing Systems, 32, 2019.
- Implicit quantile networks for distributional reinforcement learning. In International conference on machine learning, pp. 1096–1105. PMLR, 2018.
- BERT: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics, 2019. doi: 10.18653/v1/N19-1423.
- Sample-efficient reinforcement learning by breaking the replay ratio barrier. In The Eleventh International Conference on Learning Representations, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2020.
- PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950, 2024.
- How to discount deep reinforcement learning: Towards new dynamic strategies. arXiv preprint arXiv:1512.02011, 2015.
- Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
- Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
- Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
- TD-MPC2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023.
- Dropout Q-functions for doubly efficient reinforcement learning. In International Conference on Learning Representations, 2021.
- Improving regression performance with distributional losses. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2157–2166. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/imani18a.html.
- Seizing serendipity: Exploiting the value of past success in off-policy actor-critic. arXiv preprint arXiv:2306.02865, 2023.
- Bias-variance error bounds for temporal difference updates. In Annual Conference on Computational Learning Theory, 2000. URL https://api.semanticscholar.org/CorpusID:5053575.
- Kostrikov, I. JAXRL: Implementations of Reinforcement Learning algorithms in JAX, October 2021. URL https://github.com/ikostrikov/jaxrl.
- Offline q-learning on diverse multi-task data both scales and generalizes. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4-k7kUavAj.
- Efficient deep reinforcement learning requires regulating overfitting. In The Eleventh International Conference on Learning Representations, 2022.
- Decoupled weight decay regularization. International Conference on Learning Representations, 2017.
- Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762, 2024.
- Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
- Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Tactical optimism and pessimism for deep reinforcement learning. Advances in Neural Information Processing Systems, 34:12849–12863, 2021.
- On the theory of risk-aware agents: Bridging actor-critic and economics. arXiv preprint arXiv:2310.19527, 2023.
- Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning. 2024.
- The primacy bias in deep reinforcement learning. In International conference on machine learning, pp. 16828–16847. PMLR, 2022.
- Small batch deep reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
- Mixtures of experts unlock parameter scaling for deep RL. arXiv preprint arXiv:2402.08609, 2024.
- Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023.
- Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html.
- A generalist dynamics model for control. arXiv preprint arXiv:2305.10912, 2023.
- Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations, 2020.
- Bigger, Better, Faster: Human-level Atari with human-level efficiency. In International Conference on Machine Learning, pp. 30365–30380. PMLR, 2023.
- Reinforcement learning: An introduction. MIT press, 2018.
- Shimmy: Gymnasium and PettingZoo Wrappers for Commonly Used Environments. URL https://github.com/Farama-Foundation/shimmy.
- Investigating multi-task pretraining and generalization in reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sSt9fROSZRO.
- EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning, 2019.
- DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
- A theoretical and empirical analysis of Expected SARSA. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177–184. IEEE, 2009.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- EfficientZero V2: Mastering discrete and continuous control with limited data. arXiv preprint arXiv:2403.00564, 2024.
- Optimism in reinforcement learning with generalized linear function approximation. In International Conference on Learning Representations, 2020.
- On layer normalization in the transformer architecture. International Conference on Machine Learning, 2020.
- Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094–1100. PMLR, 2020.
- RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pp. 2165–2183. PMLR, 2023.