Mixtures of Experts Unlock Parameter Scaling for Deep RL
Abstract: The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
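The abstract names the architectural ingredient (Soft MoE modules inside value-based networks) but not the routing mechanics. As a point of reference, the sketch below is a minimal NumPy illustration of the soft routing described by Puigcerver et al. (2023); it is not the paper's own implementation, and the token source, the number of experts and slots, and the linear stand-in experts (`soft_moe_layer`, `phi`, `expert_weights`) are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_moe_layer(tokens, phi, expert_weights):
    """Forward pass of a Soft MoE layer (Puigcerver et al., 2023) -- illustrative sketch.

    tokens:         (n, d) array of n input tokens of width d (in an RL agent these
                    might come from the encoder's feature map; that is an assumption
                    here, not something the abstract specifies).
    phi:            (d, e * s) learnable slot parameters for e experts with s slots each.
    expert_weights: list of e (d, d) matrices; linear experts stand in for the small
                    per-expert networks a real implementation would use.
    """
    n, d = tokens.shape
    e = len(expert_weights)
    s = phi.shape[1] // e

    logits = tokens @ phi                      # (n, e*s) token-to-slot affinities
    dispatch = softmax(logits, axis=0)         # each slot: convex weights over tokens
    combine = softmax(logits, axis=1)          # each token: convex weights over slots

    slots = (dispatch.T @ tokens).reshape(e, s, d)        # every slot is a soft mix of tokens
    outputs = np.stack([slots[i] @ expert_weights[i]      # expert i processes its s slots
                        for i in range(e)]).reshape(e * s, d)

    return combine @ outputs                   # (n, d): per-token soft mix of slot outputs


# Hypothetical shapes, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
out = soft_moe_layer(
    tokens=rng.normal(size=(49, 64)),
    phi=rng.normal(size=(64, 8 * 1)),                        # 8 experts, 1 slot each
    expert_weights=[rng.normal(size=(64, 64)) for _ in range(8)],
)
assert out.shape == (49, 64)
```

Because every token contributes to every slot with nonzero weight, the layer remains fully differentiable and never drops tokens, a property Puigcerver et al. (2023) highlight as distinguishing Soft MoEs from hard-routed sparse MoEs.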
References

- A. Abbas and Y. Andreopoulos. Biased mixtures of experts: Enabling computer vision inference under data transfer limitations. IEEE Transactions on Image Processing, 29:7656–7667, 2020.
- An optimistic perspective on offline reinforcement learning. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 104–114. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/agarwal20c.html.
- Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320, 2021.
- Continuous action reinforcement learning from a mixture of interpretable experts. IEEE Trans. Pattern Anal. Mach. Intell., 44(10):6795–6806, Oct. 2022. ISSN 0162-8828. doi: 10.1109/TPAMI.2021.3103132. URL https://doi.org/10.1109/TPAMI.2021.3103132.
- Single-shot pruning for offline reinforcement learning. arXiv preprint arXiv:2112.15579, 2021.
- The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588:77 – 82, 2020.
- Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Dopamine: A Research Framework for Deep Reinforcement Learning. 2018. URL http://arxiv.org/abs/1812.06110.
- Revisiting rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In International Conference on Machine Learning, pages 1373–1383. PMLR, 2021.
- Small batch deep reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=wPqEvmwFEh.
- Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021.
- Mod-squad: Designing mixtures of experts as modular multi-task learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11828–11837, 2023.
- Continual backprop: Stochastic gradient descent with persistent randomness. arXiv preprint arXiv:2108.06325, 2021.
- Sample-efficient reinforcement learning by breaking the replay ratio barrier. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022. URL https://openreview.net/forum?id=4GBGwVIEYJ.
- Sample-efficient reinforcement learning by breaking the replay ratio barrier. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OpC-9aBBVJe.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
- Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
- Rigging the lottery: Making all tickets winners. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2943–2952. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/evci20a.html.
- M³ViT: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. Advances in Neural Information Processing Systems, 35:28441–28457, 2022.
- Proto-value networks: Scaling representation learning with auxiliary tasks. In The Eleventh International Conference on Learning Representations, 2022.
- Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.
- Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pages 3061–3071. PMLR, 2020.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
- K. Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Trans. Syst. Sci. Cybern., 5(4):322–333, 1969. doi: 10.1109/TSSC.1969.300225. URL https://doi.org/10.1109/TSSC.1969.300225.
- The state of sparsity in deep neural networks. CoRR, abs/1902.09574, 2019. URL http://arxiv.org/abs/1902.09574.
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. Proceedings of Machine Learning and Systems, 5, 2023.
- The state of sparse training in deep reinforcement learning. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7766–7792. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/graesser22a.html.
- An empirical study of implicit regularization in deep offline rl. arXiv preprint arXiv:2207.02099, 2022.
- Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. Advances in Neural Information Processing Systems, 34:29335–29347, 2021.
- Multi-task reinforcement learning with mixture of orthogonal experts. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=aZH1dM3GOX.
- Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
- Parameter-efficient transfer learning for NLP. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.
- Transient non-stationarity and generalisation in deep reinforcement learning. In International Conference on Learning Representations, 2020.
- Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
- Model based reinforcement learning for atari. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1xCPJHtDB.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=O9bnihsFfXU.
- Dr3: Value-based deep reinforcement learning requires explicit regularization. In International Conference on Learning Representations, 2021b.
- Offline q-learning on diverse multi-task data both scales and generalizes. In The Eleventh International Conference on Learning Representations, 2022.
- A neural dirichlet process mixture model for task-free continual learning. In International Conference on Learning Representations, 2019.
- Gshard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2020.
- Base layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, 2021. URL https://api.semanticscholar.org/CorpusID:232428341.
- Understanding and preventing capacity loss in reinforcement learning. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=ZkC8wKoLbQ7.
- Learning dynamics and generalization in deep reinforcement learning. In International Conference on Machine Learning, pages 14560–14581. PMLR, 2022b.
- Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Int. Res., 61(1):523–562, Jan. 2018. ISSN 1076-9757.
- Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
- Multimodal contrastive learning with limoe: the language-image mixture of experts. Advances in Neural Information Processing Systems, 35:9564–9576, 2022.
- The primacy bias in deep reinforcement learning. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16828–16847. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/nikishin22a.html.
- The difficulty of passive learning in deep reinforcement learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=nPHA8fGicZk.
- Using mixture of expert models to gain insights into semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 342–343, 2020.
- Scalable transfer learning with expert models. In International Conference on Learning Representations, 2020.
- From sparse to soft mixtures of experts, 2023.
- Probabilistic mixture-of-experts for efficient deep reinforcement learning. CoRR, abs/2104.09122, 2021. URL https://arxiv.org/abs/2104.09122.
- Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems, volume 34, pages 8583–8595, 2021. URL https://openreview.net/forum?id=FrIDgjDOH1u.
- Bigger, better, faster: Human-level Atari with human-level efficiency. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 30365–30380. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/schwarzer23a.html.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.
- Dynamic sparse training for deep reinforcement learning. In International Joint Conference on Artificial Intelligence, 2022.
- The dormant neuron phenomenon in deep reinforcement learning. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32145–32168. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/sokar23a.html.
- Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
- Investigating multi-task pretraining and generalization in reinforcement learning. In The Eleventh International Conference on Learning Representations, 2022.
- Rlx2: Training a sparse deep reinforcement learning model from scratch. In The Eleventh International Conference on Learning Representations, 2022.
- When to use parametric models in reinforcement learning? Advances in Neural Information Processing Systems, 32, 2019.
- Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Deep mixture of experts via shallow embedding. In Uncertainty in artificial intelligence, pages 552–562. PMLR, 2020.
- Condconv: Conditionally parameterized convolutions for efficient inference. Advances in neural information processing systems, 32, 2019.
- Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=GY6-6sTvGaf.
- H. Ye and D. Xu. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21828–21837, 2023.
- Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.
- St-moe: Designing stable and transferable sparse expert models, 2022.