
BlackMamba: Mixture of Experts for State-Space Models (2402.01771v1)

Published 1 Feb 2024 in cs.CL, cs.AI, cs.DC, and cs.LG

Abstract: State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM model, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-expert (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines both of the benefits of SSM and MoE architectures, combining linear-complexity generation from SSM with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code open-source. Inference code at: https://github.com/Zyphra/BlackMamba

Introduction to BlackMamba

State-space models (SSMs) and mixture-of-experts (MoE) models each represent an innovative advance in language modeling, addressing different limitations of traditional transformer architectures. The novel contribution of our work lies in the successful hybridization of these two architectures: BlackMamba leverages the linear time and memory complexity of SSMs together with the compute and latency efficiencies of MoE models. This synergy yields an LLM that is competitive with existing models on language modeling benchmarks while outperforming them in cost-efficiency.
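To make the MoE trade-off concrete, the following is a back-of-the-envelope sketch. The expert count and parameter sizes are assumed for illustration only, not figures from the paper; the point is simply why top-1 routing lowers per-token compute while the memory footprint still scales with total parameters.

```python
# Illustrative arithmetic only: the expert count and parameter sizes below are
# assumptions, not values reported in the paper. With top-1 routing, each token
# passes through a single expert MLP, so per-token compute tracks the "active"
# parameters while the memory footprint tracks the total parameters.
n_experts = 8                    # assumed number of experts per MoE layer
expert_params = 50_000_000       # assumed parameters in each expert MLP
shared_params = 200_000_000      # assumed non-expert (shared) parameters

total_params = shared_params + n_experts * expert_params   # memory footprint
active_params = shared_params + 1 * expert_params          # per-token compute (top-1)

print(f"total:  {total_params / 1e9:.2f}B parameters in memory")
print(f"active: {active_params / 1e6:.0f}M parameters per forward pass")
```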

Distinctive Architecture and Implementation

BlackMamba's architecture alternates Mamba blocks, which replace the attention mechanism common in transformers, with MoE blocks. This interleaving preserves the benefits inherent to each component and puts them to full use within BlackMamba. Notable design decisions include using the SwiGLU activation function for the expert MLPs and engaging only a sparse subset of the model's total parameters on any given forward pass, improving compute efficiency. The 340M/1.5B and 630M/2.8B BlackMamba models were fully trained on 300 billion tokens of a custom dataset and then open-sourced; a minimal sketch of the layer pattern follows below.
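The PyTorch-style sketch below is illustrative only and is not the authors' implementation (which is released at https://github.com/Zyphra/BlackMamba). All class and parameter names are hypothetical, the Mamba sequence mixer is passed in as an opaque module, and the router uses simple top-1 selection to show how a sparse SwiGLU MoE MLP alternates with a sequence-mixing block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert MLP using the SwiGLU activation (gate, up, and down projections)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Top1MoE(nn.Module):
    """Sparsely routed MLP: each token is sent to exactly one expert,
    so only a small subset of the layer's parameters is active per token."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(d_model, d_hidden) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = self.router(x).softmax(dim=-1)           # (batch, seq, n_experts)
        weight, expert_idx = probs.max(dim=-1)           # top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                       # tokens routed to expert i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

class BlackMambaStyleLayer(nn.Module):
    """One layer of the sketch: a Mamba-style sequence mixer (stubbed here)
    alternated with a sparse MoE MLP, each behind a residual connection."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mixer = mixer        # in the real model this would be a Mamba SSM block
        self.moe = Top1MoE(d_model, d_hidden, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))   # sequence mixing (attention replacement)
        x = x + self.moe(self.norm2(x))     # sparse expert MLP
        return x

# Usage with a placeholder mixer (identity) just to show shapes flowing through.
layer = BlackMambaStyleLayer(d_model=64, d_hidden=256, n_experts=8, mixer=nn.Identity())
tokens = torch.randn(2, 16, 64)             # (batch, sequence, d_model)
print(layer(tokens).shape)                   # torch.Size([2, 16, 64])
```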

Comprehensive Results and Performance

The results showcased by BlackMamba are striking. Using significantly fewer training FLOPs, BlackMamba achieves performance comparable to dense transformer models on a range of downstream tasks. In inference speed, the model demonstrates a clear advantage not just over transformer models, but also over Mamba and transformer-MoE models. More compelling still, BlackMamba's generation latency remains constant as a function of sequence length, whereas a transformer's per-token attention cost grows with the length of the attended context (quadratic over the full sequence); a rough sketch of this scaling difference follows below. These results position BlackMamba as an exceptionally efficient model for both inference and training compared to its predecessors.
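As an illustration of that scaling behaviour (a sketch with assumed dimensions, not measurements from the paper), compare the per-token decode cost of attention over a growing KV cache with the fixed-size state update of an SSM:

```python
# Illustrative scaling comparison only; the dimensions are assumptions, not values
# from the paper. A transformer decoding step must attend over a KV cache whose
# length equals the current position, so per-token cost grows with sequence length
# (quadratic over the whole sequence). An SSM decoding step updates a fixed-size
# state, so per-token cost is constant and total cost is linear.
def attention_step_flops(d_model: int, position: int) -> int:
    return 2 * d_model * position        # score/read the whole KV cache

def ssm_step_flops(d_model: int, d_state: int) -> int:
    return 2 * d_model * d_state         # update a fixed-size recurrent state

d_model, d_state = 1024, 16
for t in (1_000, 10_000, 100_000):
    print(f"position {t:>7}: attention {attention_step_flops(d_model, t):>13,} "
          f"vs SSM {ssm_step_flops(d_model, d_state):,}")
```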

Further Discussion and Implications

The implications of the BlackMamba architecture extend far beyond performance metrics alone. The combination of SSMs with MoE in our model underscores a potential paradigm shift in how various architectural components can be modularly combined for efficient AI model design. While still preliminary, our exploration opens numerous avenues for future research, such as optimizing hyperparameters, exploring fine-tuning approaches, and investigating the composite effect on the model’s learned representations and behaviors. The open-sourced nature of BlackMamba provides a valuable asset for the broader AI community to enhance the collective understanding and development of this pioneering architecture.

In conclusion, BlackMamba represents a significant step forward in the evolution of LLMs, offering an architecture that achieves remarkable efficiency without compromising quality or performance. Its linear complexity and fast inference pave the way for LLMs that can process longer sequences more rapidly, marking an exciting juncture in the landscape of AI-driven language processing.

Authors (4)
  1. Quentin Anthony
  2. Yury Tokpanov
  3. Paolo Glorioso
  4. Beren Millidge