
Jamba: A Hybrid Transformer-Mamba Language Model

(arXiv:2403.19887)
Published Mar 28, 2024 in cs.CL and cs.LG

Abstract

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.

Figure: A single Jamba block and its layer types, with l = 8 layers, a 1:7 attention-to-Mamba ratio, and MoE applied every 2 layers.

Overview

  • Jamba combines Transformer and Mamba layers with a mixture-of-experts (MoE) component for improved language model performance.

  • Designed for efficiency, Jamba operates within a single 80GB GPU, balancing model capacity, memory usage, and computational demands.

  • Demonstrates superior performance on benchmarks, especially in tasks requiring long context lengths, with greater throughput efficiency than leading models.

  • Introduces a scalable and computationally efficient architecture that sets a precedent for future large-scale language model development.

Jamba: Unveiling a Hybrid Transformer-Mamba Architecture with MoE for Enhanced Language Model Performance

Introduction to Jamba

The recently released Jamba model represents a significant stride in language model architecture: it interleaves Transformer and Mamba layers in a hybrid design and adds a mixture-of-experts (MoE) component to some of them. The design draws on the complementary strengths of Transformer attention and Mamba's state-space layers, increasing model capacity and performance while keeping memory usage and computation manageable. The implemented configuration is sized to fit within a single 80GB GPU, making large-scale language modeling with Jamba broadly accessible.

Model Architecture

The Jamba architecture combines Transformer layers, built around the attention mechanism, with Mamba layers, a class of state-space models known for handling long sequences efficiently. MoE layers applied to some of the MLPs further increase the model's capacity without a proportional increase in compute. Each 'Jamba block' contains a mix of Mamba and attention layers interspersed with these MoE layers. This structure offers flexibility in model design, allowing memory footprint, computational demands, and overall performance to be balanced against one another: the ratio of attention to Mamba layers is configurable, so the layout can be adjusted to specific resource and objective needs.
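
A minimal sketch of the configuration named in the figure caption above (a block of l = 8 layers, a 1:7 attention-to-Mamba ratio, and MoE replacing the MLP in every second layer). Which positions inside the block carry the attention layer and the MoE layers is an illustrative assumption here, not the authors' exact layout.

```python
# Sketch of one Jamba block's layer layout under the assumptions stated above.

def jamba_block_schedule(num_layers: int = 8,
                         attention_index: int = 4,   # assumed position of the single attention layer
                         moe_period: int = 2) -> list[str]:
    """Return the (sequence-mixer, MLP) composition of each layer in one block."""
    schedule = []
    for i in range(num_layers):
        mixer = "attention" if i == attention_index else "mamba"
        mlp = "moe" if i % moe_period == 1 else "dense-mlp"   # MoE every 2nd layer
        schedule.append(f"{mixer:9s} + {mlp}")
    return schedule

if __name__ == "__main__":
    for idx, layer in enumerate(jamba_block_schedule()):
        print(f"layer {idx}: {layer}")
```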

Performance Insights

Jamba's innovative architecture demonstrates superior performance on standard benchmarks, particularly excelling in tasks requiring long context lengths of up to 256K tokens. It showcases strong results across various evaluations, attaining comparable or superior performance relative to current leading models, such as Mixtral-8x7B and Llama-2 70B, while supporting significantly longer contexts. Furthermore, Jamba achieves this with a significantly smaller KV cache footprint and superior throughput efficiency, marking a substantial advancement in the practical application of large-scale language models.
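
A back-of-the-envelope sketch of why the small attention share shrinks the KV cache: only attention layers store keys and values, so if roughly 1 layer in 8 uses attention, the cache is about 8x smaller than in an all-attention stack of the same depth. The model dimensions below (32 layers, 8 KV heads of dimension 128, fp16) are illustrative assumptions, not Jamba's published configuration.

```python
# Per-sequence KV cache size: 2 tensors (K and V) per attention layer.
def kv_cache_bytes(seq_len: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

if __name__ == "__main__":
    seq_len = 256_000                               # the long-context regime discussed above
    n_layers, n_kv_heads, head_dim = 32, 8, 128     # hypothetical grouped-query-attention setup

    full = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim)
    hybrid = kv_cache_bytes(seq_len, n_layers // 8, n_kv_heads, head_dim)

    print(f"all-attention KV cache : {full / 2**30:.1f} GiB")    # ~31 GiB
    print(f"1-in-8 attention cache : {hybrid / 2**30:.1f} GiB")  # ~4 GiB
```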

Computational Efficiency

In addition to its strong benchmark performance, Jamba stands out for its computational efficiency. Its architecture supports much larger batch sizes and longer context lengths within a single-GPU environment, a critical consideration for real-world deployments. The advantage is most pronounced at long sequence lengths, where Jamba's throughput far surpasses that of comparable models, underscoring its practical benefits for long-context tasks.

Future Implications and Research Directions

The introduction of Jamba opens up new avenues for the development of efficient and powerful language models. Its hybrid architecture provides a template for balancing the computational and memory requirements of large models, a common challenge in the field. The successful integration of MoE layers into this setup further underscores the potential for such techniques to expand model capacity without proportionately increasing computational demands. As the first production-grade model of its kind, Jamba sets a precedent for future research and development in the realm of hybrid language models.
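
A toy illustration of the capacity-versus-compute point above: an MoE layer stores many expert MLPs but routes each token to only a few of them, so total parameters grow with the number of experts while active parameters per token stay roughly fixed. All sizes below (hidden size, expert width, 16 experts, top-2 routing) are made-up round numbers, not Jamba's actual parameter counts.

```python
# Total vs. per-token ("active") parameters of a single MoE layer, under the
# simplification that each expert MLP is one up-projection and one down-projection.
def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    expert_params = 2 * d_model * d_ff      # parameters of one expert MLP
    total = n_experts * expert_params       # parameters stored in the layer
    active = top_k * expert_params          # parameters a single token actually uses
    return total, active

if __name__ == "__main__":
    total, active = moe_layer_params(d_model=4096, d_ff=14336, n_experts=16, top_k=2)
    print(f"total MoE-layer params : {total / 1e9:.2f} B")   # ~1.88 B
    print(f"active per token       : {active / 1e9:.2f} B")  # ~0.23 B
```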

Concluding Remarks

Jamba represents a significant advancement in language modeling, effectively harnessing the strengths of Transformer and Mamba architectures alongside MoE components. This hybrid model not only achieves state-of-the-art performance across a broad range of benchmarks but does so with remarkable efficiency and adaptability. The release of Jamba under a permissive license encourages further exploration and optimization by the research community, potentially spurring the next wave of innovations in language model development.
