DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models (2403.00818v2)
Abstract: Large language models (LLMs) face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a newer class of foundation network architectures with lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach for enhancing the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information that is crucial for the final output. Despite the added dense connections, DenseSSM preserves training parallelizability and inference efficiency. The proposed method is broadly applicable to various SSM types such as RetNet and Mamba. At similar model sizes, DenseSSM achieves significant improvements, exemplified by DenseRetNet outperforming the original RetNet by up to 5% accuracy on public benchmarks. Code is available at https://github.com/WailordHe/DenseSSM
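The abstract describes the dense hidden connection only at a high level. Below is a minimal, hypothetical PyTorch sketch of the idea, not the authors' implementation: a toy elementwise linear recurrence stands in for a RetNet/Mamba block, and all names (`DenseFusion`, `ToySSMLayer`, `DenseSSMStack`, `num_dense`) are illustrative assumptions rather than identifiers from the DenseSSM codebase. The fusion point is also simplified, with shallow-layer outputs gated into each layer's output instead of reproducing the paper's exact hidden-state fusion.

```python
# Minimal sketch of a "dense hidden connection" over stacked SSM-style layers.
# NOT the authors' implementation; the SSM block is a toy linear recurrence
# standing in for RetNet/Mamba, and all class/parameter names are illustrative.
import torch
import torch.nn as nn


class DenseFusion(nn.Module):
    """Project hidden states from earlier layers and gate them into the
    current layer's hidden state (the "selective integration" in the abstract)."""

    def __init__(self, dim: int, num_dense: int):
        super().__init__()
        self.proj = nn.Linear(dim * num_dense, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, h, shallow):
        # shallow: list of num_dense tensors, each of shape (batch, seq, dim)
        fused = self.proj(torch.cat(shallow, dim=-1))
        return h + torch.sigmoid(self.gate(h)) * fused


class ToySSMLayer(nn.Module):
    """Stand-in SSM block: elementwise linear recurrence with a learned decay."""

    def __init__(self, dim: int):
        super().__init__()
        self.decay = nn.Parameter(torch.full((dim,), 0.9))
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq, dim); an explicit sequential scan, kept simple for clarity.
        batch, seq_len, dim = x.shape
        u = self.in_proj(x)
        state = x.new_zeros(batch, dim)
        states = []
        for t in range(seq_len):
            state = self.decay * state + u[:, t]
            states.append(state)
        return self.out_proj(torch.stack(states, dim=1))


class DenseSSMStack(nn.Module):
    """Stack of toy SSM layers where each layer also receives the hidden
    states of the previous `num_dense` layers through a gated projection."""

    def __init__(self, dim: int, depth: int, num_dense: int = 2):
        super().__init__()
        self.num_dense = num_dense
        self.layers = nn.ModuleList([ToySSMLayer(dim) for _ in range(depth)])
        self.fusions = nn.ModuleList([DenseFusion(dim, num_dense) for _ in range(depth)])

    def forward(self, x):
        history = [x] * self.num_dense  # pad so early layers also see num_dense inputs
        h = x
        for layer, fusion in zip(self.layers, self.fusions):
            h = fusion(layer(h), history[-self.num_dense:])
            history.append(h)
        return h


if __name__ == "__main__":
    model = DenseSSMStack(dim=64, depth=4)
    out = model(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

The sigmoid gate keeps the dense path selective: each layer can attenuate shallow features it does not need, which is the property the abstract highlights while still leaving the per-layer recurrence (and hence training parallelizability in a real kernel) untouched.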
- PIQA: Reasoning about physical commonsense in natural language, 2019.
- GPT-NeoX-20B: An open-source autoregressive language model, 2022.
- Language models are few-shot learners, 2020.
- PaLM: Scaling language modeling with pathways, 2022.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions, 2019.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.
- BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
- LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.
- Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
- Hungry hungry hippos: Towards language modeling with state space models, 2023.
- The Pile: An 800GB dataset of diverse text for language modeling, 2020.
- A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- HiPPO: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020.
- Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Transformer quality in linear time, 2022.
- Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708, 2017.
- Scaling laws for neural language models, 2020.
- Transformers are RNNs: Fast autoregressive transformers with linear attention, 2020.
- Lei, T. When attention meets fast recurrence: Training language models with reduced compute. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7633–7648, 2021.
- Few-shot learning with multilingual language models. CoRR, abs/2112.10668, 2021. URL https://arxiv.org/abs/2112.10668.
- Decoupled weight decay regularization, 2019.
- Long range language modeling via gated state spaces, 2022.
- Pointer sentinel mixture models, 2016.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.
- Crosslingual generalization through multitask finetuning, 2022.
- OpenAI. ChatGPT (Mar 14 version). https://chat.openai.com/chat, 2023.
- The LAMBADA dataset, August 2016.
- RWKV: Reinventing RNNs for the transformer era. In Findings of EMNLP 2023, 2023.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866, 2023.
- XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. URL https://ducdauge.github.io/files/xcopa.pdf.
- TransNormerLLM: A faster and better large language model with improved TransNormer, 2024.
- Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
- WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
- Simplified state space layers for sequence modeling, 2023.
- Retentive network: A successor to transformer for large language models, 2023.
- LLaMA: Open and efficient foundation language models, 2023.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
- Crowdsourcing multiple choice science questions. In NUT@EMNLP, 2017.
- Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- HellaSwag: Can a machine really finish your sentence?, 2019.
- GLM-130B: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=-Aw0rrrPUF.
- An attention free transformer, 2021.
- OPT: Open pre-trained transformer language models, 2022.