DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models (2403.00818v2)

Published 26 Feb 2024 in cs.CL and cs.LG

Abstract: LLMs face a daunting challenge due to the excessive computational and memory requirements of the commonly used Transformer architecture. While state space models (SSMs) are a newer type of foundational network architecture offering lower computational complexity, their performance has yet to fully rival that of Transformers. This paper introduces DenseSSM, a novel approach to enhance the flow of hidden information between layers in SSMs. By selectively integrating shallow-layer hidden states into deeper layers, DenseSSM retains fine-grained information crucial for the final output. Despite the added dense connections, DenseSSM still maintains training parallelizability and inference efficiency. The proposed method is widely applicable to various SSM types such as RetNet and Mamba. At a similar model size, DenseSSM achieves significant improvements, exemplified by DenseRetNet outperforming the original RetNet by up to 5% accuracy on public benchmarks. Code is available at https://github.com/WailordHe/DenseSSM


Summary

  • The paper presents DenseSSM, a method that adds dense hidden connections to state space models, improving performance while preserving their efficiency.
  • It demonstrates that integrating shallow-layer hidden states into deeper layers improves information flow while maintaining training parallelizability.
  • Empirical results show that DenseRetNet outperforms RetNet by up to 5% on public benchmarks, underlining its potential for scalable LLM design.

DenseMamba: State Space Models with Dense Hidden Connection for Efficient LLMs

The paper "DenseMamba: State Space Models with Dense Hidden Connection for Efficient LLMs" presents an innovative approach to addressing the computational and memory challenges faced by LLMs that traditionally rely on the Transformer architecture. By introducing DenseSSM, this work proposes a method to enhance the flow of hidden information between layers in State Space Models (SSMs), thereby improving model efficiency and performance.

Motivation and Context

Transformers have become the foundational architecture for LLMs due to their strong performance in tasks such as language comprehension, dialogue, and reasoning. However, their deployment is often encumbered by high computational and memory demands. SSMs have surfaced as an alternative, offering sequence modeling with lower computational complexity and a fixed-size recurrent state at inference time. Nonetheless, their performance has yet to fully rival that of Transformers, particularly in capturing hierarchical information across layers.
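To make the efficiency argument concrete, the snippet below steps a deliberately simplified, time-invariant linear state space recurrence token by token: the state is a fixed-size vector, so the per-token cost stays constant, unlike attention over a growing key/value cache. The matrices, dimensions, and function name here are illustrative assumptions, not the formulation used in the paper.

```python
import torch

def ssm_step(h, x_t, A, B, C):
    """One recurrent step of a simplified, time-invariant linear SSM.

    The per-token cost is constant because the state h has a fixed size,
    in contrast to attention, whose cache grows with sequence length.
    """
    h = A @ h + B @ x_t   # update the hidden state
    y_t = C @ h           # read out the output for this token
    return h, y_t

# Toy usage: stream a sequence of 8 tokens through the recurrence.
d_state, d_in, d_out = 16, 4, 4
A = torch.eye(d_state) * 0.9
B = torch.randn(d_state, d_in) * 0.1
C = torch.randn(d_out, d_state) * 0.1
h = torch.zeros(d_state)
for x_t in torch.randn(8, d_in):
    h, y_t = ssm_step(h, x_t, A, B, C)
```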

DenseSSM Framework

DenseSSM enhances traditional SSMs by introducing dense connections that selectively integrate shallow-layer hidden states into deeper layers. This approach retains fine-grained information that is pivotal for the model's output, while ensuring that training parallelizability and inference efficiency remain intact. DenseSSM demonstrates its adaptability across various SSM types, such as RetNet and Mamba, showing significant performance improvements; a minimal sketch of the fusion idea follows.
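The sketch below illustrates the dense hidden connection idea: hidden states from up to m preceding layers are projected, gated, and added to the current layer's hidden state. The module structure, gating form, and names (DenseHiddenFusion, proj, gate, m) are illustrative assumptions for exposition, not the paper's exact selective-transition implementation.

```python
import torch
import torch.nn as nn

class DenseHiddenFusion(nn.Module):
    """Sketch: fuse hidden states from the previous m layers into the
    current layer's hidden state (illustrative, not the paper's code)."""

    def __init__(self, d_hidden, m=4):
        super().__init__()
        self.m = m
        # One projection and one gate per retained shallow layer (assumed form).
        self.proj = nn.ModuleList([nn.Linear(d_hidden, d_hidden) for _ in range(m)])
        self.gate = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.Sigmoid()) for _ in range(m)]
        )

    def forward(self, h_current, shallow_states):
        # h_current: (batch, seq_len, d_hidden) hidden state of the current layer
        # shallow_states: hidden states of up to m preceding layers, same shape
        fused = h_current
        for i, h_prev in enumerate(shallow_states[-self.m:]):
            # Project the shallow state and gate how much of it to inject.
            fused = fused + self.gate[i](h_prev) * self.proj[i](h_prev)
        return fused

# Toy usage: fuse the current layer's hidden state with two earlier layers' states.
fusion = DenseHiddenFusion(d_hidden=64, m=2)
h_layers = [torch.randn(2, 10, 64) for _ in range(3)]  # hidden states from layers 0..2
h_fused = fusion(h_layers[-1], h_layers[:-1])
```

Fusing by gated addition rather than concatenation keeps the extra cost per layer small, which is consistent with the abstract's claim that the dense connections preserve training parallelizability and inference efficiency.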

Numerical Results

The paper reports substantial empirical improvements: DenseRetNet, the DenseSSM variant of RetNet, surpasses the original RetNet by up to 5% accuracy on public benchmarks at a similar model size. This is a noteworthy enhancement, indicating that dense connections facilitate a superior flow of information, which translates into more accurate predictions on diverse language tasks.

Implications and Future Directions

The implications of DenseSSM are twofold: practically, it provides a pathway to building more efficient LLMs that could be deployed with reduced resource requirements; theoretically, it pushes the boundaries of understanding how information propagation within neural networks can be optimized. This work opens avenues for further exploration into the interplay between architecture design and learning dynamics.

The research indicates potential future developments in AI:

  1. Exploration of SSMs: Continued exploration of SSMs in comparison to Transformer-based architectures could yield insights into efficient model design.
  2. Adaptation and Scalability: The application of DenseSSM principles to larger and more diverse datasets will test its scalability and robustness.
  3. Hardware Optimization: Future work could delve into further hardware optimizations in parallel with architectural improvements.

Conclusion

"DenseMamba: State Space Models with Dense Hidden Connection for Efficient LLMs" makes a significant contribution to the field by rethinking how hidden states are utilized within LLMs. Through meticulous design and empirical validation, the paper establishes DenseSSM as a promising methodology to enhance both the efficiency and performance of state space modeling in large language contexts.
