
MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection (2403.19888v4)

Published 29 Mar 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space complexity in the input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data-independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, with efficient hardware-aware implementation, have shown promising potential for long sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while significantly reducing computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.

MambaMixer: Introducing Efficient Selectivity in State Space Models for Multidimensional Data

Introduction

Recent developments in State Space Models (SSMs) and their structured counterparts have ushered in a new era of sequence modeling, challenging the hegemony of attention-based architectures, notably Transformers. SSMs, by virtue of their linear time complexity, offer a promising avenue for efficient and scalable modeling of long sequences. The introduction of Selective State Space Models (S6), which incorporate data-dependent weights, has further enhanced their applicability, enabling selective focus on relevant context. Building on this advancement, MambaMixer emerges as a novel architecture that incorporates dual selection mechanisms across both tokens and channels, marking a significant stride in the evolution of SSMs. This paper, authored by Behrouz et al., elaborates on the MambaMixer block and demonstrates its application through Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) for tackling vision and time series forecasting tasks, respectively.
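To make "data-dependent weights" concrete, the selective (S6) update applied along each axis can be summarized as follows; this is a restatement of the standard Mamba/S6 formulation for orientation, not an equation quoted from the paper:

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t, \qquad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t \approx \Delta_t B_t,$$

where the step size $\Delta_t$ and the projections $B_t$ and $C_t$ are all computed from the current input $x_t$. Because these parameters vary with the input, the recurrence can selectively retain or discard context, which a fixed (data-independent) SSM cannot do.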

MambaMixer Architecture

The MambaMixer architecture introduces a Selective Token and Channel Mixer, designed to mix and fuse information across both tokens and channels in a data-dependent manner. Alongside this dual selection mechanism, a defining feature of MambaMixer, a weighted averaging mechanism gives every layer direct access to the inputs and outputs of earlier layers. This enhances information flow both between the selective mixers within a block and across layers, facilitating the construction of large-scale, stable networks.

The MambaMixer block sequentially applies a Selective Token Mixer and a Selective Channel Mixer, each built on bidirectional S6 blocks. Direct access to earlier features through the weighted averaging mechanism allows MambaMixer-based models to benefit from large numbers of layers while remaining stable during training.
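To make the block structure concrete, here is a minimal, hedged sketch in PyTorch. It is not the authors' implementation: the true S6 selective scan (with its hardware-aware kernel and discretized A, B, C, Delta parameters) is replaced by a toy data-dependent gated recurrence, and the class names, dimensions, and softmax-normalized mixing weights are illustrative assumptions. The sketch only shows the overall wiring: a token mixer, a channel mixer applied to the transposed tensor, and a learnable weighted average over all earlier layer outputs.

```python
# Minimal structural sketch of a MambaMixer stack, assuming PyTorch.
# NOTE: this is NOT the authors' implementation. The real S6 selective scan
# is replaced below by a toy data-dependent gated recurrence; names and
# sizes are illustrative only.
import torch
import torch.nn as nn


class SelectiveMixerSketch(nn.Module):
    """Stand-in for a bidirectional S6 block: an input-conditioned gated
    recurrence scanned forward and backward along the sequence axis."""

    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate_proj = nn.Linear(dim, dim)   # data-dependent "selection" gate
        self.out_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def scan(self, x):
        # x: (batch, length, dim); gated cumulative state as a proxy for S6
        gate = torch.sigmoid(self.gate_proj(x))
        h, outs = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):
            h = gate[:, t] * h + (1 - gate[:, t]) * self.in_proj(x[:, t])
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x):
        # bidirectional: forward scan + reversed scan, then residual + norm
        y = self.scan(x) + torch.flip(self.scan(torch.flip(x, dims=[1])), dims=[1])
        return self.norm(x + self.out_proj(y))


class MambaMixerBlock(nn.Module):
    """Selective Token Mixer followed by a Selective Channel Mixer applied
    to the transposed tensor, so both axes are mixed data-dependently."""

    def __init__(self, num_tokens, dim):
        super().__init__()
        self.token_mixer = SelectiveMixerSketch(dim)            # scan over tokens
        self.channel_mixer = SelectiveMixerSketch(num_tokens)   # scan over channels

    def forward(self, x):
        # x: (batch, tokens, channels)
        x = self.token_mixer(x)
        x = self.channel_mixer(x.transpose(1, 2)).transpose(1, 2)
        return x


class MambaMixer(nn.Module):
    """Stack of blocks; each block consumes a learnable weighted average of
    the raw input and all earlier block outputs (direct access to early features)."""

    def __init__(self, depth, num_tokens, dim):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MambaMixerBlock(num_tokens, dim) for _ in range(depth)]
        )
        # layer i averages over i+1 earlier feature maps (input + i outputs)
        self.mix_weights = nn.ParameterList(
            [nn.Parameter(torch.ones(i + 1)) for i in range(depth)]
        )

    def forward(self, x):
        history = [x]
        for block, w in zip(self.blocks, self.mix_weights):
            alpha = torch.softmax(w, dim=0)
            mixed = sum(a * h for a, h in zip(alpha, history))
            history.append(block(mixed))
        return history[-1]


if __name__ == "__main__":
    model = MambaMixer(depth=4, num_tokens=16, dim=32)
    out = model(torch.randn(2, 16, 32))   # (batch, tokens, channels)
    print(out.shape)                      # torch.Size([2, 16, 32])
```

The sequential scan loop here is purely didactic; the efficiency claims of the paper rest on the parallel, hardware-aware selective-scan kernel of S6, which this sketch does not reproduce.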

Application to Vision and Time Series Forecasting

The MambaMixer block underpins two distinct architectures: Vision MambaMixer (ViM2) for vision tasks and Time Series MambaMixer (TSM2) for time series forecasting. ViM2 leverages the MambaMixer block to perform selective mixing across tokens and channels of image data, outperforming existing SSM-based vision models and achieving competitive performance with established vision models such as ViT and MLP-Mixer. TSM2, in turn, surpasses state-of-the-art methods in time series forecasting while being significantly more computationally efficient.
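As a hedged illustration of how the same block serves both domains, the snippet below reuses the toy `MambaMixer` stack from the previous sketch. In the ViM2-style path, image patches play the role of tokens and embedding features the role of channels; in the TSM2-style path, time steps are the tokens and the variates of the multivariate series are the channels, so the channel mixer captures cross-variate dependencies. Patch size, embedding width, and the prediction heads are illustrative assumptions, not the paper's configuration.

```python
# Hedged illustration of feeding both modalities to the toy MambaMixer stack
# defined in the previous sketch; widths and heads are illustrative only.
import torch
import torch.nn as nn


def patchify(images, patch=16):
    """(B, C, H, W) -> (B, num_patches, C * patch * patch) token sequence."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)


# ViM2-style use: image patches are the tokens, embedding features the channels.
images = torch.randn(2, 3, 224, 224)
tokens = patchify(images)                                    # (2, 196, 768)
embed = nn.Linear(tokens.size(-1), 32)
vim2_backbone = MambaMixer(depth=4, num_tokens=196, dim=32)
cls_head = nn.Linear(32, 1000)
logits = cls_head(vim2_backbone(embed(tokens)).mean(dim=1))  # (2, 1000)

# TSM2-style use: time steps are the tokens and the variates of the
# multivariate series are the channels, so the channel mixer models
# cross-variate dependencies directly.
series = torch.randn(2, 96, 32)                  # (batch, look-back, variates)
tsm2_backbone = MambaMixer(depth=4, num_tokens=96, dim=32)
forecast_head = nn.Linear(96, 24)                # map 96 past steps to 24 future
forecast = forecast_head(
    tsm2_backbone(series).transpose(1, 2)        # (2, variates, steps)
).transpose(1, 2)                                # (2, 24, 32) future steps x variates
```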

Evaluation and Results

ViM2 and TSM2 were evaluated across various vision and time series forecasting tasks, respectively. ViM2 achieved noteworthy performance in ImageNet classification, object detection, and semantic segmentation, surpassing several well-established models. TSM2 delivered state-of-the-art results across a range of time series forecasting benchmarks.

Implications and Future Directions

The introduction of MambaMixer represents a significant advancement in the field of SSMs, offering a versatile architecture that can be adapted to various domains and tasks. The dual selection mechanism allows for efficient and effective selection and mixing of information across both tokens and channels, a capability that proves particularly beneficial for multi-dimensional data like images and multivariate time series. The performance of ViM2 and TSM2 illustrates the potential of MambaMixer-based models to challenge existing paradigms and set new standards for future developments in AI and deep learning.

Looking ahead, the MambaMixer architecture opens new avenues for exploring the possibilities of selective mixing in other domains, potentially leading to further innovations in AI models that are both efficient and effective. Beyond immediate practical applications, the principles underlying MambaMixer may inspire novel approaches to modeling complex data structures, further enriching the landscape of deep learning research.

Authors (3)
  1. Ali Behrouz
  2. Michele Santacatterina
  3. Ramin Zabih