VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting (2403.16536v3)

Published 25 Mar 2024 in cs.CV

Abstract: Combining CNNs or ViTs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge: CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. Recent Mamba-based architectures have been met with enthusiasm for their exceptional long-sequence modeling capabilities, surpassing established vision models in efficiency and accuracy, which motivates us to develop an innovative architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks effectively. Our extensive evaluations show that the proposed approach secures competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyujintang/VMRNN-PyTorch.
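To make the recurrent structure concrete, here is a minimal, hypothetical sketch of an LSTM-style cell in which the spatial feature extractor is a stand-in for the paper's Vision Mamba block. This is not the authors' implementation (see their repository for that): the class name `ToyVMRNNCell`, the method `spatial_mix`, the scalar weights, and the shared gate pre-activation are all illustrative simplifications of the general "spatial mixing + LSTM gating" idea.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ToyVMRNNCell:
    """Toy scalar sketch of an LSTM-style recurrent cell.

    In the actual VMRNN, the spatial mixing step would be a Vision Mamba
    block operating on patch embeddings of each frame; here it is a
    stand-in linear map. All weights are hypothetical placeholders.
    """
    def __init__(self, w_x=0.5, w_h=0.5, b=0.0):
        self.w_x, self.w_h, self.b = w_x, w_h, b

    def spatial_mix(self, x, h):
        # Stand-in for the Vision Mamba block's long-range spatial mixing.
        return self.w_x * x + self.w_h * h + self.b

    def step(self, x, h, c):
        z = self.spatial_mix(x, h)
        i = sigmoid(z)           # input gate
        f = sigmoid(z)           # forget gate (shared pre-activation for brevity)
        o = sigmoid(z)           # output gate
        g = math.tanh(z)         # candidate cell state
        c_new = f * c + i * g    # standard LSTM cell-state update
        h_new = o * math.tanh(c_new)
        return h_new, c_new

# Roll the cell over a toy "frame" sequence, as a real VMRNN would
# over embedded video frames.
cell = ToyVMRNNCell()
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:
    h, c = cell.step(x, h, c)
print(h, c)
```

In the full model, stacks of such cells (with real VSS blocks and patch embedding/merging layers) process each incoming frame and emit the predicted next frame, which is what lets the architecture capture global spatial context at lower cost than attention.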

Authors (5)
  1. Yujin Tang
  2. Peijie Dong
  3. Zhenheng Tang
  4. Xiaowen Chu
  5. Junwei Liang