SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces (2403.07711v4)

Published 12 Mar 2024 in cs.CV and cs.AI

Abstract: Given the remarkable achievements of diffusion models in image generation, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly used attention layers to extract temporal features. However, the computational cost of attention layers grows quadratically with sequence length, which poses a significant challenge when generating longer video sequences with diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs) as temporal feature extractors. SSMs (e.g., Mamba) have recently gained attention as promising alternatives because their memory consumption grows linearly with sequence length. In line with previous research suggesting that bidirectional SSMs are effective for understanding spatial features in image generation, we found that bidirectionality is also beneficial for capturing temporal features in video data, outperforming traditional unidirectional SSMs. We conducted comprehensive evaluations on multiple long-term video datasets, such as MineRL Navigate, across various model sizes. For sequences of up to 256 frames, SSM-based models require less memory than attention-based models to achieve the same FVD, and they often deliver better performance at comparable GPU memory usage. Our code is available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
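
The abstract's central architectural idea, replacing quadratic temporal attention with a linear-time bidirectional SSM scan over the frame axis, can be sketched compactly. The snippet below is a minimal illustration only, not the authors' implementation (that lives in the linked repository) and not real Mamba/S4 code: it uses a simplified diagonal linear recurrence, and every class and parameter name here (DiagonalSSM, BidirectionalSSMBlock, state, etc.) is an assumption made for this sketch.

```python
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Unidirectional diagonal SSM: h_t = a * h_{t-1} + b * x_t, y_t = sum_n(c * h_t)."""

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.a_logit = nn.Parameter(torch.randn(dim, state))
        self.b = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.c = nn.Parameter(torch.randn(dim, state) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); the scan is linear in `time`, unlike
        # attention, whose cost grows quadratically with sequence length.
        a = torch.sigmoid(self.a_logit)                    # decay in (0, 1): stable recurrence
        h = x.new_zeros(x.shape[0], *self.a_logit.shape)   # hidden state (batch, dim, state)
        ys = []
        for t in range(x.shape[1]):
            h = a * h + self.b * x[:, t].unsqueeze(-1)     # state update
            ys.append((self.c * h).sum(-1))                # readout to (batch, dim)
        return torch.stack(ys, dim=1)                      # (batch, time, dim)


class BidirectionalSSMBlock(nn.Module):
    """Temporal mixing block: one SSM scans the frame axis forward, another
    scans it in reverse, and their outputs are fused so that every frame can
    condition on both past and future frames."""

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.fwd = DiagonalSSM(dim, state)
        self.bwd = DiagonalSSM(dim, state)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_fwd = self.fwd(x)                                # left-to-right over time
        y_bwd = self.bwd(x.flip(1)).flip(1)                # right-to-left over time
        return x + self.proj(torch.cat([y_fwd, y_bwd], dim=-1))  # residual fusion


# Usage: a 256-frame clip of 64-dim per-frame features keeps its shape,
# at memory cost linear (not quadratic) in the 256-frame length.
frames = torch.randn(2, 256, 64)
out = BidirectionalSSMBlock(dim=64)(frames)                # (2, 256, 64)
```

The bidirectional fusion mirrors the abstract's finding: a diffusion model denoises the whole clip at once, so letting each frame read both earlier and later frames is a better fit than a strictly causal scan.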
