
Matten: Video Generation with Mamba-Attention (2405.03025v2)

Published 5 May 2024 in cs.CV

Abstract: In this paper, we introduce Matten, a latent diffusion model with a Mamba-attention architecture for video generation. At minimal computational cost, Matten employs spatial-temporal attention to model local video content and bidirectional Mamba to model global video content. Comprehensive experimental evaluation shows that Matten is competitive with current Transformer-based and GAN-based models on standard benchmarks, achieving superior FVD scores and efficiency. We also observe a direct positive correlation between model complexity and video quality, indicating Matten's excellent scalability.
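The division of labor the abstract describes — attention for local video content, a bidirectional Mamba scan for global content — can be sketched in simplified form. Note this is an illustrative toy, not the paper's implementation: the first-order linear recurrence stands in for Mamba's selective scan, the window size and residual ordering are assumptions, and real Matten operates on latent video tokens from a VAE encoder.

```python
import numpy as np

def bidirectional_ssm(x, a=0.9):
    # x: (T, D) sequence of token features. A forward and a backward
    # first-order linear recurrence, summed, as a toy stand-in for
    # Mamba's bidirectional selective scan (global context).
    T, D = x.shape
    fwd = np.zeros_like(x)
    bwd = np.zeros_like(x)
    h = np.zeros(D)
    for t in range(T):
        h = a * h + (1 - a) * x[t]
        fwd[t] = h
    h = np.zeros(D)
    for t in reversed(range(T)):
        h = a * h + (1 - a) * x[t]
        bwd[t] = h
    return fwd + bwd

def local_attention(x, window=2):
    # Windowed softmax attention over nearby tokens (local context).
    T, D = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        scores = x[lo:hi] @ x[t] / np.sqrt(D)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[t] = w @ x[lo:hi]
    return out

def matten_block(x):
    # One hypothetical block: local attention then a global bidirectional
    # scan, each with a residual connection (ordering is illustrative).
    x = x + local_attention(x)
    x = x + bidirectional_ssm(x)
    return x
```

The intuition the sketch captures is the complexity trade-off: windowed attention costs O(T·w) and sees only nearby frames, while the linear scans cost O(T) yet propagate information across the entire sequence — which is why the paper pairs the two rather than paying full O(T²) global attention.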
