Matten: Video Generation with Mamba-Attention (2405.03025v2)
Published 5 May 2024 in cs.CV
Abstract: In this paper, we introduce Matten, a latent diffusion model with a Mamba-Attention architecture for video generation. At minimal computational cost, Matten employs spatial-temporal attention to model local video content and bidirectional Mamba to model global video content. Our comprehensive experimental evaluation demonstrates that Matten is competitive with current Transformer-based and GAN-based models on standard benchmarks, achieving superior Fréchet Video Distance (FVD) scores and efficiency. We also observe a direct positive correlation between model complexity and video quality, indicating Matten's excellent scalability.
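The global-modeling idea behind the bidirectional Mamba branch can be illustrated with a minimal sketch. This is not the paper's implementation: the real model uses input-dependent (selective) parameters, vector-valued hidden states, and hardware-aware scans, whereas the toy below uses a scalar linear state-space recurrence run forward and backward over a token sequence, so every position aggregates context from both directions. The function names `ssm_scan` and `bidirectional_ssm` are illustrative, not from the paper.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """Scalar linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    A toy stand-in for a Mamba scan; in Mamba, a/b/c are
    input-dependent and the state is a vector per channel.
    """
    h = 0.0
    ys = []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return np.array(ys)

def bidirectional_ssm(x, a=0.9, b=1.0, c=1.0):
    """Forward scan plus a scan over the reversed sequence.

    Summing the two passes lets each token attend to both past
    and future context in linear time -- the 'global' modeling
    role the bidirectional Mamba branch plays in Matten.
    """
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return fwd + bwd

# With a=0.9, an impulse at position 0 decays geometrically in the
# forward pass, while the backward pass carries it to position 0 again.
y = bidirectional_ssm(np.array([1.0, 0.0, 0.0]))
```

This linear-time recurrence is what allows state-space branches to cover long spatio-temporal token sequences cheaply, while the attention branch handles local content at full pairwise resolution.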