PEEKABOO: Interactive Video Generation via Masked-Diffusion (2312.07509v2)
Abstract: Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module that integrates seamlessly with current video generation models, offering control without additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the project webpage.
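The abstract names masked attention as the control mechanism but gives no implementation details. The snippet below is a minimal, illustrative sketch of the general idea, not the authors' implementation: cross-attention in which a user-supplied bounding box decides which latent pixels may attend to the prompt tokens describing the foreground object. All function names, tensor shapes, and the box format are assumptions made for illustration.

```python
# Hypothetical sketch of spatially masked cross-attention (not the paper's code).
import torch

def bbox_to_mask(bbox, h, w):
    """Rasterize a normalized (x0, y0, x1, y1) box into a flat binary pixel mask."""
    x0, y0, x1, y1 = bbox
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask.flatten()  # (h*w,)

def masked_cross_attention(q, k, v, fg_pixels, fg_tokens):
    """
    q: (B, N, D) latent-pixel queries for one frame (N = h*w)
    k, v: (B, T, D) text-token keys / values
    fg_pixels: (N,) bool, True where the object should appear
    fg_tokens: (T,) bool, True for prompt tokens describing the object
    Foreground pixels attend only to foreground tokens and background pixels
    only to background tokens; all other pixel-token pairs are suppressed.
    """
    d = q.shape[-1]
    allow = fg_pixels[:, None] == fg_tokens[None, :]      # (N, T) allowed pairs
    bias = torch.zeros(allow.shape, dtype=q.dtype)
    bias[~allow] = -1e9                                   # suppress disallowed pairs
    scores = q @ k.transpose(-1, -2) / d ** 0.5 + bias    # (B, N, T)
    return scores.softmax(dim=-1) @ v                     # (B, N, D)

# Example: constrain an object to the left half of a 16x16 latent frame.
if __name__ == "__main__":
    B, h, w, T, D = 1, 16, 16, 8, 64
    q = torch.randn(B, h * w, D)
    k, v = torch.randn(B, T, D), torch.randn(B, T, D)
    fg_pixels = bbox_to_mask((0.0, 0.2, 0.5, 0.8), h, w).bool()
    fg_tokens = torch.zeros(T, dtype=torch.bool)
    fg_tokens[2:5] = True                                 # tokens for the object phrase
    out = masked_cross_attention(q, k, v, fg_pixels, fg_tokens)
    print(out.shape)  # torch.Size([1, 256, 64])
```

Because the mask only modifies attention scores inside an otherwise frozen model, a scheme of this kind adds no trainable parameters and negligible compute, which is consistent with the abstract's claim of control without extra training or inference overhead.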
Authors: Yash Jain, Anshul Nasery, Vibhav Vineet, Harkirat Behl