PEEKABOO: Interactive Video Generation via Masked-Diffusion (2312.07509v2)

Published 12 Dec 2023 in cs.CV and cs.LG

Abstract: Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module, which seamlessly integrates with current video generation models offering control without the need for additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage.

References (42)
  1. A-star: Test-time attention segregation and retention for text-to-image synthesis, 2023.
  2. Flamingo: a visual language model for few-shot learning, 2022.
  3. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  4. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing, 2023.
  5. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  6. Motion-conditioned diffusion model for controllable video synthesis, 2023a.
  7. Control-a-video: Controllable text-to-video generation with diffusion models, 2023b.
  8. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
  9. Masked-attention mask transformer for universal image segmentation, 2022.
  10. Diffusion self-guidance for controllable image generation, 2023.
  11. Structure and content-guided video synthesis with diffusion models, 2023.
  12. The "Something Something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
  13. Imagen video: High definition video generation with diffusion models, 2022a.
  14. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b.
  15. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  16. Lamd: Latent motion diffusion for video generation, 2023.
  17. Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator. arXiv preprint arXiv:2309.14494, 2023.
  18. Text2video-zero: Text-to-image diffusion models are zero-shot video generators, 2023.
  19. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022.
  20. Gligen: Open-set grounded text-to-image generation, 2023.
  21. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models, 2023a.
  22. Llm-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023b.
  23. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning, 2023.
  24. On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235, 2018.
  25. Simple open-vocabulary object detection. In European Conference on Computer Vision, pages 728–755. Springer, 2022.
  26. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  27. Hotshot-XL, 2023.
  28. Grounded text-to-image synthesis with attention refocusing, 2023.
  29. High-resolution image synthesis with latent diffusion models, 2021.
  30. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  31. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  32. Make-a-video: Text-to-video generation without text-video data, 2022.
  33. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  34. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  35. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  36. Videocomposer: Compositional video synthesis with motion controllability, 2023b.
  37. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023a.
  38. Cvpr 2023 text guided video editing competition, 2023b.
  39. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  40. Controlvideo: Training-free controllable text-to-video generation, 2023.
  41. Magicvideo: Efficient video generation with latent diffusion models, 2022.
  42. Generalized decoding for pixel, image and language. 2022.
Authors (4)
  1. Yash Jain (14 papers)
  2. Anshul Nasery (12 papers)
  3. Vibhav Vineet (58 papers)
  4. Harkirat Behl (9 papers)
Citations (17)

Summary

  • The paper introduces a training-free masked-diffusion mechanism that lets users specify spatio-temporal constraints during video generation.
  • It applies masked spatio-temporal attention inside a 3D UNet, adapting masked-attention ideas from segmentation models such as MaskFormer and Mask2Former, and achieves up to a 3.8× improvement in mIoU over baselines.
  • Evaluations on two new benchmark datasets demonstrate improved spatial control and video quality, enabling interactive applications in animation, gaming, and virtual reality.

Peekaboo: Interactive Video Generation via Masked-Diffusion

The paper, Peekaboo: Interactive Video Generation via Masked-Diffusion, tackles a significant limitation of contemporary text-to-video generation models—the lack of user interactivity and control over generated content. Current video generation models, while capable of producing high-fidelity videos from textual descriptions, are inherently non-interactive. They lack mechanisms for users to specify the spatial (size and location) and temporal (movement) characteristics of objects in the generated videos. This work introduces a novel technique—Peekaboo—that aims to address these shortcomings by providing spatio-temporal control in a training-free manner, leading to enhanced user interaction capabilities in video generation.

Methodology

Peekaboo adapts and extends recent advances from the segmentation literature, particularly leveraging concepts from MaskFormer and Mask2Former. The core idea is to apply masked spatio-temporal attention within the denoising steps of diffusion-based video generation models. The architectural backbone used is a 3D UNet, prevalent in state-of-the-art video generation models.
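
To make this concrete, the sketch below shows one way a user's spatio-temporal specification (here, per-frame bounding boxes) could be rasterized into binary masks at the latent resolution used inside the 3D UNet. The function name and box format are assumptions for illustration, not the paper's released interface.

```python
import torch

def boxes_to_latent_masks(boxes, latent_h, latent_w):
    """Rasterize per-frame bounding boxes into binary foreground masks
    at the latent (downsampled) resolution used inside the 3D UNet.

    boxes: sequence of (x0, y0, x1, y1) tuples, one per frame, in
    normalized [0, 1] coordinates (an assumed input format).
    Returns a (num_frames, latent_h, latent_w) tensor with 1s inside
    the user-specified box and 0s elsewhere.
    """
    masks = torch.zeros(len(boxes), latent_h, latent_w)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        r0, r1 = int(y0 * latent_h), max(int(y1 * latent_h), int(y0 * latent_h) + 1)
        c0, c1 = int(x0 * latent_w), max(int(x1 * latent_w), int(x0 * latent_w) + 1)
        masks[t, r0:r1, c0:c1] = 1.0
    return masks

# Example: a box sliding left to right across 8 frames at 64x64 latent resolution.
trajectory = [(0.1 + 0.08 * t, 0.3, 0.3 + 0.08 * t, 0.7) for t in range(8)]
fg_masks = boxes_to_latent_masks(trajectory, latent_h=64, latent_w=64)
```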

Peekaboo operates by selectively masking the spatial-, cross-, and temporal-attention layers within the diffusion model. At each step of the denoising process, foreground and background elements are influenced only by their local context, as determined by the provided input masks. These masks specify which pixels belong to the foreground object, so it is generated without cross-interference from the background. This approach lets the model produce user-specified, high-quality video outputs without requiring additional training or introducing latency at inference.
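
The following sketch illustrates the masking idea for a single frame's spatial self-attention: foreground queries attend only to foreground keys and background queries only to background keys, implemented as an additive attention bias. Cross- and temporal-attention would be masked analogously. Tensor shapes and the bias construction are assumptions for illustration, not the exact Peekaboo implementation.

```python
import torch
import torch.nn.functional as F

def masked_spatial_attention(q, k, v, fg_mask, neg_inf=-1e9):
    """Masked spatial self-attention for one frame (illustrative sketch).

    q, k, v: (num_tokens, dim) queries/keys/values for the flattened
             latent pixels of a single frame.
    fg_mask: (num_tokens,) binary vector, 1 for foreground pixels.
    Queries and keys from different regions are blocked, so foreground
    and background are denoised from their local context only.
    """
    fg = fg_mask.bool()
    same_region = fg[:, None] == fg[None, :]        # (N, N): query and key share a region
    bias = (~same_region).to(q.dtype) * neg_inf     # large negative bias blocks cross-region attention
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5 + bias
    return F.softmax(scores, dim=-1) @ v
```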

Evaluation and Results

The effectiveness of Peekaboo is demonstrated through both qualitative and quantitative analyses. The paper introduces two datasets for this purpose:

  1. Something-Something v2-Spatio-Temporal (ssv2-ST): Derived from the Something-Something v2 dataset with bounding-box annotations, it evaluates the method's ability to control spatio-temporal placement in realistic settings.
  2. Interactive Motion Control (IMC): Custom-designed to assess interactive scenarios where users specify bounding boxes for objects in motion.

Spatial Control

Evaluations on these datasets reveal that Peekaboo significantly enhances spatial control in generated videos, as measured by mean Intersection-over-Union (mIoU) and coverage percentage. Specifically, experiments show up to a 3.8× improvement in mIoU over baseline models, demonstrating Peekaboo's ability to localize objects accurately according to user specifications.
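
In principle these metrics are reproduced by running an object detector on the generated frames and comparing its boxes to the user-specified ones. The sketch below computes per-video mIoU and coverage under that assumption; the detector choice and the IoU threshold are illustrative, not taken from the paper.

```python
def box_iou(a, b):
    """IoU between two boxes (x0, y0, x1, y1) in the same coordinate frame."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area = lambda r: max(r[2] - r[0], 0) * max(r[3] - r[1], 0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def spatial_control_metrics(target_boxes, detected_boxes, iou_thresh=0.5):
    """Per-video mIoU and coverage over frames. `detected_boxes` would come
    from an off-the-shelf detector run on the generated frames; the exact
    detector and threshold used in the paper are assumptions here."""
    ious = [box_iou(t, d) for t, d in zip(target_boxes, detected_boxes)]
    miou = sum(ious) / len(ious)
    coverage = sum(i >= iou_thresh for i in ious) / len(ious)
    return miou, coverage
```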

Video Quality

Beyond spatial control, Peekaboo also improves the overall quality of generated videos. This is substantiated through comparisons on the MSR-VTT dataset, a standard benchmark for large-scale video generation evaluation. Metrics such as the Fréchet Video Distance (FVD), where lower is better, indicate superior quality, validating the claim that spatial conditioning via Peekaboo does not compromise generative fidelity but rather enhances it.
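
For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos (typically I3D embeddings). The snippet below shows that generic computation; it is not code from the paper.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_r, cov_r, mu_g, cov_g):
    """Fréchet distance between two Gaussians fitted to video features
    (e.g., I3D embeddings of real vs. generated clips); lower is better."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```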

Implications

Peekaboo's implications are manifold:

  • Practical Applications: It opens up avenues for more interactive applications in creative industries such as animation, gaming, and virtual reality, where user-driven content creation is paramount.
  • Research Advancements: Peekaboo sets a benchmark for developing zero-training techniques that can be retrofitted into existing models to provide enhanced functionalities without additional computational costs.

Future Developments

This research paves the way for several future directions:

  • Extension to Other Domains: Exploring the application of Peekaboo in fields beyond text-to-video, such as image-to-video or video-to-video generation, would be valuable.
  • Long-Form Video Generation: Enhancing the scalability of Peekaboo for generating longer and more complex video sequences with intricate user interactions.
  • Integration with LLMs: Coupling Peekaboo with advanced LLMs to create end-to-end systems that can interpret and implement detailed user commands for video generation.

Conclusion

Peekaboo: Interactive Video Generation via Masked-Diffusion offers a robust solution for incorporating user interactivity into video generation models. It achieves this through an innovative use of masked attention mechanisms within a diffusion framework, without necessitating additional training or inference overhead. As such, Peekaboo stands as a significant contribution to the area of interactive AI, with practical and theoretical implications that extend across multiple domains of artificial intelligence and computer vision.
