PEEKABOO: Interactive Video Generation via Masked-Diffusion (2312.07509v2)

Published 12 Dec 2023 in cs.CV and cs.LG

Abstract: Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module, which seamlessly integrates with current video generation models offering control without the need for additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage.

References (42)
  1. A-star: Test-time attention segregation and retention for text-to-image synthesis, 2023.
  2. Flamingo: a visual language model for few-shot learning, 2022.
  3. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113, 2023.
  4. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing, 2023.
  5. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  6. Motion-conditioned diffusion model for controllable video synthesis, 2023a.
  7. Control-a-video: Controllable text-to-video generation with diffusion models, 2023b.
  8. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
  9. Masked-attention mask transformer for universal image segmentation, 2022.
  10. Diffusion self-guidance for controllable image generation, 2023.
  11. Structure and content-guided video synthesis with diffusion models, 2023.
  12. The "Something Something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
  13. Imagen video: High definition video generation with diffusion models, 2022a.
  14. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022b.
  15. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  16. Lamd: Latent motion diffusion for video generation, 2023.
  17. Free-bloom: Zero-shot text-to-video generator with LLM director and LDM animator. arXiv preprint arXiv:2309.14494, 2023.
  18. Text2video-zero: Text-to-image diffusion models are zero-shot video generators, 2023.
  19. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022.
  20. Gligen: Open-set grounded text-to-image generation, 2023.
  21. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models, 2023a.
  22. Llm-grounded video diffusion models. arXiv preprint arXiv:2309.17444, 2023b.
  23. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning, 2023.
  24. On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235, 2018.
  25. Simple open-vocabulary object detection. In European Conference on Computer Vision, pages 728–755. Springer, 2022.
  26. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  27. Hotshot-XL, 2023.
  28. Grounded text-to-image synthesis with attention refocusing, 2023.
  29. High-resolution image synthesis with latent diffusion models, 2021.
  30. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  31. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  32. Make-a-video: Text-to-video generation without text-video data, 2022.
  33. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  34. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  35. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  36. Videocomposer: Compositional video synthesis with motion controllability, 2023b.
  37. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023a.
  38. Cvpr 2023 text guided video editing competition, 2023b.
  39. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  40. Controlvideo: Training-free controllable text-to-video generation, 2023.
  41. Magicvideo: Efficient video generation with latent diffusion models, 2022.
  42. Generalized decoding for pixel, image and language. 2022.
Authors (4)
  1. Yash Jain (14 papers)
  2. Anshul Nasery (12 papers)
  3. Vibhav Vineet (58 papers)
  4. Harkirat Behl (9 papers)
Citations (17)

Summary

  • The paper introduces a training-free masked-diffusion mechanism that lets users specify spatio-temporal constraints during video generation.
  • It applies masked spatio-temporal attention inside a 3D UNet, adapting masked-attention ideas from segmentation models such as MaskFormer and Mask2Former, and achieves up to a 3.8× improvement in mIoU over baselines.
  • Evaluations on two new benchmark datasets demonstrate improved spatial control and video quality, enabling interactive applications in animation, gaming, and virtual reality.

Peekaboo: Interactive Video Generation via Masked-Diffusion

The paper, Peekaboo: Interactive Video Generation via Masked-Diffusion, tackles a significant limitation of contemporary text-to-video generation models—the lack of user interactivity and control over generated content. Current video generation models, while capable of producing high-fidelity videos from textual descriptions, are inherently non-interactive. They lack mechanisms for users to specify the spatial (size and location) and temporal (movement) characteristics of objects in the generated videos. This work introduces a novel technique—Peekaboo—that aims to address these shortcomings by providing spatio-temporal control in a training-free manner, leading to enhanced user interaction capabilities in video generation.

Methodology

Peekaboo adapts and extends recent advances from the segmentation literature, particularly leveraging concepts from MaskFormer and Mask2Former. The core idea is to apply masked spatio-temporal attention within the denoising steps of diffusion-based video generation models. The architectural backbone used is a 3D UNet, prevalent in state-of-the-art video generation models.
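
To make this concrete, the sketch below shows one way a user's spatio-temporal specification (here, per-frame bounding boxes) could be rasterized into binary masks at the latent resolution used inside the 3D UNet. The function name and box format are assumptions for illustration, not the paper's released interface.

```python
import torch

def boxes_to_latent_masks(boxes, latent_h, latent_w):
    """Rasterize per-frame bounding boxes into binary foreground masks
    at the latent (downsampled) resolution used inside the 3D UNet.

    boxes: sequence of (x0, y0, x1, y1) tuples, one per frame, in
    normalized [0, 1] coordinates (an assumed input format).
    Returns a (num_frames, latent_h, latent_w) tensor with 1s inside
    the user-specified box and 0s elsewhere.
    """
    masks = torch.zeros(len(boxes), latent_h, latent_w)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        r0, r1 = int(y0 * latent_h), max(int(y1 * latent_h), int(y0 * latent_h) + 1)
        c0, c1 = int(x0 * latent_w), max(int(x1 * latent_w), int(x0 * latent_w) + 1)
        masks[t, r0:r1, c0:c1] = 1.0
    return masks

# Example: a box sliding left to right across 8 frames at 64x64 latent resolution.
trajectory = [(0.1 + 0.08 * t, 0.3, 0.3 + 0.08 * t, 0.7) for t in range(8)]
fg_masks = boxes_to_latent_masks(trajectory, latent_h=64, latent_w=64)
```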

Peekaboo operates by selectively masking the spatial-, cross-, and temporal-attention layers within the diffusion model. At each step of the denoising process, foreground and background elements are influenced only by their local context, as determined by the provided input masks. These masks specify which pixels belong to the foreground object, so it is generated without cross-interference from the background. This approach lets the model produce user-specified, high-quality video outputs without requiring additional training or introducing latency at inference.
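
The following sketch illustrates the masking idea for a single frame's spatial self-attention: foreground queries attend only to foreground keys and background queries only to background keys, implemented as an additive attention bias. Cross- and temporal-attention would be masked analogously. Tensor shapes and the bias construction are assumptions for illustration, not the exact Peekaboo implementation.

```python
import torch
import torch.nn.functional as F

def masked_spatial_attention(q, k, v, fg_mask, neg_inf=-1e9):
    """Masked spatial self-attention for one frame (illustrative sketch).

    q, k, v: (num_tokens, dim) queries/keys/values for the flattened
             latent pixels of a single frame.
    fg_mask: (num_tokens,) binary vector, 1 for foreground pixels.
    Queries and keys from different regions are blocked, so foreground
    and background are denoised from their local context only.
    """
    fg = fg_mask.bool()
    same_region = fg[:, None] == fg[None, :]        # (N, N): query and key share a region
    bias = (~same_region).to(q.dtype) * neg_inf     # large negative bias blocks cross-region attention
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5 + bias
    return F.softmax(scores, dim=-1) @ v
```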

Evaluation and Results

The effectiveness of Peekaboo is demonstrated through both qualitative and quantitative analyses. The paper introduces two datasets for this purpose:

  1. Something-Something v2-Spatio-Temporal (ssv2-ST): Derived from the Something-Something v2 dataset with bounding-box annotations, it evaluates the method's ability to control spatio-temporal placement in realistic settings.
  2. Interactive Motion Control (IMC): Custom-designed to assess interactive scenarios where users specify bounding boxes for objects in motion.

Spatial Control

Evaluations on these datasets reveal that Peekaboo significantly enhances spatial control in generated videos, as measured by mean Intersection-over-Union (mIoU) and coverage percentage. Specifically, experiments show up to a 3.8× improvement in mIoU over baseline models, demonstrating Peekaboo's ability to localize objects accurately according to user specifications.
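
In principle these metrics are reproduced by running an object detector on the generated frames and comparing its boxes to the user-specified ones. The sketch below computes per-video mIoU and coverage under that assumption; the detector choice and the IoU threshold are illustrative, not taken from the paper.

```python
def box_iou(a, b):
    """IoU between two boxes (x0, y0, x1, y1) in the same coordinate frame."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area = lambda r: max(r[2] - r[0], 0) * max(r[3] - r[1], 0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def spatial_control_metrics(target_boxes, detected_boxes, iou_thresh=0.5):
    """Per-video mIoU and coverage over frames. `detected_boxes` would come
    from an off-the-shelf detector run on the generated frames; the exact
    detector and threshold used in the paper are assumptions here."""
    ious = [box_iou(t, d) for t, d in zip(target_boxes, detected_boxes)]
    miou = sum(ious) / len(ious)
    coverage = sum(i >= iou_thresh for i in ious) / len(ious)
    return miou, coverage
```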

Video Quality

Beyond spatial control, Peekaboo also improves the overall quality of generated videos. This is substantiated through comparisons on the MSR-VTT dataset, a standard benchmark for large-scale video generation evaluation. Metrics such as the Fréchet Video Distance (FVD), where lower is better, indicate superior quality, validating the claim that spatial conditioning via Peekaboo does not compromise generative fidelity but rather enhances it.
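
For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos (typically I3D embeddings). The snippet below shows that generic computation; it is not code from the paper.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_r, cov_r, mu_g, cov_g):
    """Fréchet distance between two Gaussians fitted to video features
    (e.g., I3D embeddings of real vs. generated clips); lower is better."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```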

Implications

Peekaboo's implications are manifold:

  • Practical Applications: It opens up avenues for more interactive applications in creative industries such as animation, gaming, and virtual reality, where user-driven content creation is paramount.
  • Research Advancements: Peekaboo sets a benchmark for developing zero-training techniques that can be retrofitted into existing models to provide enhanced functionalities without additional computational costs.

Future Developments

This research paves the way for several future directions:

  • Extension to Other Domains: Exploring the application of Peekaboo in fields beyond text-to-video, such as image-to-video or video-to-video generation, would be valuable.
  • Long-Form Video Generation: Enhancing the scalability of Peekaboo for generating longer and more complex video sequences with intricate user interactions.
  • Integration with LLMs: Coupling Peekaboo with advanced LLMs to create end-to-end systems that can interpret and implement detailed user commands for video generation.

Conclusion

Peekaboo: Interactive Video Generation via Masked-Diffusion offers a robust solution for incorporating user interactivity into video generation models. It achieves this through an innovative use of masked attention mechanisms within a diffusion framework, without necessitating additional training or inference overhead. As such, Peekaboo stands as a significant contribution to the area of interactive AI, with practical and theoretical implications that extend across multiple domains of artificial intelligence and computer vision.
