- The paper introduces a two-stage framework that leverages mask-based motion trajectories to generate videos from static images.
- It employs object-level attention and masked spatio-temporal self-attention to ensure temporal coherence and realistic motion.
- Quantitative benchmarks show state-of-the-art FVD scores, and human evaluators prefer its visual quality and motion consistency over prior methods.
An Expert Overview of "Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation"
The paper "Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation" introduces an innovative framework for transforming static images into dynamic video sequences, guided by textual descriptions—a process known as Image-to-Video (I2V) generation. This research addresses a significant challenge within the field: generating videos that maintain consistent and accurate motion, particularly in scenarios involving multiple objects.
Key Contributions and Methodology
This paper presents a two-stage compositional framework for I2V generation, departing from conventional single-stage methods. The authors decompose the process into two distinct stages:
- Image-to-Motion Generation: The first stage generates mask-based motion trajectories that serve as an intermediate representation, capturing both semantic object information and motion. The mask-based formulation encodes object-level semantics and inter-object interactions in a compact, expressive form.
- Motion-to-Video Generation: The second stage conditions on the generated motion trajectories to produce the final video. Object-level attention mechanisms, including masked cross-attention to inject per-object conditioning and a masked spatio-temporal self-attention mechanism, ensure that each object remains consistent across frames (a minimal sketch of the masked self-attention follows this list).
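To make the masked attention concrete, here is a minimal PyTorch sketch of masked spatio-temporal self-attention: every latent token carries an object label derived from the mask trajectory, and attention is restricted so tokens only attend to tokens of the same object across all frames. The module name, tensor shapes, and the way object IDs are obtained are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSpatioTemporalAttention(nn.Module):
    """Self-attention over all frames in which each token may only attend to
    tokens carrying the same object label, keeping features of different
    objects from mixing across time. (Hypothetical sketch, not the paper's code.)"""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, object_ids: torch.Tensor) -> torch.Tensor:
        # x:          (B, N, C) latent tokens flattened over time and space, N = T*H*W
        # object_ids: (B, N)    integer object label per token, taken from the
        #                       predicted mask trajectory (background gets a label
        #                       too, so every token attends at least to itself)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # True where query and key tokens belong to the same object; the full
        # (N x N) mask is materialized here only for clarity of the sketch.
        same_object = (object_ids[:, :, None] == object_ids[:, None, :]).unsqueeze(1)
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=same_object)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because background tokens also carry a label, every query has at least one allowed key, so the masked softmax is always well defined.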
The use of a mask-based motion trajectory as the intermediate representation is particularly noteworthy. It captures motion and semantics more robustly than prior approaches that rely on optical flow (OF), which encodes dense per-pixel displacement that is largely redundant and carries no object-level semantics in I2V settings; the sketch below makes the contrast concrete.
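As a deliberately simplified illustration of the difference between the two representations, compare them as tensors. The shapes below are assumptions chosen for the example, not the paper's exact layout.

```python
import torch

T, H, W = 16, 32, 32   # frames and spatial resolution (illustrative)
num_objects = 3        # foreground instances (background could be another channel)

# Optical flow: a dense 2-D displacement for every pixel at every frame
# transition. It says *where* pixels move but not *what* is moving.
optical_flow = torch.randn(T - 1, 2, H, W)

# Mask-based motion trajectory: one segmentation mask per object per frame.
# An object's motion is the evolution of its mask through time, and object
# identity is explicit, so semantics travel together with the motion signal.
mask_trajectory = torch.zeros(T, num_objects, H, W, dtype=torch.bool)
```

The binary masks are both more compact than dense flow fields and directly tied to object identity, which is what the second stage exploits through its masked attention.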
Numerical Results and Analysis
The authors evaluate their method on benchmarks specifically designed to stress complex motion dynamics and multiple interacting objects. The method achieves state-of-the-art performance in temporal coherence, motion realism, and adherence to textual prompts.
- The proposed method achieves strong results on Fréchet Video Distance (FVD), surpassing contemporary methods in both single-object and multi-object settings (the metric is sketched after this list).
- Human evaluators show a clear preference for the visual quality and motion consistency of the videos generated by this method over previous approaches.
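For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos, conventionally extracted with a pretrained I3D network; lower is better. Below is a minimal sketch assuming the feature-extraction step has already been done (the function name and shapes are ours, not from the paper).

```python
import numpy as np
from scipy.linalg import sqrtm

def fvd(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Each input is (num_videos, feature_dim). FVD is the Frechet distance
    between Gaussians fitted to the two feature sets:
        ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```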
Implications and Future Directions
The introduction of mask-based motion trajectories suggests a substantial improvement in how motion is represented and utilized in the I2V pipeline. This work bridges the gap between static image understanding and dynamic video synthesis, offering a more granular level of control over object motions, which can be crucial for applications in animation and film production, virtual reality, and interactive storytelling.
Furthermore, the proposed framework's architecture-agnostic nature implies its adaptability to future advances in neural network architectures, which could further optimize video generation tasks. The authors' release of a new benchmark dataset (SA-V-128) is another valuable contribution, providing a foundation for future research to explore and refine I2V generation techniques.
Conclusion
By generating videos from static images through intermediate mask-based representations, "Through-The-Mask" represents a significant step forward in video synthesis. The research offers a robust solution to the persistent problem of maintaining motion accuracy and consistency, and it opens avenues for further exploration and enhancement of video generation technologies. As AI continues to evolve, frameworks like this hold the potential to transform how visual media is created and experienced.