- The paper Reangle-A-Video introduces a unified framework to generate synchronized multi-view (4D) videos from a single input video by treating it as a translation task.
- The approach involves learning view-invariant motion from warped videos and generating consistent multi-view starting images using cross-view guidance.
- Experiments show the method produces higher quality and more consistent 4D videos compared to existing techniques, supporting dynamic camera control for potential applications like VR.
Reangle-A-Video is a framework designed to generate 4D videos by creating synchronized multi-view versions from one input video. The key idea is to treat the task as a translation from one video into multiple videos, each corresponding to a different camera view. This approach bypasses the need to train on vast 4D datasets by leveraging existing diffusion models for both images and videos.
Key Contributions:
- Unified framework that handles both static view transport (turning one view into a consistent multi-view sequence) and dynamic camera control.
- A method to learn view-invariant motion from a set of warped videos by fine-tuning an image-to-video diffusion transformer in a self-supervised way.
- An inference method that uses cross-view consistency guidance (through an image inpainting process supported by a multi-view stereo reconstruction network) to generate the starting images from different perspectives.
Method Overview:
- Multi-View Motion Learning:
  - The input video is warped into several versions seen from different camera angles.
  - An image-to-video diffusion transformer is fine-tuned on these warped videos, capturing the underlying motion in a view-invariant way, i.e., independent of the camera view.
- Multi-View Consistent Image-to-Image Translation:
  - The first frame of the video is translated into a set of starting images, one per target camera perspective.
  - Each starting image is produced by warping and inpainting the frame, with cross-view consistency enforced using an off-the-shelf multi-view stereo reconstruction network.
- 4D Video Generation:
  - Combining the view-invariant motion with the multi-view consistent starting images yields video sequences that are synchronized across views and support dynamic camera movement.
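The warping step above can be illustrated with a minimal depth-based forward warp. This is a sketch under simple assumptions (pinhole camera, shared intrinsics `K`, relative pose `R`/`t`, no z-buffering), not the paper's actual warping implementation; the function and its signature are hypothetical. Unfilled target pixels stay zero, which is exactly the kind of hole a later inpainting pass would fill.

```python
import numpy as np

def forward_warp(frame, depth, K, R, t):
    """Reproject a frame into a novel camera view using per-pixel depth.

    frame: (H, W, C) image, depth: (H, W) per-pixel depth in the source view,
    K: 3x3 intrinsics shared by both cameras, R/t: rotation and translation
    of the target camera relative to the source. Simplification: collisions
    are resolved by last-write-wins rather than a z-buffer.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    # Back-project pixels to 3D points in the source camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Transform into the target camera frame and project to pixels.
    pts_t = R @ pts + t.reshape(3, 1)
    proj = K @ pts_t
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    # Keep only points in front of the camera that land inside the image.
    valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros_like(frame)  # zeros = holes for later inpainting
    out[v[valid], u[valid]] = frame.reshape(-1, frame.shape[-1])[valid]
    return out
```

With the identity pose (`R = I`, `t = 0`) the warp reproduces the input frame, which is a useful sanity check when wiring up real camera parameters.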
Results:
Experiments in the paper indicate that the method outperforms existing techniques in visual quality and cross-view consistency. The generated 4D videos maintain synchronized motion across viewpoints while allowing flexible camera control.
In summary, Reangle-A-Video offers an innovative solution for generating multi-view (or 4D) videos by translating a single view into several coherent views, using a two-stage process that first learns motion and then ensures visual consistency. This approach opens new possibilities for applications like virtual reality and advanced video editing by simplifying the process of creating content from limited input data.
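The two-stage process can be sketched as a high-level orchestration. All names below are illustrative stand-ins, not the paper's API: `warp`, `finetune_i2v`, and `translate_first_frame` represent the geometric warp, the diffusion fine-tuning step, and the warp-and-inpaint translation with cross-view guidance, respectively.

```python
def generate_4d_video(video, views, warp, finetune_i2v, translate_first_frame):
    """Hypothetical orchestration of the two-stage pipeline.

    video: list of frames; views: target camera parameters;
    the three callables are placeholders for the real components.
    """
    # Stage 1: learn view-invariant motion from warped copies of the input.
    warped_clips = [[warp(f, v) for f in video] for v in views]
    motion_model = finetune_i2v(warped_clips)

    # Stage 2: translate the first frame into consistent starting images.
    starts = [translate_first_frame(video[0], v) for v in views]

    # Animate each starting image with the shared, view-invariant motion,
    # producing one synchronized video per target view.
    return [motion_model(s) for s in starts]
```

The design point worth noting is the decoupling: motion is learned once from the warped copies, then reused to animate every per-view starting image, which is what keeps the resulting views synchronized.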