- The paper Reangle-A-Video introduces a unified framework to generate synchronized multi-view (4D) videos from a single input video by treating it as a translation task.
- The approach involves learning view-invariant motion from warped videos and generating consistent multi-view starting images using cross-view guidance.
- Experiments show the method produces higher quality and more consistent 4D videos compared to existing techniques, supporting dynamic camera control for potential applications like VR.
Reangle-A-Video is a framework designed to generate 4D videos by creating synchronized multi-view versions from one input video. The key idea is to treat the task as a translation from one video into multiple videos, each corresponding to a different camera view. This approach bypasses the need to train on vast 4D datasets by leveraging existing diffusion models for both images and videos.
Key Contributions:
- Unified framework that handles both static view transport (turning one view into a consistent multi-view sequence) and dynamic camera control.
- A method to learn view-invariant motion from a set of warped videos by fine-tuning an image-to-video diffusion transformer in a self-supervised way.
- An inference method that uses cross-view consistency guidance (through an image inpainting process supported by a multi-view stereo reconstruction network) to generate the starting images from different perspectives.
Method Overview:
- Multi-View Motion Learning:
  - The input video is warped into several versions seen from different camera angles.
  - An image-to-video diffusion transformer is fine-tuned on these warped videos, capturing the underlying motion in a view-invariant way, i.e., independent of the camera view.
- Multi-View Consistent Image-to-Image Translation:
  - The first frame of the video is translated into a set of starting images, one per target camera perspective.
  - Each starting image is produced by warping and inpainting the frame, with cross-view consistency enforced using an off-the-shelf multi-view stereo reconstruction network.
- 4D Video Generation:
  - Combining the view-invariant motion with the multi-view consistent starting images yields video sequences that are synchronized across views and support dynamic camera movement.
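The warping step above can be illustrated with a minimal depth-based forward warp. This is a sketch under simple assumptions (pinhole camera, shared intrinsics `K`, relative pose `R`/`t`, no z-buffering), not the paper's actual warping implementation; the function and its signature are hypothetical. Unfilled target pixels stay zero, which is exactly the kind of hole a later inpainting pass would fill.

```python
import numpy as np

def forward_warp(frame, depth, K, R, t):
    """Reproject a frame into a novel camera view using per-pixel depth.

    frame: (H, W, C) image, depth: (H, W) per-pixel depth in the source view,
    K: 3x3 intrinsics shared by both cameras, R/t: rotation and translation
    of the target camera relative to the source. Simplification: collisions
    are resolved by last-write-wins rather than a z-buffer.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    # Back-project pixels to 3D points in the source camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Transform into the target camera frame and project to pixels.
    pts_t = R @ pts + t.reshape(3, 1)
    proj = K @ pts_t
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    # Keep only points in front of the camera that land inside the image.
    valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros_like(frame)  # zeros = holes for later inpainting
    out[v[valid], u[valid]] = frame.reshape(-1, frame.shape[-1])[valid]
    return out
```

With the identity pose (`R = I`, `t = 0`) the warp reproduces the input frame, which is a useful sanity check when wiring up real camera parameters.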
Results:
Experiments in the paper indicate that the method outperforms existing techniques in visual quality and cross-view consistency. The generated 4D videos maintain synchronized motion across viewpoints while allowing flexible camera control.
In summary, Reangle-A-Video offers an innovative solution for generating multi-view (or 4D) videos by translating a single view into several coherent views, using a two-stage process that first learns motion and then ensures visual consistency. This approach opens new possibilities for applications like virtual reality and advanced video editing by simplifying the process of creating content from limited input data.
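The two-stage process can be sketched as a high-level orchestration. All names below are illustrative stand-ins, not the paper's API: `warp`, `finetune_i2v`, and `translate_first_frame` represent the geometric warp, the diffusion fine-tuning step, and the warp-and-inpaint translation with cross-view guidance, respectively.

```python
def generate_4d_video(video, views, warp, finetune_i2v, translate_first_frame):
    """Hypothetical orchestration of the two-stage pipeline.

    video: list of frames; views: target camera parameters;
    the three callables are placeholders for the real components.
    """
    # Stage 1: learn view-invariant motion from warped copies of the input.
    warped_clips = [[warp(f, v) for f in video] for v in views]
    motion_model = finetune_i2v(warped_clips)

    # Stage 2: translate the first frame into consistent starting images.
    starts = [translate_first_frame(video[0], v) for v in views]

    # Animate each starting image with the shared, view-invariant motion,
    # producing one synchronized video per target view.
    return [motion_model(s) for s in starts]
```

The design point worth noting is the decoupling: motion is learned once from the warped copies, then reused to animate every per-view starting image, which is what keeps the resulting views synchronized.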