Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control (2405.17414v1)

Published 27 May 2024 in cs.CV and cs.GR

Abstract: Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments. Project page: https://collaborativevideodiffusion.github.io/.

Collaborative Video Diffusion for Multi-View Consistency with Camera Control

The paper presents a novel approach called Collaborative Video Diffusion (CVD), designed to address the challenge of generating multiple videos of the same scene from different camera trajectories while maintaining consistency. The work builds upon recent advances in video generation, particularly leveraging diffusion models and camera control technologies.

Introduction

Recent progress in diffusion models has significantly advanced video generation quality. Models such as Sora can generate high-quality videos with complex dynamics, controlled primarily through text or image inputs. However, these methods lack precise control over camera movement and scene content, which is vital for practical applications. Prior works have explored conditioning video generation on various inputs but have not yet satisfactorily addressed camera control.

The need for consistent multi-view video generation is apparent in several applications, such as large-scale 3D scene generation. Existing approaches like MotionCtrl and CameraCtrl have made initial strides in camera control by conditioning video generative models on sequences of camera poses. However, these methods handle only a single camera trajectory per generated video, which leads to inconsistencies when generating multiple videos of the same scene.

Methodology

The proposed CVD framework introduces several key innovations to achieve coherent multi-view video generation:

  1. Cross-Video Synchronization Module: To keep corresponding frames consistent across videos rendered from different camera poses, the paper introduces a cross-video synchronization module. The module uses an epipolar attention mechanism that aligns features across videos based on the fundamental matrix derived from the corresponding camera poses (a minimal sketch follows this list).
  2. Hybrid Training Strategy: The training procedure utilizes two datasets: RealEstate10K, which provides camera-calibrated static indoor scenes, and WebVid10M, offering a diverse array of dynamic scenes without camera poses. The model is trained in two phases:
    • Phase one uses video folding to create synchronized video pairs from RealEstate10K.
    • Phase two applies homography transformations to WebVid10M videos to simulate camera movement, enabling training on dynamic scenes (an illustrative warping sketch also follows the list).
  3. Collaborative Inference Algorithm: The model extends from generating video pairs to an arbitrary number of videos via a collaborative inference algorithm. At each denoising step it samples pairs of videos, runs the pairwise model on each pair, and averages the resulting noise predictions per video to keep all outputs consistent (see the sketch after this list).
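
The epipolar attention in the synchronization module can be pictured as a soft constraint on where tokens from corresponding frames may attend. The code below is a minimal sketch under stated assumptions, not the authors' implementation: it builds a fundamental matrix from an assumed relative pose and intrinsics, then turns the distance of each key pixel from the query pixel's epipolar line into an additive attention bias.

```python
# Minimal sketch of an epipolar attention bias between two views.
# Shapes, names, and the quadratic penalty are illustrative assumptions.
import torch

def skew(t):
    """Cross-product matrix [t]_x for a 3-vector t (torch tensor)."""
    tx = torch.zeros(3, 3)
    tx[0, 1], tx[0, 2] = -t[2], t[1]
    tx[1, 0], tx[1, 2] = t[2], -t[0]
    tx[2, 0], tx[2, 1] = -t[1], t[0]
    return tx

def fundamental_matrix(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1} for the relative pose (R, t) from view 1 to view 2."""
    return torch.linalg.inv(K2).T @ skew(t) @ R @ torch.linalg.inv(K1)

def epipolar_attention_bias(K1, K2, R, t, h, w, sigma=2.0):
    """(h*w, h*w) bias: strongly negative where a key pixel in view 2 lies far
    from the epipolar line of the query pixel in view 1; add to attention logits."""
    F = fundamental_matrix(K1, K2, R, t)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pts = torch.stack([xs.flatten(), ys.flatten(), torch.ones(h * w)], dim=0)  # (3, N)
    lines = F @ pts                                  # epipolar line per query pixel, (3, N)
    a, b, c = lines[0], lines[1], lines[2]
    norm = torch.sqrt(a ** 2 + b ** 2).clamp(min=1e-8)
    # distance of every key pixel j from the epipolar line of every query pixel i
    dist = (a[:, None] * pts[0][None, :] + b[:, None] * pts[1][None, :] + c[:, None]) / norm[:, None]
    return -(dist.abs() / sigma) ** 2
```

Added to the logits of a cross-video attention layer before the softmax, such a bias concentrates each token's attention on positions near its epipolar line in the paired video.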
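Phase two's pose-free training can be emulated by warping each frame of a monocular clip with a smoothly interpolated random homography, so the warped copy looks like the same content seen under a different camera move. The interpolation scheme and perturbation magnitude below are assumptions for illustration, not the paper's exact augmentation.

```python
# Illustrative homography augmentation for a pose-free video clip (an assumption,
# not the paper's exact recipe). Frames are HxWx3 uint8 numpy arrays.
import cv2
import numpy as np

def random_homography(h, w, max_shift=0.05, rng=None):
    """Homography that perturbs the four image corners by up to max_shift of the image size."""
    rng = rng or np.random.default_rng()
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + rng.uniform(-max_shift, max_shift, size=(4, 2)) * [w, h]).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def warp_clip(frames, rng=None):
    """Warp a clip with a homography that eases from identity to a random target,
    mimicking a smooth simulated camera move."""
    h, w = frames[0].shape[:2]
    target = random_homography(h, w, rng=rng)
    identity = np.eye(3)
    warped = []
    for i, frame in enumerate(frames):
        alpha = i / max(len(frames) - 1, 1)
        # simple per-frame matrix blend; adequate for small perturbations
        H = (1 - alpha) * identity + alpha * target
        warped.append(cv2.warpPerspective(frame, H, (w, h)))
    return warped
```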
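The collaborative inference step can be summarized as pairwise denoising followed by per-video averaging. In the sketch below, pair_model stands in for the trained two-video denoiser and its call signature is an assumption; the pair-sampling policy is likewise illustrative.

```python
# Sketch of collaborative inference over N videos via pairwise noise averaging.
# `pair_model` and its interface are placeholders, not the authors' API.
import itertools
import torch

def collaborative_denoise_step(latents, t, pair_model, cams, num_pairs=None):
    """latents: (N, C, F, H, W) noisy video latents; cams: per-video camera trajectories."""
    n = latents.shape[0]
    pairs = list(itertools.combinations(range(n), 2))
    if num_pairs is not None:                      # optionally subsample pairs per step
        keep = torch.randperm(len(pairs))[:num_pairs]
        pairs = [pairs[k] for k in keep]
    eps_sum = torch.zeros_like(latents)
    counts = torch.zeros(n, device=latents.device)
    for i, j in pairs:
        # the pairwise model jointly predicts noise for both videos in the pair
        eps_i, eps_j = pair_model(latents[i], latents[j], t, cams[i], cams[j])
        eps_sum[i] += eps_i
        eps_sum[j] += eps_j
        counts[i] += 1
        counts[j] += 1
    # average each video's noise prediction over every pair it appeared in
    return eps_sum / counts.clamp(min=1).view(-1, 1, 1, 1, 1)
```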

Experimental Results

The paper's extensive experiments demonstrate several strengths of the CVD framework. Quantitatively, CVD outperforms baseline methods in terms of geometric and semantic consistency across multiple criteria. For instance:

  • On RealEstate10K scenes, CVD achieves a higher area under the cumulative error curve (AUC) for both rotation and translation errors than CameraCtrl and MotionCtrl (a sketch of this metric follows the list).
  • In dynamic scene evaluation using WebVid10M prompts, CVD maintains superior cross-video geometric consistency, showcasing the effectiveness of its epipolar attention mechanism.
  • The model also excels at preserving content fidelity and semantic matching, as evidenced by CLIP-based metrics in the experiments.
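
The AUC number reported above is the standard pose-error summary: the fraction of estimates whose error falls below a threshold, averaged over a sweep of thresholds. Below is a minimal sketch; the threshold range is an illustrative assumption, not the paper's exact evaluation setting.

```python
# Minimal sketch of the cumulative-error AUC metric; the 10-degree cap is an
# illustrative assumption.
import numpy as np

def pose_error_auc(errors_deg, max_threshold=10.0, num_steps=100):
    """Normalized area under the curve of accuracy vs. error threshold."""
    errors = np.asarray(errors_deg, dtype=np.float64)
    thresholds = np.linspace(0.0, max_threshold, num_steps + 1)[1:]
    accuracy = np.array([(errors <= th).mean() for th in thresholds])
    # mean accuracy over uniformly spaced thresholds equals the normalized AUC
    return float(accuracy.mean())
```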

Qualitatively, CVD delivers consistent visual content across videos with different camera trajectories, including dynamic elements such as waves and lightning.

Implications and Future Work

The implications of this work span both practical and theoretical realms. Practically, CVD can enhance applications in digital content creation, virtual reality, and 3D scene reconstruction by providing high-quality, coherent multi-view videos. Theoretically, the framework paves the way for integrating more sophisticated camera control mechanisms within generative models, stimulating further research in this direction.

Future developments could involve scaling up the model to handle more complex scenes and integrating real-world dynamic camera data to refine the cross-video synchronization. Enhancements in user control over camera trajectories could also be explored to make the video generation process more intuitive and precise.

Conclusion

In conclusion, the Collaborative Video Diffusion framework represents a significant step towards consistent multi-view video generation with camera control. By introducing cross-video synchronization through epipolar attention and deploying a hybrid training strategy, the model achieves remarkable performance improvements over existing methods. This research opens new avenues for applications requiring high-fidelity, multi-perspective video content, bolstering the capabilities of generative AI in visual computing.

References (64)
  1. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
  2. Universal guidance for diffusion models. In CVPR, 2023.
  3. Multidiffusion: Fusing diffusion paths for controlled image generation. In arXiv, 2023.
  4. Demystifying MMD GANs. In arXiv, 2018.
  5. Stable video diffusion: Scaling latent video diffusion models to large datasets. In arXiv, 2023.
  6. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023.
  7. Video generation models as world simulators. 2024.
  8. Generative rendering: Controllable 4d-guided video generation with 2d diffusion models. In CVPR, 2024.
  9. Diffdreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In ICCV, 2023.
  10. Pix2video: Video editing using image diffusion. In ICCV, 2023.
  11. GeNVS: Generative novel view synthesis with 3D-aware diffusion models. In ICCV, 2023.
  12. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023.
  13. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  14. Scenescape: Text-driven consistent scene generation. In arXiv, 2023.
  15. Preserve your own correlation: A noise prior for video diffusion models. In ICCV, 2023.
  16. Sparsectrl: Adding sparse controls to text-to-video diffusion models. In arXiv, 2023.
  17. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In arXiv, 2023.
  18. Cameractrl: Enabling camera control for text-to-video generation. In arXiv, 2024.
  19. Latent video diffusion models for high-fidelity long video generation. In arXiv, 2022.
  20. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  21. Imagen video: High definition video generation with diffusion models. In arXiv, 2022.
  22. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  23. Video diffusion models. In arXiv, 2022.
  24. Text2room: Extracting textured 3d meshes from 2d text-to-image models. In ICCV, 2023.
  25. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
  26. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In arXiv, 2023.
  27. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
  28. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In arXiv, 2023.
  29. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  30. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. In arXiv, 2023.
  31. Videogen: A reference-guided latent diffusion approach for high definition text-to-video generation. In arXiv, 2023.
  32. Infinitenature-zero: Learning perpetual view generation of natural scenes from single images. In ECCV, 2022.
  33. Infinite nature: Perpetual view generation of natural scenes from a single image. In ICCV, 2021.
  34. Zero-1-to-3: Zero-shot one image to 3d object. In arXiv, 2023.
  35. Syncdreamer: Generating multiview-consistent images from a single-view image. In arXiv, 2023.
  36. Wonder3d: Single image to 3d using cross-domain diffusion. In arXiv, 2023.
  37. High-fidelity performance metrics for generative models in pytorch, 2020.
  38. J. Plücker. Analytisch-Geometrische Entwicklungen. GD Baedeker, 1828.
  39. State of the art on diffusion models for visual computing. In arXiv, 2023.
  40. Learning transferable visual models from natural language supervision. In CoRR, 2021.
  41. A comparative analysis of ransac techniques leading to adaptive real-time random sample consensus. In ECCV, 2008.
  42. Hierarchical text-conditional image generation with clip latents. In arXiv, 2022.
  43. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  44. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  45. Photorealistic text-to-image diffusion models with deep language understanding. In arXiv, 2022.
  46. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020.
  47. Zero123++: a single image to consistent multi-view diffusion base model. In arXiv, 2023.
  48. Make-a-video: Text-to-video generation without text-video data. In arXiv, 2022.
  49. Denoising diffusion implicit models. In ICLR, 2020.
  50. Score-based generative modeling through stochastic differential equations. In ICLR, 2020.
  51. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023.
  52. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. In arXiv, 2023.
  53. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023.
  54. Attention is all you need. In NeurIPS, 2017.
  55. Videocomposer: Compositional video synthesis with motion controllability. In NeurIPS, 2023.
  56. Motionctrl: A unified and flexible motion controller for video generation. In arXiv, 2023.
  57. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In arXiv, 2022.
  58. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. In arXiv, 2023.
  59. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. In arXiv, 2023.
  60. Video probabilistic diffusion models in projected latent space. In CVPR, 2023.
  61. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  62. Diffcollage: Parallel generation of large content with diffusion models. In CVPR, 2023.
  63. Motiondirector: Motion customization of text-to-video diffusion models. In arXiv, 2023.
  64. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018.
Authors (7)
  1. Zhengfei Kuang (11 papers)
  2. Shengqu Cai (10 papers)
  3. Hao He (99 papers)
  4. Yinghao Xu (57 papers)
  5. Hongsheng Li (340 papers)
  6. Leonidas Guibas (177 papers)
  7. Gordon Wetzstein (144 papers)