
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos (2409.02095v2)

Published 3 Sep 2024 in cs.CV, cs.AI, and cs.GR

Abstract: Estimating video depth in open-world scenarios is challenging due to the diversity of videos in appearance, content motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. The generalization ability to open-world videos is achieved by training the video-to-depth model from a pre-trained image-to-video diffusion model, through our meticulously designed three-stage training strategy. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that can process extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets reveal that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.


Summary

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

In the paper "DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos," the authors tackle the challenging task of estimating video depth sequences in open-world scenarios characterized by diverse visual content, dynamic motion, and varying camera movements. This paper introduces DepthCrafter, a novel method designed to produce temporally consistent depth sequences without relying on supplementary data such as camera poses or optical flow.

Methodology

DepthCrafter employs a diffusion-based video-to-depth model built on a pre-trained image-to-video diffusion model and adapted through a three-stage training strategy. The model generates depth sequences of variable length, up to 110 frames at a time, matching the varied clip lengths typical of open-world video data. Training on a mixture of realistic and synthetic datasets lets the model capture both precise depth detail and rich content diversity.
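To make the pipeline concrete, below is a minimal PyTorch sketch of how a latent video-to-depth diffusion model of this kind can be run at inference time. The `vae`, `denoiser`, and `sigmas` objects are hypothetical stand-ins (not the paper's actual interfaces) for the pretrained video VAE, the fine-tuned spatio-temporal denoiser, and a noise schedule; the channel-concatenation conditioning and Euler sampling loop are illustrative choices, not a verbatim reproduction of the method.

```python
# Illustrative sketch only: `vae`, `denoiser`, and `sigmas` are hypothetical stand-ins.
import torch

@torch.no_grad()
def video_to_depth(frames, vae, denoiser, sigmas):
    """frames: (T, 3, H, W) RGB clip; T up to ~110 frames in DepthCrafter."""
    cond = vae.encode(frames)                  # video latents used as conditioning
    x = torch.randn_like(cond) * sigmas[0]     # depth latents start from pure noise
    for i in range(len(sigmas) - 1):
        # the denoiser sees noisy depth latents concatenated with the video latents
        denoised = denoiser(torch.cat([x, cond], dim=1), sigmas[i])
        d = (x - denoised) / sigmas[i]         # Euler step direction (EDM-style sampling)
        x = x + d * (sigmas[i + 1] - sigmas[i])
    depth = vae.decode(x)                      # decode latents back to image space
    return depth.mean(dim=1)                   # collapse channels to one depth map per frame
```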

The three-stage training strategy is a central contribution, enabling DepthCrafter to handle the long temporal contexts needed for temporal consistency while accurately modeling depth distributions across varying video lengths. The first stage adapts the pre-trained model to the video-to-depth task on a realistic dataset, the second fine-tunes the temporal layers on extended sequences, and the third refines the spatial layers on synthetic data for sharper depth details.
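As a compact reference, the schedule described above can be summarized as follows. Only the 110-frame upper bound comes from the paper; the sequence-length descriptions are assumptions added for illustration.

```python
# Illustrative summary of the three-stage training schedule described above.
TRAINING_STAGES = [
    {"stage": 1, "data": "realistic", "trainable": "spatial + temporal layers",
     "seq_len": "short clips"},                    # adapt the I2V model to video-to-depth
    {"stage": 2, "data": "realistic", "trainable": "temporal layers only",
     "seq_len": "long clips (up to 110 frames)"},  # extend the temporal context
    {"stage": 3, "data": "synthetic", "trainable": "spatial layers only",
     "seq_len": "moderate clips"},                 # sharpen fine-grained depth details
]
```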

For extremely long videos, the paper presents an efficient inference strategy that divides the video into overlapping segments, estimates depth segment by segment, and uses a noise initialization scheme that anchors each segment to its predecessor so depth scales stay consistent across adjoining segments. The overlapping frames are then stitched by interpolating between the two estimates, suppressing temporal discontinuities at segment boundaries.
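A hedged sketch of such segment-wise inference is shown below. It assumes a `predict_depth(frames)` callable (for example, the sketch above) that handles clips of up to `seg_len` frames; the linear cross-fade over the overlapping frames is an illustrative stitching choice rather than the paper's exact interpolation scheme, and the noise-initialization anchoring is omitted for brevity.

```python
# Illustrative segment-wise estimation with overlap blending; `predict_depth` is assumed.
import torch

def estimate_long_video(frames, predict_depth, seg_len=110, overlap=25):
    """frames: (T, 3, H, W); returns a (T, H, W) depth sequence."""
    T = frames.shape[0]
    depth = torch.zeros(T, *frames.shape[-2:])
    weight = torch.zeros(T)
    ramp = torch.linspace(0.0, 1.0, overlap)
    start = 0
    while start < T:
        end = min(start + seg_len, T)
        seg_depth = predict_depth(frames[start:end])  # (end - start, H, W)
        w = torch.ones(end - start)
        if start > 0:
            w[:overlap] = ramp                        # fade this segment in over the overlap
        if end < T:
            w[-overlap:] = 1.0 - ramp                 # fade it out where the next one takes over
        depth[start:end] += seg_depth * w[:, None, None]
        weight[start:end] += w
        if end == T:
            break
        start = end - overlap                         # next segment overlaps the current one
    return depth / weight[:, None, None]              # per-frame weighted blend of the segments
```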

Evaluation and Results

The evaluations demonstrate DepthCrafter’s state-of-the-art performance on several datasets covering indoor, outdoor, static, and dynamic scenes. DepthCrafter achieves significant improvements in metrics such as AbsRel and δ1 across these diverse datasets, indicating strong zero-shot generalization. Notably, both the qualitative and quantitative results show improved temporal consistency, avoiding the flickering that arises when single-image depth models are applied to videos frame by frame.
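For reference, the two reported metrics can be computed as follows. The least-squares scale-and-shift alignment is the common protocol for evaluating affine-invariant (relative) depth predictions; the exact alignment space used in the paper is not restated in this summary.

```python
# Hedged sketch of the standard AbsRel and δ1 metrics for relative depth.
import numpy as np

def eval_depth(pred, gt, mask):
    """pred, gt: (T, H, W) arrays; mask: boolean array of valid ground-truth pixels (> 0)."""
    p, g = pred[mask], gt[mask]
    # least-squares fit of scale s and shift t so that s * p + t ≈ g
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    aligned = s * p + t
    abs_rel = np.mean(np.abs(aligned - g) / g)                      # mean absolute relative error
    delta1 = np.mean(np.maximum(aligned / g, g / aligned) < 1.25)   # fraction within a 1.25 factor
    return abs_rel, delta1
```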

The paper provides comprehensive ablation studies on the training stages and the number of denoising steps, confirming that the outlined strategies contribute significantly to the model’s performance. It also reports inference-speed benchmarks, positioning DepthCrafter favorably against existing high-performing baselines.

Implications and Future Work

DepthCrafter’s ability to provide fine-grained depth estimation for variable-length videos enables several practical applications, including depth-based visual effects, conditional video generation, and uses in mixed reality, autonomous driving, and robotics. The paper also identifies future work on reducing computational cost and memory consumption, suggesting that model distillation and quantization could make practical deployment easier.

The broader implication is a convergence of video generation models with accurate depth estimation capabilities. Future work might integrate multi-modal data or refine the training pipeline with active learning driven by user-interaction loops, further improving accuracy in the edge cases prevalent in open-world scenarios.

Overall, the paper proposes a robust framework with considerable implications for AI-based video understanding, combining state-of-the-art diffusion techniques with broad real-world applicability. The model represents a substantial step toward resolving the inconsistent depth perception that has limited dynamic-scene representation.
