DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos (2409.02095v2)
Abstract: Estimating video depth in open-world scenarios is challenging due to the diversity of videos in appearance, content motion, camera movement, and length. We present DepthCrafter, a method for generating temporally consistent, long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. Generalization to open-world videos is achieved by training the video-to-depth model from a pre-trained image-to-video diffusion model through a meticulously designed three-stage training strategy. This training approach enables the model to generate depth sequences of variable length, up to 110 frames, in a single pass, and to harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that processes extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets show that DepthCrafter achieves state-of-the-art zero-shot performance in open-world video depth estimation. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
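The segment-wise inference strategy can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: `estimate_depth_segment` is a hypothetical stand-in for the diffusion model, and the 25-frame overlap, least-squares scale/shift alignment, and linear cross-fade are illustrative choices; DepthCrafter's actual stitching may differ (e.g., operating during denoising rather than as a post-hoc blend). Alignment on the shared frames is needed because each segment's depth is only up to an unknown scale and shift.

```python
import numpy as np

def estimate_depth_segment(frames: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the per-segment video-to-depth model.

    frames: (T, H, W, 3) array of RGB frames -> (T, H, W) relative depth.
    """
    raise NotImplementedError  # placeholder: run the video diffusion model here

def align_scale_shift(seg: np.ndarray, seg_ov: np.ndarray, ref_ov: np.ndarray) -> np.ndarray:
    """Fit a scale/shift on the overlapping frames (least squares) and apply it
    to the whole segment, since the predicted depth is relative, not metric."""
    s, t = np.polyfit(seg_ov.ravel(), ref_ov.ravel(), deg=1)
    return s * seg + t

def stitch_long_video(frames: np.ndarray, seg_len: int = 110, overlap: int = 25) -> np.ndarray:
    """Estimate depth segment by segment; consecutive segments share `overlap`
    frames, are aligned on those frames, then blended with a linear cross-fade."""
    n = len(frames)
    out = estimate_depth_segment(frames[:min(seg_len, n)])
    while len(out) < n:
        s0 = len(out) - overlap  # new segment starts inside the previous overlap
        seg = estimate_depth_segment(frames[s0:min(s0 + seg_len, n)])
        seg = align_scale_shift(seg, seg[:overlap], out[s0:])
        w = np.linspace(0.0, 1.0, overlap)[:, None, None]
        out[s0:] = (1.0 - w) * out[s0:] + w * seg[:overlap]  # cross-fade the overlap
        out = np.concatenate([out, seg[overlap:]], axis=0)
    return out
```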