Digging into contrastive learning for robust depth estimation with diffusion models (2404.09831v4)
Abstract: Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under the adverse conditions prevalent in real-world scenarios, such as rain and snow. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the 'trinity' contrastive scheme. This scheme uses the sampled noise of the forward diffusion process as a natural reference, guiding the noise predicted in diverse scenes toward a more stable and precise optimum. Moreover, we extend the noise-level trinity to the more generic feature and image levels, establishing a multi-level contrast that distributes the burden of robust perception across the whole network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and in real-world weather conditions. Source code and data are available at https://github.com/wangjiyuan9/D4RD.
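To make the noise-level 'trinity' idea more concrete, the following is a minimal sketch and not the D4RD implementation. It assumes the three parties are the sampled forward-diffusion noise, the noise predicted for a clean image, and the noise predicted for a corrupted (for example, rainy) view of the same scene. The function names, the squared-error form of each term, and the weight `w_pair` are all hypothetical; the abstract does not specify the actual loss.

```python
import math
import random


def forward_diffuse(depth, noise, alpha_bar):
    """Standard DDPM forward step: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * d + b * e for d, e in zip(depth, noise)]


def mse(u, v):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


def trinity_noise_loss(pred_clean, pred_aug, sampled_noise, w_pair=1.0):
    """Hypothetical three-way loss (an assumption, not the paper's formula).

    - pred_clean vs. sampled_noise: usual denoising supervision on the clean view
    - pred_aug   vs. sampled_noise: the sampled noise acts as a shared reference
    - pred_aug   vs. pred_clean:    distillation-style term pulling the corrupted-view
                                    prediction toward the clean-view prediction
    """
    loss_clean = mse(pred_clean, sampled_noise)
    loss_aug = mse(pred_aug, sampled_noise)
    loss_pair = mse(pred_aug, pred_clean)
    return loss_clean + loss_aug + w_pair * loss_pair


if __name__ == "__main__":
    random.seed(0)
    depth = [0.2, 0.5, 0.9, 0.4]                      # toy normalized depth values
    eps = [random.gauss(0.0, 1.0) for _ in depth]     # sampled forward-process noise
    x_t = forward_diffuse(depth, eps, alpha_bar=0.7)  # noisy depth fed to a denoiser

    # Stand-ins for denoiser outputs on the clean and corrupted views.
    pred_clean = [e + 0.01 for e in eps]
    pred_aug = [e + 0.20 for e in eps]
    print(trinity_noise_loss(pred_clean, pred_aug, eps))
```

In a real training loop the clean-view prediction in the pairwise term would typically be detached from the gradient so that it serves as a fixed teacher signal; that detail is also an assumption here.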