DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation (2405.16960v1)
Abstract: There has been a recent surge of interest in learning to perceive depth from monocular videos in an unsupervised fashion. A central difficulty in this field is achieving robust and accurate depth estimation in challenging scenarios, particularly in weakly textured regions and in the presence of dynamic objects. This study makes three major contributions by delving into dense correspondence priors to provide existing frameworks with explicit geometric constraints. The first novelty is a contextual-geometric depth consistency loss, which employs depth maps triangulated from dense correspondences, based on estimated ego-motion, to guide the learning of depth perception from contextual information, since explicitly triangulated depth maps capture accurate relative distances among pixels. The second novelty arises from the observation that there exists an explicit, deducible relationship between optical flow divergence and depth gradient. A differential property correlation loss is therefore designed to refine depth estimation, with a specific emphasis on local variations. The third novelty is a bidirectional stream co-adjustment strategy that enhances the interaction between rigid flow and optical flow, encouraging the former towards more accurate correspondences and making the latter more adaptable across various scenarios under the static-scene hypothesis. DCPI-Depth, a framework that incorporates all these innovative components and couples two bidirectional, collaborative streams, achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming prior arts. In particular, it produces accurate depth estimates in texture-less and dynamic regions, with more plausible local smoothness.
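The first contribution rests on classical two-view triangulation: given dense correspondences (e.g., from an optical flow network) and an estimated ego-motion, each pixel's depth can be recovered explicitly and then used as a geometric target for the depth network. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the per-pixel least-squares triangulation, the median scale alignment, and all function names here are assumptions.

```python
import numpy as np

def triangulate_depth(flow, K, R, t):
    """Per-pixel two-view triangulation of depth in the reference frame.

    flow : (H, W, 2) dense correspondences (pixel displacements, frame 1 -> 2),
           e.g. from an off-the-shelf optical flow network.
    K    : (3, 3) camera intrinsics.
    R, t : relative pose (frame 1 -> frame 2), R (3, 3) and t (3,).
    """
    H, W = flow.shape[:2]
    K_inv = np.linalg.inv(K)
    ys, xs = np.mgrid[0:H, 0:W]
    p1 = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    p1 = p1.reshape(-1, 3).T.astype(np.float64)   # homogeneous pixels, (3, N)
    p2 = p1.copy()
    p2[:2] += flow.reshape(-1, 2).T               # matched pixels in frame 2
    r1 = K_inv @ p1                               # back-projected rays, frame 1
    r2 = K_inv @ p2                               # back-projected rays, frame 2
    # Each pixel gives 3 equations in 2 unknown depths (d1, d2):
    #     d1 * (R @ r1) - d2 * r2 = -t
    # Solve the 2x2 normal equations in closed form for all pixels at once.
    a, b = R @ r1, -r2
    aa, bb, ab = (a * a).sum(0), (b * b).sum(0), (a * b).sum(0)
    at, bt = (a * t[:, None]).sum(0), (b * t[:, None]).sum(0)
    det = aa * bb - ab ** 2
    d1 = (ab * bt - bb * at) / np.maximum(det, 1e-12)
    return d1.reshape(H, W)                       # depth, up to the scale of t

def depth_consistency_loss(pred_depth, tri_depth, eps=1e-6):
    """L1 on log-depth after median scale alignment; a hypothetical stand-in
    for the paper's contextual-geometric depth consistency loss."""
    valid = tri_depth > eps                       # drop failed / behind-camera points
    scale = np.median(pred_depth[valid]) / np.median(tri_depth[valid])
    return np.mean(np.abs(np.log(pred_depth[valid])
                          - np.log(scale * tri_depth[valid])))
```

Because the triangulated depth is only defined up to the scale of the estimated translation, the sketch aligns medians before comparing; what such a target adds over a photometric loss is exactly what the abstract emphasizes, namely accurate relative distances among pixels.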
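The "explicit, deducible relationship" behind the second contribution can be made concrete in a simplified setting. The derivation below assumes a static scene and pure forward translation t_z of a calibrated camera, with the instantaneous (small-baseline) motion field in normalized image coordinates; this is an illustrative special case, not the paper's general formulation:

```latex
% Instantaneous rigid flow for a static scene under pure forward
% translation t_z, in normalized image coordinates \mathbf{x} = (x, y):
\mathbf{u}(\mathbf{x}) = \frac{t_z}{Z(\mathbf{x})}\,\mathbf{x}

% Taking the divergence (product rule, with \nabla\cdot\mathbf{x} = 2):
\nabla\cdot\mathbf{u}
  = t_z\left(\frac{\nabla\cdot\mathbf{x}}{Z}
  + \mathbf{x}\cdot\nabla\frac{1}{Z}\right)
  = \frac{2\,t_z}{Z} - \frac{t_z}{Z^{2}}\,\mathbf{x}\cdot\nabla Z
```

Matching the measured flow divergence therefore constrains both the inverse depth and its spatial gradient, a differential signal that remains informative in weakly textured regions where photometric losses flatten out, which is presumably why the correlation loss emphasizes local variations.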