On Robust Cross-View Consistency in Self-Supervised Monocular Depth Estimation (2209.08747v3)

Published 19 Sep 2022 in cs.CV

Abstract: Remarkable progress has been made in self-supervised monocular depth estimation (SS-MDE) by exploring cross-view consistency, e.g., photometric consistency and 3D point cloud consistency. However, these consistencies are very vulnerable to illumination variations, occlusions, texture-less regions, and moving objects, which makes them insufficiently robust across diverse scenes. To address this challenge, we study two kinds of robust cross-view consistency in this paper. First, the spatial offset field between adjacent frames is obtained by reconstructing the reference frame from its neighbors via deformable alignment, and is used to align the temporal depth features via a Depth Feature Alignment (DFA) loss. Second, the 3D point clouds of each reference frame and its nearby frames are computed and transformed into voxel space, where the point density in each voxel is calculated and aligned via a Voxel Density Alignment (VDA) loss. In this way, we exploit the temporal coherence in both depth feature space and 3D voxel space for SS-MDE, shifting the "point-to-point" alignment paradigm to a "region-to-region" one. Compared with the photometric consistency loss and the rigid point cloud alignment loss, the proposed DFA and VDA losses are more robust owing to the strong representation power of deep features and the high tolerance of voxel density to the aforementioned challenges. Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques. Extensive ablation studies and analyses validate the effectiveness of the proposed losses, especially in challenging scenes. The code and models are available at https://github.com/sunnyHelen/RCVC-depth.
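
To make the "region-to-region" idea concrete, below is a minimal PyTorch sketch of a voxel-density alignment term, assuming the point clouds have already been back-projected from predicted depth and expressed in a common (reference) camera frame. The function names (voxel_density, vda_loss), the grid bounds, the 32-voxel resolution, and the L1 comparison are illustrative assumptions, not the authors' implementation.

import torch

def voxel_density(points, grid_min, grid_max, resolution=32):
    # points:   (N, 3) tensor of 3D points, e.g. a depth map back-projected
    #           through the camera intrinsics
    # grid_min, grid_max: (3,) tensors bounding the voxel grid (in metres)
    # resolution: number of voxels per axis (hypothetical choice)
    scale = resolution / (grid_max - grid_min)
    idx = ((points - grid_min) * scale).long().clamp_(0, resolution - 1)
    flat = (idx[:, 0] * resolution + idx[:, 1]) * resolution + idx[:, 2]
    counts = torch.zeros(resolution ** 3, dtype=points.dtype, device=points.device)
    counts.scatter_add_(0, flat, torch.ones_like(flat, dtype=points.dtype))
    return counts / counts.sum().clamp(min=1.0)   # per-voxel point density

def vda_loss(points_ref, points_src_in_ref, grid_min, grid_max, resolution=32):
    # Compare the two density distributions region by region (L1 here is an
    # illustrative choice) instead of matching individual points.
    d_ref = voxel_density(points_ref, grid_min, grid_max, resolution)
    d_src = voxel_density(points_src_in_ref, grid_min, grid_max, resolution)
    return (d_ref - d_src).abs().sum()

In this sketch, points_ref would be the back-projected depth of the reference frame and points_src_in_ref the neighboring frame's points transformed into the reference camera coordinates with the predicted relative pose. Note that the hard .long() voxel assignment blocks gradients; a training-ready version would need a soft or straight-through assignment, which is omitted here for brevity.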
