
Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement (2302.09789v2)

Published 20 Feb 2023 in cs.CV

Abstract: Monocular depth estimation plays a fundamental role in computer vision. Because depth ground truth is costly to acquire, self-supervised methods that leverage adjacent frames to establish a supervisory signal have emerged as the most promising paradigm. In this work, we propose two novel ideas to improve self-supervised monocular depth estimation: 1) self-reference distillation and 2) disparity offset refinement. Specifically, we use a parameter-optimized model as the teacher, updated as training epochs progress, to provide additional supervision during training. The teacher model has the same structure as the student model, with weights inherited from the historical student model. In addition, a multiview check is introduced to filter out the outliers produced by the teacher model. Furthermore, we leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets, which incrementally refine the disparity output by aligning disparity information across scales. Experimental results on the KITTI and Make3D datasets show that our method outperforms previous state-of-the-art competitors.
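The two ideas in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract does not specify the exact weight-inheritance schedule, so the EMA-style update below (and the `momentum` value, the `update_teacher` and `refine_disparity` names, and the nearest-neighbour upsampling) are assumptions chosen for clarity.

```python
import numpy as np

def update_teacher(teacher_w, student_w, momentum=0.9):
    """Self-reference distillation: the teacher shares the student's
    architecture and inherits weights from the historical student.
    Here this is modeled as an exponential moving average (an assumption)."""
    return {k: momentum * teacher_w[k] + (1.0 - momentum) * student_w[k]
            for k in teacher_w}

def refine_disparity(coarse_disp, offset):
    """Disparity offset refinement: upsample a low-scale disparity map 2x
    (nearest-neighbour, for simplicity) and add a per-pixel offset predicted
    from cross-scale feature consistency."""
    up = coarse_disp.repeat(2, axis=0).repeat(2, axis=1)
    return up + offset

# Toy weights: after one update the teacher moves 10% toward the student.
teacher = update_teacher({"w": np.zeros(3)}, {"w": np.ones(3)})
print(teacher["w"])  # → [0.1 0.1 0.1]

# A 2x2 coarse disparity refined to 4x4 with a zero offset map.
refined = refine_disparity(np.array([[1.0, 2.0], [3.0, 4.0]]), np.zeros((4, 4)))
print(refined.shape)  # → (4, 4)
```

In the paper's pipeline, the teacher's predictions additionally pass a multiview check before being used as supervision, so outlier pseudo-labels are discarded rather than distilled.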

Authors (5)
  1. Zhong Liu (73 papers)
  2. Ran Li (191 papers)
  3. Shuwei Shao (14 papers)
  4. Xingming Wu (20 papers)
  5. Weihai Chen (29 papers)
Citations (25)
