$\mathrm{F^2Depth}$: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis (arXiv:2403.18443v1)

Published 27 Mar 2024 in cs.CV

Abstract: Self-supervised monocular depth estimation methods have received increasing attention because they do not require large labelled datasets. Such methods depend on high-quality salient features and consequently suffer a severe performance drop in indoor scenes, where the dominant low-textured regions are almost indiscriminative. To address this issue, we propose a self-supervised indoor monocular depth estimation framework called $\mathrm{F^2Depth}$. A self-supervised optical flow estimation network is introduced to supervise depth learning. To improve optical flow estimation in low-textured areas, only patches of points with more discriminative features are used for finetuning, based on our well-designed patch-based photometric loss. The finetuned optical flow estimation network generates high-accuracy optical flow as a supervisory signal for depth estimation, and a corresponding optical flow consistency loss is designed. Multi-scale feature maps produced by the finetuned optical flow network are warped to compute a feature map synthesis loss as another supervisory signal for depth learning. Experimental results on the NYU Depth V2 dataset demonstrate the effectiveness of the framework and our proposed losses. To evaluate the generalization ability of $\mathrm{F^2Depth}$, we collect a Campus Indoor depth dataset composed of approximately 1500 points selected from 99 images in 18 scenes. Zero-shot generalization experiments on the 7-Scenes dataset and Campus Indoor achieve $\delta_1$ accuracy of 75.8% and 76.0%, respectively. These results show that our model generalizes well to monocular images captured in unknown indoor scenes.
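As a rough illustration (not the authors' implementation), the sketch below shows one way the two supervisory signals described in the abstract could be computed in PyTorch. The helper names `warp_with_flow`, `flow_consistency_loss`, and `feature_synthesis_loss` are hypothetical, and the flow convention (channel 0 = horizontal displacement in pixels) is an assumption.

```python
# Minimal sketch (assumed PyTorch formulation, not the authors' code) of the
# two supervisory signals: optical flow consistency and feature map synthesis.
import torch
import torch.nn.functional as F


def warp_with_flow(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a (B, C, H, W) feature map with a (B, 2, H, W) flow.

    Assumes flow channel 0 is horizontal (x) and channel 1 is vertical (y)
    displacement in pixels.
    """
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    # Displace the pixel grid by the flow, then normalize coordinates to
    # [-1, 1] as required by grid_sample.
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(
        feat, torch.stack((grid_x, grid_y), dim=-1), align_corners=True
    )


def flow_consistency_loss(rigid_flow: torch.Tensor,
                          net_flow: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the flow induced by the predicted depth
    and pose (rigid_flow) and the finetuned flow network's output (net_flow)."""
    return (rigid_flow - net_flow).abs().mean()


def feature_synthesis_loss(tgt_feats, src_feats, flows):
    """L1 distance between target feature maps and source feature maps warped
    into the target view, summed over pyramid scales. Each flow is assumed to
    have been resized to the resolution of its scale."""
    loss = 0.0
    for tgt, src, flow in zip(tgt_feats, src_feats, flows):
        loss = loss + (tgt - warp_with_flow(src, flow)).abs().mean()
    return loss
```

In the full method, `rigid_flow` would be synthesized from the predicted depth map and relative camera pose, and occlusion masking would typically be applied before averaging; both are omitted here for brevity.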

