SPIdepth: Strengthened Pose Information for Self-supervised Monocular Depth Estimation (2404.12501v3)

Published 18 Apr 2024 in cs.CV and eess.IV

Abstract: Self-supervised monocular depth estimation has garnered considerable attention for its applications in autonomous driving and robotics. While recent methods have made strides in leveraging techniques like the Self Query Layer (SQL) to infer depth from motion, they often overlook the potential of strengthening pose information. In this paper, we introduce SPIdepth, a novel approach that prioritizes enhancing the pose network for improved depth estimation. Building upon the foundation laid by SQL, SPIdepth emphasizes the importance of pose information in capturing fine-grained scene structures. By enhancing the pose network's capabilities, SPIdepth achieves remarkable advancements in scene understanding and depth estimation. Experimental results on benchmark datasets such as KITTI, Cityscapes, and Make3D showcase SPIdepth's state-of-the-art performance, surpassing previous methods by significant margins. Specifically, SPIdepth tops the self-supervised KITTI benchmark. Additionally, SPIdepth achieves the lowest AbsRel (0.029), SqRel (0.069), and RMSE (1.394) on KITTI, establishing new state-of-the-art results. On Cityscapes, SPIdepth shows improvements over SQLdepth of 21.7% in AbsRel, 36.8% in SqRel, and 16.5% in RMSE, even without using motion masks. On Make3D, SPIdepth in zero-shot outperforms all other models. Remarkably, SPIdepth achieves these results using only a single image for inference, surpassing even methods that utilize video sequences for inference, thus demonstrating its efficacy and efficiency in real-world applications. Our approach represents a significant leap forward in self-supervised monocular depth estimation, underscoring the importance of strengthening pose information for advancing scene understanding in real-world applications. The code and pre-trained models are publicly available at https://github.com/Lavreniuk/SPIdepth.
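The self-supervised signal described in the abstract rests on view synthesis: a depth network predicts per-pixel depth for the target frame, a pose network predicts the relative camera motion to a source frame, and pixels are warped between the two so a photometric loss can be computed, which is why the quality of the pose estimate matters. As a minimal illustration of that warping step only (not the paper's implementation; the function name, intrinsics, and pose values below are assumptions for the example), a pixel can be backprojected with its predicted depth, rigidly transformed by the predicted pose, and reprojected into the source view:

```python
def reproject(px, py, depth, K, R, t):
    """Warp a target-frame pixel into the source frame.

    Backprojects pixel (px, py) using its predicted depth, applies the
    predicted relative pose (rotation R, translation t), and projects
    the 3-D point with the pinhole intrinsics K.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    # backproject to a 3-D point in the target camera frame
    X = (px - cx) / fx * depth
    Y = (py - cy) / fy * depth
    Z = depth
    # rigid transform into the source camera frame
    Xs = R[0][0] * X + R[0][1] * Y + R[0][2] * Z + t[0]
    Ys = R[1][0] * X + R[1][1] * Y + R[1][2] * Z + t[1]
    Zs = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + t[2]
    # project back to source-frame pixel coordinates
    return fx * Xs / Zs + cx, fy * Ys / Zs + cy
```

With an identity rotation and zero translation the pixel maps to itself; any error in the predicted pose shifts the warped location, and the resulting photometric mismatch is exactly the gradient signal that a strengthened pose network is meant to sharpen.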

