
Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes (2312.15268v6)

Published 23 Dec 2023 in cs.CV

Abstract: Despite advancements in self-supervised monocular depth estimation, challenges persist in dynamic scenarios due to the dependence on assumptions about a static world. In this paper, we present Manydepth2, a method that achieves precise depth estimation for both dynamic objects and static backgrounds while maintaining computational efficiency. To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame. This frame is then used to build a motion-aware cost volume in collaboration with the vanilla target frame. Furthermore, to improve the accuracy and robustness of the network architecture, we propose an attention-based depth network that effectively integrates information from feature maps at different resolutions by incorporating both channel and non-local attention mechanisms. Compared to methods with similar computational costs, Manydepth2 reduces the root-mean-square error of self-supervised monocular depth estimation on the KITTI-2015 dataset by approximately five percent. The code can be found at https://github.com/kaichen-z/Manydepth2.
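The core idea above (warp a reference frame into a pseudo-static one, then sweep depth hypotheses against the target frame) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: `warp_with_flow`, `motion_aware_cost_volume`, and `rigid_flow_fn` are hypothetical names, the sampling is nearest-neighbour rather than bilinear, and the photometric cost is plain L1 instead of the learned feature matching used in practice.

```python
import numpy as np

def warp_with_flow(img, flow):
    """Backward-warp a grayscale frame with a dense optical-flow field.

    img:  (H, W) array
    flow: (H, W, 2) per-pixel (dx, dy) displacements
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xs2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    ys2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return img[ys2, xs2]

def motion_aware_cost_volume(target, pseudo_static, depth_hyps, rigid_flow_fn):
    """Plane-sweep cost volume between the target frame and a
    pseudo-static reference frame.

    For each hypothesised depth d, warp the pseudo-static frame with the
    camera-motion-induced (rigid) flow for that depth and record the L1
    photometric difference against the target.  Returns (D, H, W).
    """
    costs = []
    for d in depth_hyps:
        warped = warp_with_flow(pseudo_static, rigid_flow_fn(d))
        costs.append(np.abs(target.astype(float) - warped.astype(float)))
    return np.stack(costs)
```

In this sketch, `pseudo_static` would itself come from warping the raw reference frame by the non-rigid residual flow (full optical flow minus the rigid flow induced by camera motion and coarse depth), so that moving objects appear where a static scene would place them; the cost volume then behaves as if the scene were static.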
