FA-Depth: Toward Fast and Accurate Self-supervised Monocular Depth Estimation (2405.10885v3)
Abstract: Most existing methods rely on complex models to predict scene depth with high accuracy, resulting in slow inference that hinders deployment. To better balance precision and speed, we first design SmallDepth, a lightweight network based on sparsity. Second, to enhance the feature representation ability of SmallDepth during training while keeping its inference complexity unchanged, we propose an equivalent transformation module (ETM). Third, to improve each layer's ability to perceive different context information given a fixed SmallDepth, and to make SmallDepth more robust to changes in the left-right direction and in illumination, we propose a pyramid loss. Fourth, to further improve the accuracy of SmallDepth, we use the proposed function approximation loss (APX) to transfer knowledge to SmallDepth from a pretrained HQDecv2, an improved version of the previous HQDec that addresses grid artifacts in some regions. Extensive experiments demonstrate that each proposed component improves the precision of SmallDepth without changing its inference complexity, and that the resulting approach achieves state-of-the-art results on KITTI at an inference speed of more than 500 frames per second with approximately 2 M parameters. The code and models will be publicly available at https://github.com/fwucas/FA-Depth.
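The abstract's claim of stronger feature representation during training at no extra inference cost is characteristic of structural re-parameterization (training-time multi-branch blocks fused into a single operator for deployment). The following is a minimal, hypothetical PyTorch sketch of that general idea, not the paper's actual ETM, whose branch design is not specified here: a 3x3 + 1x1 two-branch block that is algebraically fused into one 3x3 convolution for inference.

```python
# Sketch of a generic re-parameterizable block (illustrative only; the class
# name ReparamConv and the 3x3 + 1x1 branch choice are assumptions, not the ETM).
import torch
import torch.nn as nn


class ReparamConv(nn.Module):
    """Training: parallel 3x3 and 1x1 convs. Inference: a single fused 3x3 conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(channels, channels, 1, padding=0, bias=True)
        self.fused = None  # populated by fuse()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.fused is not None:                # inference path: one conv only
            return self.fused(x)
        return self.conv3(x) + self.conv1(x)      # training path: two branches

    @torch.no_grad()
    def fuse(self) -> None:
        # Embed the 1x1 kernel at the center of the 3x3 kernel and sum the biases,
        # so the single fused conv computes the same function as the branch sum.
        w = self.conv3.weight.clone()
        w[:, :, 1:2, 1:2] += self.conv1.weight
        b = self.conv3.bias + self.conv1.bias
        fused = nn.Conv2d(w.shape[1], w.shape[0], 3, padding=1, bias=True)
        fused.weight.copy_(w)
        fused.bias.copy_(b)
        self.fused = fused.to(w.device)


if __name__ == "__main__":
    x = torch.randn(1, 16, 32, 32)
    block = ReparamConv(16)
    y_train = block(x)   # two-branch training-time output
    block.fuse()
    y_infer = block(x)   # single-conv inference-time output
    print(torch.allclose(y_train, y_infer, atol=1e-5))  # True up to FP error
```

After `fuse()`, the single convolution reproduces the two-branch output up to floating-point error, so the extra capacity used during optimization is retained while the deployed model pays for only one convolution, matching the abstract's requirement of equal complexity during inference.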