Towards Better Data Exploitation in Self-Supervised Monocular Depth Estimation (2309.05254v3)
Abstract: Depth estimation plays an important role in robotic perception systems. The self-supervised monocular paradigm has gained significant attention since it frees training from reliance on depth annotations. Despite recent advancements, existing self-supervised methods still underutilize the available training data, limiting their generalization ability. In this paper, we employ two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of the training data. Specifically, the original image and the two augmented images are fed into the training pipeline simultaneously, and we leverage them to conduct self-distillation. Additionally, we introduce a detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder to improve the restoration of fine details in depth maps. Experimental results demonstrate that our method achieves state-of-the-art performance on the KITTI benchmark, with both the raw and the improved ground truth. Moreover, our models also show superior generalization performance when transferred to the Make3D and NYUv2 datasets. Our code is available at https://github.com/Sauf4896/BDEdepth.
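To make the two augmentations concrete, below is a minimal PyTorch sketch of Resizing-Cropping and Splitting-Permuting as the abstract describes them. The function names, scale range, and split bounds are illustrative assumptions for this sketch, not the repository's exact implementation.

```python
# A minimal sketch of the two augmentations, assuming (B, 3, H, W) image
# tensors. Scale range and split bounds are assumed values, not the
# settings used in the paper's code.
import torch
import torch.nn.functional as F


def resize_crop(img, scale_range=(1.0, 1.2)):
    """Resizing-Cropping: enlarge the image by a random factor, then crop
    a window of the original size at a random offset. (In the full method
    the camera intrinsics must be rescaled accordingly; omitted here.)"""
    _, _, h, w = img.shape
    s = torch.empty(1).uniform_(*scale_range).item()
    nh, nw = int(h * s), int(w * s)
    up = F.interpolate(img, size=(nh, nw), mode="bilinear", align_corners=False)
    top = torch.randint(0, nh - h + 1, (1,)).item()
    left = torch.randint(0, nw - w + 1, (1,)).item()
    return up[:, :, top:top + h, left:left + w]


def split_permute(img, lo=0.3, hi=0.7):
    """Splitting-Permuting: split the image once along each axis at a
    random position and swap the resulting pieces, yielding a re-stitched
    image of the same size."""
    _, _, h, w = img.shape
    ph = torch.randint(int(lo * h), int(hi * h) + 1, (1,)).item()
    pw = torch.randint(int(lo * w), int(hi * w) + 1, (1,)).item()
    img = torch.cat([img[:, :, ph:, :], img[:, :, :ph, :]], dim=2)  # swap top/bottom
    return torch.cat([img[:, :, :, pw:], img[:, :, :, :pw]], dim=3)  # swap left/right


# Usage: the original image and both augmented views enter the training
# pipeline together, e.g. with a dummy batch at a KITTI-like resolution.
img = torch.rand(2, 3, 192, 640)
views = [img, resize_crop(img), split_permute(img)]
```

In the full pipeline, depth predictions on the two augmented views would additionally be supervised by the prediction on the original image (the self-distillation mentioned in the abstract); that loss term is omitted from this sketch.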