Self-supervised Event-based Monocular Depth Estimation using Cross-modal Consistency (2401.07218v1)
Abstract: An event camera is a novel vision sensor that captures per-pixel brightness changes and outputs a stream of asynchronous "events". Thanks to its high temporal resolution, high dynamic range, low bandwidth, low power consumption, and absence of motion blur, it has advantages over conventional cameras in scenes with high-speed motion and challenging lighting conditions. Consequently, several supervised methods for monocular depth estimation from events have been proposed to handle scenes that are difficult for conventional cameras. However, depth annotation is costly and time-consuming. In this paper, to lower the annotation cost, we propose a self-supervised event-based monocular depth estimation framework named EMoDepth. EMoDepth constrains the training process using cross-modal consistency with intensity frames that are aligned with the events in pixel coordinates; at inference, only events are used for monocular depth prediction. Additionally, we design a multi-scale skip-connection architecture that effectively fuses features for depth estimation while maintaining high inference speed. Experiments on the MVSEC and DSEC datasets demonstrate that our contributions are effective and that our method outperforms existing supervised event-based and unsupervised frame-based methods in accuracy.
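To make the cross-modal consistency idea concrete, the sketch below shows one plausible form of such a training objective: the depth network consumes only events, while the aligned intensity frames supervise training through a view-synthesis (photometric) loss. This is not the authors' code; the SSIM + L1 combination and the alpha = 0.85 weighting follow the common Monodepth2-style convention and are assumptions here.

```python
# Minimal sketch of a cross-modal photometric consistency loss (assumed form,
# not the paper's implementation). Depth is predicted from events; supervision
# comes from comparing an aligned intensity frame with a frame re-synthesized
# using the predicted depth and an estimated relative pose.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural dissimilarity between two images of shape (B, C, H, W)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1)

def photometric_loss(frame, frame_warped, alpha=0.85):
    """Cross-modal consistency term: target intensity frame vs. the frame
    warped from a source view using the event-predicted depth and pose."""
    l1 = torch.abs(frame - frame_warped)
    return (alpha * ssim(frame, frame_warped) + (1 - alpha) * l1).mean()
```

Because the intensity frames appear only inside this loss, they can be discarded at test time, which is consistent with the abstract's claim that inference uses events alone.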
- Junyu Zhu
- Lina Liu
- Bofeng Jiang
- Feng Wen
- Hongbo Zhang
- Wanlong Li
- Yong Liu