Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation (2312.04530v3)
Abstract: In this paper, we introduce a novel training method that lets any monocular depth network learn absolute scale and estimate metric road-scene depth from regular training data alone, i.e., driving videos. We refer to this training framework as FUMET. The key idea is to leverage cars found on the road as sources of scale supervision and to incorporate them robustly into network training. FUMET detects cars in a frame, estimates their sizes, and aggregates the scale information extracted from them into an estimate of the camera height; the consistency of this estimate across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any otherwise scale-oblivious monocular depth network, so that it becomes not only scale-aware but also metric-accurate without auxiliary sensors or extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of FUMET, which achieves state-of-the-art accuracy. We also show that FUMET enables training on mixed datasets with different camera heights, which leads to larger-scale training and better generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and FUMET democratizes its deployment by establishing the means to convert any model into a metric depth estimator.
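The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the FUMET implementation: the function name `estimate_camera_height`, its inputs, and the use of a median as the robust aggregator are all assumptions made for illustration. The sketch assumes each detected car has a size measured in the network's unscaled depth units and a prior metric size, and that the camera height is also available in the same unscaled units.

```python
from statistics import median


def estimate_camera_height(unscaled_camera_height, unscaled_car_sizes, prior_car_sizes):
    """Illustrative only: turn per-car scale cues into a metric camera height.

    unscaled_camera_height: camera height in the network's arbitrary depth units.
    unscaled_car_sizes: per-car size (e.g. height) measured in those same units.
    prior_car_sizes: assumed metric size, in meters, for each of those cars.
    """
    if len(unscaled_car_sizes) != len(prior_car_sizes):
        raise ValueError("each car needs both a measured size and a prior size")

    # Each car votes for a scale factor that maps unscaled units to meters.
    scales = [
        prior / measured
        for measured, prior in zip(unscaled_car_sizes, prior_car_sizes)
        if measured > 0
    ]
    if not scales:
        raise ValueError("no usable cars in this frame")

    # The median is an assumed aggregator, chosen here so that a single badly
    # measured car does not dominate the estimate.
    return median(scales) * unscaled_camera_height


# Three cars, one of them badly measured (2.0 units instead of roughly 0.5).
height_m = estimate_camera_height(0.5, [0.5, 0.6, 2.0], [1.5, 1.5, 1.5])
```

Because the camera is rigidly mounted, per-frame estimates like `height_m` should agree across a video sequence, and that agreement is what the abstract describes as the source of scale supervision.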