Depth Prompting for Sensor-Agnostic Depth Estimation (2405.11867v1)
Abstract: Dense depth maps are a key element of visual perception tasks. Tremendous effort has gone into improving depth quality, ranging from optimization-based to learning-based methods. Despite this long-standing progress, real-world applicability remains limited by systematic measurement biases such as density, sensing pattern, and scan range, which are well known to hinder generalization. We observe that learning a joint representation of the input modalities (e.g., images and depth), as most recent methods do, is sensitive to these biases. In this work, we disentangle those modalities to mitigate the biases through prompt engineering. To this end, we design a novel depth prompt module that produces the desired feature representation for new depth distributions arising from different sensor types or scene configurations. Our depth prompt can be embedded into foundation models for monocular depth estimation, freeing the pretrained model from the constraint of a fixed depth scan range and enabling it to produce absolute-scale depth maps. We demonstrate the effectiveness of our method through extensive evaluations. Source code is publicly available at https://github.com/JinhwiPark/DepthPrompting .
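The abstract describes the method only at a conceptual level. As a rough illustration, a minimal PyTorch-style sketch of the core idea is shown below: a small prompt module encodes sparse sensor depth (together with its validity mask) and conditions the output of a frozen relative-depth backbone to recover metric scale. All module names, layer choices, and shapes here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a depth prompt module; names and architecture are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn


class DepthPromptModule(nn.Module):
    """Encodes sparse metric depth into a prompt that conditions a frozen
    relative-depth backbone, so the final prediction carries absolute scale."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Encode the sparse depth map together with its validity mask,
        # so the module stays agnostic to density and sensing pattern.
        self.prompt_encoder = nn.Sequential(
            nn.Conv2d(2, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        # Fuse the prompt with the backbone's relative-depth output and
        # regress a metric depth map.
        self.fusion_head = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
        )

    def forward(self, relative_depth: torch.Tensor, sparse_depth: torch.Tensor) -> torch.Tensor:
        valid_mask = (sparse_depth > 0).float()
        prompt = self.prompt_encoder(torch.cat([sparse_depth, valid_mask], dim=1))
        fused = torch.cat([relative_depth, prompt], dim=1)
        return self.fusion_head(fused)


if __name__ == "__main__":
    # Toy usage: the frozen foundation model is stubbed out with a random
    # relative-depth map; only the prompt module would be trained.
    B, H, W = 1, 240, 320
    relative_depth = torch.rand(B, 1, H, W)      # stand-in for the frozen backbone's output
    sparse_depth = torch.zeros(B, 1, H, W)
    sparse_depth[:, :, ::20, ::20] = 5.0         # simulated sparse LiDAR-like samples (meters)

    model = DepthPromptModule()
    metric_depth = model(relative_depth, sparse_depth)
    print(metric_depth.shape)  # torch.Size([1, 1, 240, 320])
```

In this sketch, keeping the backbone frozen and training only the prompt module mirrors the paper's stated goal of adapting a pretrained foundation model to new depth distributions without retraining it.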