WorDepth: Variational Language Prior for Monocular Depth Estimation (2404.03635v4)
Abstract: Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, e.g., scale. Predicting a 3D scene from text description(s) is similarly ill-posed, e.g., in the spatial arrangement of the objects described. We investigate whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. We begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder; the sample is then decoded to the output depth map. Our approach is trained by alternating between the text and image branches: in one optimization step, we predict the mean and standard deviation from the text description and sample from a standard Gaussian, and in the other, we sample using an image-conditional sampler. Once trained, we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where we show that language consistently improves performance in both.
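The alternating scheme described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the strict even/odd alternation, and the use of plain Python lists in place of network outputs are all assumptions made for illustration. The only idea taken from the abstract is that the latent code is formed from a text-predicted mean and standard deviation, with the noise term drawn either from a standard Gaussian (text branch) or supplied by an image-conditional sampler (image branch).

```python
import random


def reparameterize(mu, sigma, eps):
    """Form the latent code z = mu + sigma * eps, element-wise."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def sample_standard_gaussian(dim, rng):
    """Draw `dim` independent samples from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def branch_for_step(step):
    """Assumed schedule: even steps use the text branch, odd steps the image branch."""
    return "text" if step % 2 == 0 else "image"


def latent_for_step(step, mu, sigma, image_eps, rng):
    """Pick the noise source for this optimization step and build the latent code.

    mu, sigma : hypothetical outputs of the variational text encoder.
    image_eps : hypothetical output of the image-conditional sampler.
    """
    if branch_for_step(step) == "text":
        eps = sample_standard_gaussian(len(mu), rng)  # text branch: standard Gaussian noise
    else:
        eps = image_eps  # image branch: noise chosen by the conditional sampler
    return reparameterize(mu, sigma, eps)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = [1.0, 2.0], [0.5, 2.0]
    for step in range(4):
        z = latent_for_step(step, mu, sigma, image_eps=[2.0, -1.0], rng=rng)
        print(step, branch_for_step(step), z)
```

In this sketch, inference would correspond to always taking the image branch: the text supplies the mean and standard deviation, and the image selects a point within that distribution, which a depth decoder (omitted here) would map to a depth map.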