Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning (2404.03658v1)

Published 4 Apr 2024 in cs.CV

Abstract: Recovering the 3D scene geometry from a single view is a fundamental yet ill-posed problem in computer vision. While classical depth estimation methods infer only a 2.5D scene representation limited to the image plane, recent approaches based on radiance fields reconstruct a full 3D representation. However, these methods still struggle with occluded regions since inferring geometry without visual observation requires (i) semantic knowledge of the surroundings, and (ii) reasoning about spatial context. We propose KYN, a novel method for single-view scene reconstruction that reasons about semantic and spatial context to predict each point's density. We introduce a vision-language modulation module to enrich point features with fine-grained semantic information. We aggregate point representations across the scene through a language-guided spatial attention mechanism to yield per-point density predictions aware of the 3D semantic context. We show that KYN improves 3D shape recovery compared to predicting density for each 3D point in isolation. We achieve state-of-the-art results in scene and object reconstruction on KITTI-360, and show improved zero-shot generalization compared to prior work. Project page: https://ruili3.github.io/kyn.


Summary

  • The paper introduces KYN, a novel approach that integrates vision-language modulation and spatial reasoning to enhance reconstruction of occluded areas.
  • It employs a language-guided spatial attention mechanism that aggregates enriched 3D point features to produce coherent, semantically informed density predictions.
  • Experiments on KITTI-360 and DDAD datasets validate its state-of-the-art performance and robust zero-shot generalization in scene and object-level reconstructions.

Improving Single-View Reconstruction with Spatial Vision-Language Reasoning

Introduction

Single-view reconstruction aims to infer the 3D geometry of a scene from a single image, a task fundamental to various applications in computer vision. Despite advancements in depth estimation and radiance field methods, reconstructing occluded regions remains challenging. These methods often lack the semantic understanding and spatial reasoning necessary for accurate geometry inference in unobserved areas. This paper introduces Know Your Neighbors (KYN), a novel approach that leverages semantic knowledge and spatial context to improve the accuracy of single-view scene reconstruction. KYN features a vision-language (VL) modulation module and a language-guided spatial attention mechanism, enhancing point feature representations with semantic information and aggregating these across the scene for informed density predictions.

Vision-Language Modulation

KYN first extracts visual features from a standard image encoder and a vision-language (VL) image encoder and fuses them. The fused features are then enriched with semantic information obtained from category-wise text features through a VL modulation module, which augments each 3D point feature with semantic cues so that visual and textual information jointly form a richer representation of every point in space.
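To make the idea concrete, the sketch below shows one plausible form such a modulation step could take: a FiLM-style scale-and-shift of per-point features conditioned on text-derived semantics. All class names, shapes, and the soft category assignment are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a vision-language modulation step (assumed FiLM-style
# conditioning; not the paper's actual code). A per-point feature sampled from
# the image encoders is modulated by a semantic embedding derived from
# category-wise text features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLModulation(nn.Module):
    def __init__(self, point_dim: int, text_dim: int):
        super().__init__()
        # Predict per-channel scale and shift from the text-derived semantics.
        self.to_scale = nn.Linear(text_dim, point_dim)
        self.to_shift = nn.Linear(text_dim, point_dim)

    def forward(self, point_feat: torch.Tensor, vl_feat: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        # point_feat: (N, point_dim)  features of N 3D points
        # vl_feat:    (N, text_dim)   vision-language features projected to the points
        # text_feats: (C, text_dim)   one embedding per category prompt
        # Soft-assign each point to categories, then pool the text features.
        sim = F.softmax(vl_feat @ text_feats.t(), dim=-1)   # (N, C)
        semantics = sim @ text_feats                         # (N, text_dim)
        scale = self.to_scale(semantics)
        shift = self.to_shift(semantics)
        return point_feat * (1.0 + scale) + shift            # modulated point features
```

The key design point this illustrates is that the semantic signal enters as a conditioning term on point features rather than as a separate prediction head, so downstream density estimation can exploit it directly.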

Spatial Attention with Vision-Language Guidance

Building on the enriched point-wise features, KYN applies a VL-guided spatial attention mechanism that aggregates features across the scene, so that the density prediction for each point accounts for the semantic context of neighboring points and yields a more coherent, plausible reconstruction of occluded regions. By guiding the attention with text-based category features, the model exploits both global and local semantic context and improves reconstruction accuracy over prior methods that predict density for each 3D point in isolation.
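As a rough illustration of this aggregation step, the following sketch lets every point attend to all other points before a small head predicts density. It uses a plain multi-head attention layer as a stand-in; the paper's language-guided attention and any efficiency measures are abstracted away, and all names and shapes are assumptions.

```python
# Minimal, hypothetical sketch of scene-level feature aggregation for density
# prediction (standard multi-head attention used in place of the paper's
# language-guided mechanism).
import torch
import torch.nn as nn

class SpatialAttentionDensity(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.density_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Softplus(),  # non-negative density
        )

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (B, N, dim) modulated features of N points per scene
        ctx, _ = self.attn(point_feats, point_feats, point_feats)  # scene context
        return self.density_head(point_feats + ctx).squeeze(-1)    # (B, N) densities
```

The residual combination of each point's own feature with the attended context mirrors the intuition in the text: a point's density should reflect both its local appearance and what its semantic neighbors suggest lies in occluded space.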

Experimental Validation

KYN's effectiveness is demonstrated through extensive experiments on the KITTI-360 dataset, where it achieves state-of-the-art performance in both scene and object-level reconstructions. Notably, KYN shows considerable improvement in accurately modeling occluded areas, mitigating trailing effects commonly observed in prior work. Furthermore, the application of KYN to the DDAD dataset illustrates its robust zero-shot generalization capability, underscoring the benefit of leveraging semantic and contextual knowledge in single-view reconstruction tasks.

Ablation Studies and Comparisons

A series of ablation studies highlight the individual contributions of the VL modulation and spatial attention mechanisms within KYN. These studies affirm the importance of integrating fine-grained semantic information and global-to-local spatial reasoning for improving single-view reconstruction outputs. Additionally, comparisons with existing semantic feature fusion techniques further validate the superiority of KYN's approach, combining VL features with spatial attention to achieve more accurate and semantically coherent reconstructions.

Conclusion and Future Directions

KYN represents a significant step forward in single-view scene reconstruction, addressing the limitations of existing methods by effectively incorporating semantic and spatial context into the reconstruction process. The introduction of VL features not only enhances single-view reconstruction accuracy but also presents exciting avenues for future research in open-vocabulary 3D scene understanding and modeling.
