OVO: Open-Vocabulary Occupancy (2305.16133v2)
Abstract: Semantic occupancy prediction aims to infer dense geometry and semantics of surroundings for an autonomous agent to operate safely in the 3D environment. Existing occupancy prediction methods are almost entirely trained on human-annotated volumetric data. Although of high quality, the generation of such 3D annotations is laborious and costly, restricting them to a few specific object categories in the training dataset. To address this limitation, this paper proposes Open Vocabulary Occupancy (OVO), a novel approach that allows semantic occupancy prediction of arbitrary classes without the need for 3D annotations during training. The keys to our approach are (1) knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D occupancy network, and (2) pixel-voxel filtering for high-quality training data generation. The resulting framework is simple, compact, and compatible with most state-of-the-art semantic occupancy prediction models. On the NYUv2 and SemanticKITTI datasets, OVO achieves competitive performance compared to supervised semantic occupancy prediction approaches. Furthermore, we conduct extensive analyses and ablation studies to offer insights into the design of the proposed framework. Our code is publicly available at https://github.com/dzcgaara/OVO.
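The two ingredients named in the abstract, feature distillation and pixel-voxel filtering, can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-based loss, the boolean `keep` mask standing in for the filtering step, and the function names `distillation_loss` and `classify_voxels` are all assumptions made for illustration. It only shows the general idea of pulling 3D voxel features toward 2D teacher features on trusted pairs, then labeling voxels by similarity to text embeddings of arbitrary class names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def distillation_loss(voxel_feats, pixel_feats, keep):
    """Mean (1 - cosine) over voxel-pixel pairs that pass the filter.

    voxel_feats: student features from the 3D occupancy network.
    pixel_feats: teacher features from the 2D segmentation model, assumed
                 to be already gathered at the pixel each voxel projects to.
    keep:        hypothetical boolean mask; False drops an unreliable pair.
    """
    terms = [
        1.0 - cosine(v, p)
        for v, p, k in zip(voxel_feats, pixel_feats, keep)
        if k
    ]
    return sum(terms) / len(terms) if terms else 0.0


def classify_voxels(voxel_feats, text_embeds):
    """Label each voxel with the class whose text embedding is most similar."""
    names = list(text_embeds)
    return [
        max(names, key=lambda n: cosine(v, text_embeds[n]))
        for v in voxel_feats
    ]


if __name__ == "__main__":
    voxels = [[1.0, 0.0], [0.0, 1.0]]
    pixels = [[2.0, 0.0], [0.0, -1.0]]
    # Filtering out the second (mismatched) pair leaves only the aligned one.
    print(distillation_loss(voxels, pixels, [True, False]))
    # Class names are arbitrary strings; embeddings here are made-up 2D toys.
    print(classify_voxels(voxels, {"chair": [1.0, 0.1], "table": [0.1, 1.0]}))
```

In a real pipeline the features would be high-dimensional tensors produced by the networks and the text embeddings would come from a vision-language text encoder; the lists above are placeholders chosen so the arithmetic is easy to follow.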
- Zhiyu Tan
- Zichao Dong
- Cheng Zhang
- Weikun Zhang
- Hang Ji
- Hao Li