FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection (2312.14465v1)
Abstract: The superior performance of pre-trained foundation models across visual tasks underscores their potential to enhance 2D models' open-vocabulary ability. Existing methods explore analogous applications in 3D space; however, most center on knowledge extraction from a single foundation model, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from multiple foundation models can improve knowledge transfer from 2D pre-trained vision-language models to 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of a 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary detection without being constrained by the original 3D datasets. Specifically, to learn open-vocabulary 3D localization, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion, and cross-modal discriminative models such as CLIP. Experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model, achieving state-of-the-art performance on open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
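The recognition side of such a pipeline reduces to aligning features of 3D region proposals with CLIP-space embeddings of open-vocabulary class prompts (e.g., GPT-3-generated class descriptions, or CLIP image embeddings of Stable Diffusion renderings). Below is a minimal PyTorch sketch of an InfoNCE-style cross-modal alignment loss in that spirit. This is an illustration under our own assumptions, not the paper's exact objective; all names (`infonce_align`, `region_feats`, `clip_text_embeds`) are hypothetical.

```python
# Hypothetical sketch of cross-modal contrastive alignment between 3D region
# features and frozen CLIP text embeddings. Not the paper's exact loss.
import torch
import torch.nn.functional as F

def infonce_align(region_feats: torch.Tensor,
                  clip_text_embeds: torch.Tensor,
                  labels: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Pull each 3D region feature toward the CLIP embedding of its
    (pseudo-)class and push it away from the other class embeddings.

    region_feats:     (N, D) features of detected 3D boxes, projected by a
                      learned head into CLIP's embedding dimension.
    clip_text_embeds: (C, D) frozen CLIP embeddings of class prompts,
                      one per open-vocabulary class.
    labels:           (N,) pseudo-class index assigned to each box.
    """
    region_feats = F.normalize(region_feats, dim=-1)
    clip_text_embeds = F.normalize(clip_text_embeds, dim=-1)
    # Cosine-similarity logits over all classes, softened by a temperature.
    logits = region_feats @ clip_text_embeds.t() / temperature  # (N, C)
    # Cross-entropy over classes is the standard InfoNCE form here.
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for real features.
feats = torch.randn(8, 512, requires_grad=True)   # 8 boxes, CLIP ViT-B/32 dim
text = torch.randn(20, 512)                       # 20 open-vocabulary classes
labels = torch.randint(0, 20, (8,))
loss = infonce_align(feats, text, labels)
loss.backward()
```

In the full method, frozen CLIP encoders would plausibly supply `clip_text_embeds` (and, analogously, image embeddings of Stable Diffusion samples), while 2D boxes from Grounded-SAM lifted into the point cloud could supply the pseudo-labels; the paper's actual heads and supervision may differ.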
- Bridging the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Information Processing Systems, 35: 33781–33794.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.
- ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5828–5839.
- Structural Knowledge Distillation for Object Detection. Advances in Neural Information Processing Systems, 35: 3858–3870.
- CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention. AAAI 2023.
- Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615.
- LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5356–5364.
- When SAM meets medical images: An investigation of the Segment Anything Model (SAM) on multi-phase liver tumor segmentation. arXiv preprint arXiv:2304.08506.
- Segment Anything. arXiv preprint arXiv:2304.02643.
- Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7061–7070.
- Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
- SAMM (Segment Any Medical Model): A 3D Slicer integration to SAM. arXiv preprint arXiv:2304.05622.
- Group-free 3D object detection via transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2949–2958.
- Open-vocabulary 3D detection via image-level class and debiased cross-modal contrastive learning. arXiv preprint arXiv:2207.01987.
- Open-vocabulary point-cloud object detection without 3D annotation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1190–1199.
- An end-to-end transformer model for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2906–2917.
- Unified Text Structuralization with Instruction-tuned Language Models. arXiv preprint arXiv:2303.14956.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- What does a platypus look like? Generating customized prompts for zero-shot image classification. arXiv preprint arXiv:2209.03320.
- Deep Hough voting for 3D object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9277–9286.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
- Improved visual-semantic alignment for zero-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 11932–11939.
- Zero-shot object detection: Joint recognition and localization of novel concepts. International Journal of Computer Vision, 128: 2979–2999.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3): 211–252.
- SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 567–576.
- GPT-NER: Named Entity Recognition via Large Language Models. arXiv preprint arXiv:2304.10428.
- Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models. arXiv preprint arXiv:2106.04180.
- Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14393–14402.
- PointCLIP: Point cloud understanding by CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8552–8562.
- Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners. CVPR 2023.
- Personalize Segment Anything Model with one shot. arXiv preprint arXiv:2305.03048.
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders. CVPR 2023.
- H3DNet: 3D object detection using hybrid geometric primitives. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII, 311–329. Springer.
- Detecting twenty-thousand classes using image-level supervision. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IX, 350–368. Springer.
- PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning. ICCV 2023.
- Dongmei Zhang
- Chang Li
- Ray Zhang
- Shenghao Xie
- Wei Xue
- Xiaodong Xie
- Shanghang Zhang