FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection (2312.14465v1)

Published 22 Dec 2023 in cs.CV

Abstract: The superior performance of pre-trained foundation models on various visual tasks underscores their potential to enhance the open-vocabulary ability of 2D models. Existing methods explore analogous applications in the 3D space; however, most of them center on knowledge extraction from a single foundation model, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained vision-language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D models by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary detection without constraints from the original 3D datasets. Specifically, to learn open-vocabulary 3D localization, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion, and of cross-modal discriminative models such as CLIP. Experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and achieves state-of-the-art performance on open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
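The blending idea in the abstract can be made concrete: text-side class knowledge (e.g., GPT-3-generated class descriptions encoded by CLIP's text encoder) and image-side knowledge (e.g., CLIP image embeddings) are merged into per-class embeddings, and 3D proposal features are aligned to them with an InfoNCE-style contrastive loss, with 2D localization outputs (e.g., from Grounded-SAM) supplying pseudo-labels. The sketch below illustrates only this general pattern; the function names, the mixing weight `alpha`, and the random stand-in tensors are assumptions for illustration, not the released FM-OV3D implementation.

```python
# Minimal sketch (not the authors' code) of cross-modal knowledge blending:
# per-class text and image embeddings are mixed, and 3D box features are
# pulled toward the blended embedding of their pseudo-label class.
import torch
import torch.nn.functional as F

def blend_class_embeddings(text_emb, image_emb, alpha=0.5):
    """Blend text- and image-side class embeddings; `alpha` is a
    hypothetical mixing weight. L2-normalize so cosine similarity
    reduces to a dot product."""
    blended = alpha * text_emb + (1.0 - alpha) * image_emb
    return F.normalize(blended, dim=-1)

def contrastive_alignment_loss(box_feats, class_emb, labels, temperature=0.07):
    """InfoNCE-style loss: each 3D box feature is attracted to the blended
    embedding of its (pseudo-)label class and repelled from the others."""
    box_feats = F.normalize(box_feats, dim=-1)
    logits = box_feats @ class_emb.t() / temperature  # (num_boxes, num_classes)
    return F.cross_entropy(logits, labels)

# Toy usage: 8 box proposals, 20 open-vocabulary classes, 512-d embeddings.
num_boxes, num_classes, dim = 8, 20, 512
text_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)   # stand-in for CLIP text features
image_emb = F.normalize(torch.randn(num_classes, dim), dim=-1)  # stand-in for CLIP image features
class_emb = blend_class_embeddings(text_emb, image_emb)
box_feats = torch.randn(num_boxes, dim)                 # features from a 3D detector head
labels = torch.randint(0, num_classes, (num_boxes,))    # pseudo-labels, e.g. from Grounded-SAM
loss = contrastive_alignment_loss(box_feats, class_emb, labels)
print(loss.item())
```

In practice the stand-in tensors would come from frozen foundation-model encoders and a 3D detection backbone; only the alignment step is sketched here.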

Authors (7)
  1. Dongmei Zhang
  2. Chang Li
  3. Ray Zhang
  4. Shenghao Xie
  5. Wei Xue
  6. Xiaodong Xie
  7. Shanghang Zhang
Citations (10)