
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA (2402.15933v1)

Published 24 Feb 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach employs a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines the 2D and 3D modalities and captures fine-grained correlations between them, allowing the two modalities to mutually augment each other. Integrating the mechanisms above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art results on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at $\href{https://github.com/matthewdm0816/BridgeQA}{\text{this URL}}$.
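The abstract describes two mechanisms: a question-conditional selection of 2D views and a two-branch Twin-Transformer that fuses 2D and 3D token streams through cross-modal attention. Below is a minimal PyTorch sketch of what such a pipeline could look like. It assumes CLIP-style question/view embeddings for the selection step and generic multi-head attention layers for the fusion; the names (`select_views`, `TwinFusionBlock`) and layer arrangement are illustrative assumptions, not the actual BridgeQA implementation from the linked repository.

```python
import torch
import torch.nn as nn


def select_views(question_emb: torch.Tensor, view_embs: torch.Tensor, k: int = 1):
    """Question-conditional view selection (sketch): score each candidate 2D view
    against the question embedding by cosine similarity and keep the top-k views.
    question_emb: (dim,), view_embs: (num_views, dim)."""
    question_emb = question_emb / question_emb.norm(dim=-1, keepdim=True)
    view_embs = view_embs / view_embs.norm(dim=-1, keepdim=True)
    scores = view_embs @ question_emb          # (num_views,) cosine similarities
    return scores.topk(k).indices              # indices of the most relevant views


class TwinFusionBlock(nn.Module):
    """One two-branch fusion layer (sketch): each modality attends to itself,
    then cross-attends to the other, so 2D and 3D tokens can augment each other."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_3d_from_2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_2d_from_3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_3d = nn.LayerNorm(dim)
        self.norm_2d = nn.LayerNorm(dim)

    def forward(self, feats_3d: torch.Tensor, feats_2d: torch.Tensor):
        # Intra-modal self-attention with residual connections.
        feats_3d = self.norm_3d(feats_3d + self.self_3d(feats_3d, feats_3d, feats_3d)[0])
        feats_2d = self.norm_2d(feats_2d + self.self_2d(feats_2d, feats_2d, feats_2d)[0])
        # Cross-modal attention: 3D tokens query 2D tokens, and vice versa.
        fused_3d = feats_3d + self.cross_3d_from_2d(feats_3d, feats_2d, feats_2d)[0]
        fused_2d = feats_2d + self.cross_2d_from_3d(feats_2d, feats_3d, feats_3d)[0]
        return fused_3d, fused_2d


if __name__ == "__main__":
    # Toy shapes: 64 object proposals (3D branch), 98 image patches from selected views (2D branch).
    block = TwinFusionBlock(dim=256)
    fused_3d, fused_2d = block(torch.randn(1, 64, 256), torch.randn(1, 98, 256))
    print(fused_3d.shape, fused_2d.shape)
```

Answer prediction would then read from the fused tokens (e.g., via pooling and a classifier head); that part is omitted here since the abstract does not specify it.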

References (21)
  1. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6077–6086.
  2. ScanQA: 3D question answering for spatial scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19129–19139.
  3. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 104–120. Springer.
  4. Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes. arXiv preprint arXiv:2306.02329.
  5. VQA-LOL: Visual question answering under the lens of logic. In European Conference on Computer Vision, 379–396. Springer.
  6. Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10984–10994.
  7. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
  8. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, 121–137. Springer.
  9. SQA3D: Situated question answering in 3D scenes. arXiv preprint arXiv:2210.07474.
  10. OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
  11. CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5606–5611.
  12. Deep Hough voting for 3D object detection in point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9277–9286.
  13. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  14. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556–2565.
  15. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5100–5111.
  16. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. arXiv preprint arXiv:2108.10904.
  17. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6281–6290.
  18. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5579–5588.
  19. Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding. arXiv:2305.10714.
  20. Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline. IEEE Transactions on Circuits and Systems for Video Technology.
  21. 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment. arXiv:2308.04352.
Authors (2)
  1. Wentao Mo (3 papers)
  2. Yang Liu (2253 papers)
Citations (3)