
Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization (2404.11064v3)

Published 17 Apr 2024 in cs.CV and cs.AI

Abstract: 3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in [email protected] in MLE training and improves upon the SOTA 3DVG method by 3.16% in [email protected]. The codes are at https://github.com/Leon1207/3DGCTR.
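The abstract's core idea can be illustrated with a toy sketch: a single grounding backbone consumes either a referring expression (3DVG) or a generic caption prompt (3DDC), and a lightweight caption head reuses the localized query features in the captioning case. This is a minimal, hypothetical illustration of the data flow only; all function names, shapes, and the stand-in encoders below are assumptions and not the paper's actual implementation.

```python
# Hypothetical sketch of 3DGCTR-style prompt-based unification (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt, dim=16):
    """Stand-in text encoder: hash words into a fixed-size bag-of-words embedding."""
    vec = np.zeros(dim)
    for word in prompt.split():
        vec[hash(word) % dim] += 1.0
    return vec

def grounding_model(point_cloud, prompt_emb, num_queries=4):
    """Stand-in 3DVG backbone: fuse scene points with the text prompt
    and emit per-query features plus box predictions (cx, cy, cz, w, h, d)."""
    scene = point_cloud.mean(axis=0)                      # crude global scene feature
    queries = rng.standard_normal((num_queries, prompt_emb.size))
    fused = queries + prompt_emb + scene[: prompt_emb.size]
    boxes = rng.uniform(size=(num_queries, 6))
    return fused, boxes

def caption_head(query_feats, vocab=("a", "chair", "near", "the", "table")):
    """Lightweight caption head: map each localized query feature to a word."""
    logits = query_feats @ rng.standard_normal((query_feats.shape[1], len(vocab)))
    return [vocab[i] for i in logits.argmax(axis=1)]

points = rng.uniform(size=(1024, 16))

# 3DVG mode: a referring expression localizes one object via the same backbone.
feats, boxes = grounding_model(points, encode_text("the chair near the table"))
grounded_box = boxes[feats.sum(axis=1).argmax()]

# 3DDC mode: a generic caption prompt reuses the localizer, and the caption
# head describes every detected object, enabling joint multi-task training.
feats, boxes = grounding_model(points, encode_text("describe each object"))
captions = caption_head(feats)
```

The point of the sketch is the shared interface: both tasks flow through one prompt-conditioned localizer, so only the prompt and the output head differ between grounding and dense captioning.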

Authors (9)
  1. Yongdong Luo
  2. Haojia Lin
  3. Xiawu Zheng
  4. Yigeng Jiang
  5. Fei Chao
  6. Jie Hu
  7. Guannan Jiang
  8. Songan Zhang
  9. Rongrong Ji