
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding (2310.06214v4)

Published 10 Oct 2023 in cs.CV

Abstract: 3D visual grounding is the task of localizing objects in 3D scenes conditioned on natural-language utterances. Most existing methods devote the referring head to localizing the referred object directly, causing failure in complex scenarios; moreover, they do not illustrate how and why the network reaches its final decision. In this paper, we address the question: can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system? To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task by first predicting a chain of anchors and then the final target. Interpretability not only improves the overall performance but also helps us identify failure cases. Following the chain-of-thoughts approach enables us to decompose the referring task into interpretable intermediate steps, boosting performance and making our framework extremely data-efficient. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient: on the Sr3D dataset, when trained on only 10% of the data, we match the SOTA performance obtained by training on the entire dataset. The code is available at https://eslambakr.github.io/cot3dref.github.io/.
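The core idea in the abstract — decoding a chain of anchor objects before committing to the final target, rather than predicting the target in one shot — can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name `decode_anchor_chain`, the greedy scoring, and the example scene are all assumptions for exposition; in CoT3DRef the per-step scores would come from a learned Seq2Seq referring head over object and utterance features.

```python
def decode_anchor_chain(scores_per_step):
    """Greedily pick one object per decoding step.

    scores_per_step: list of dicts mapping object id -> score,
    one dict per step; the last step corresponds to the target.
    Returns (anchors, target), exposing the intermediate anchors
    that explain how the final decision was reached.
    """
    chain = []
    for scores in scores_per_step:
        # pick the highest-scoring object not already in the chain
        best = max(
            (oid for oid in scores if oid not in chain),
            key=lambda oid: scores[oid],
        )
        chain.append(best)
    return chain[:-1], chain[-1]

# Toy scene for an utterance like "the chair near the table by the window":
# the decoder first grounds the anchors ("window", then "table"),
# and only then the target ("chair").
steps = [
    {"window": 0.9, "table": 0.3, "chair": 0.10},  # anchor step 1
    {"window": 0.2, "table": 0.8, "chair": 0.30},  # anchor step 2
    {"window": 0.1, "table": 0.2, "chair": 0.95},  # target step
]
anchors, target = decode_anchor_chain(steps)
# anchors == ["window", "table"], target == "chair"
```

Because the anchors are explicit predictions, a failure can be traced to the step where the chain first goes wrong — the interpretability benefit the abstract describes.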

Citations (7)