LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent (2309.12311v1)

Published 21 Sep 2023 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract: 3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, LLM-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website https://chat-with-nerf.github.io/ .

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with LLM as an Agent

The paper "LLM-Grounder: Open-Vocabulary 3D Visual Grounding with LLM as an Agent" introduces a novel approach addressing the zero-shot open-vocabulary 3D visual grounding problem by leveraging LLMs like GPT-4. This methodology integrates the powerful language comprehension and reasoning capabilities of LLMs with the visual recognition abilities of CLIP-based models, such as OpenScene and LERF.

The core objective of 3D visual grounding is to locate objects in a 3D scene from natural language queries. This task is pivotal for household robots, enabling them to perform complex navigation, manipulation, and information-retrieval tasks in dynamic environments. Traditional methods typically require extensive labeled datasets or struggle with nuanced language queries, making them inadequate in zero-shot, open-vocabulary settings.

Methodology

LLM-Grounder seeks to overcome these limitations by employing a three-step process managed by an LLM agent (a minimal code sketch of this loop appears after the list):

  1. Query Decomposition: The LLM breaks down complex natural language queries into semantic components. This involves parsing the input into simpler constituent parts that describe object categories, attributes, landmarks, and spatial relations.
  2. Tool Orchestration and Interaction: Using visual grounding tools such as OpenScene and LERF, the LLM directs the search for candidate objects in the 3D scene. These CLIP-based tools propose candidate bounding boxes for each component, but they tend to treat text input as a "bag of words" and ignore its semantic structure. LLM-Grounder compensates by having the LLM issue only simple sub-queries and orchestrate the tools' calls.
  3. Spatial and Commonsense Reasoning: The LLM evaluates the proposed candidates using spatial and commonsense knowledge to make final grounding decisions. The agent can reason about spatial relationships and assess feedback from the visual grounders to determine the most contextually appropriate candidates.
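
To make the loop concrete, here is a minimal sketch in Python. It illustrates the agent pattern described above rather than the authors' implementation: the helper names (call_llm, ground, Candidate) and the prompt wording are hypothetical stand-ins for a GPT-4-style chat call and a CLIP-based grounder such as OpenScene or LERF.

```python
import json
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str    # noun phrase the grounder matched, e.g. "black office chair"
    bbox: tuple   # (cx, cy, cz, dx, dy, dz): box center and extents
    score: float  # grounder confidence

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call to an LLM (e.g. GPT-4)."""
    raise NotImplementedError

def ground(noun_phrase: str) -> list[Candidate]:
    """Hypothetical stand-in for a CLIP-based grounder (OpenScene or LERF)
    queried with one simple noun phrase."""
    raise NotImplementedError

def llm_grounder(query: str) -> Candidate:
    # 1. Query decomposition: target noun phrase first, then landmarks.
    phrases = json.loads(call_llm(
        "Decompose this 3D grounding query into a JSON list of simple noun "
        f"phrases, target first, then landmarks: {query!r}"))

    # 2. Tool orchestration: ground each sub-query separately, so the
    #    grounder never sees the full "bag of words" sentence.
    candidates = {p: ground(p) for p in phrases}

    # 3. Spatial/commonsense reasoning: hand the boxes back to the LLM and
    #    let it pick the target candidate that satisfies the relations.
    boxes = {p: [c.bbox for c in cs] for p, cs in candidates.items()}
    idx = int(call_llm(
        f"Candidate boxes per phrase: {json.dumps(boxes)}. Reply with only the "
        f"index of the target box that best satisfies: {query!r}"))
    return candidates[phrases[0]][idx]
```

The sketch is single-pass; as step 3 notes, the actual agent also assesses feedback from the visual grounders before committing to a final decision.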

Experimental Results

The authors evaluated their framework on the ScanRefer benchmark, a standard dataset for 3D visual grounding that pairs objects in 3D scenes with detailed natural language descriptions. The reported metrics are Acc@0.25 and Acc@0.5: the proportion of predictions whose 3D bounding box overlaps the ground-truth box with an IoU of at least 0.25 or 0.5, respectively.
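
As a rough illustration of how such thresholded accuracies are computed (not the benchmark's official evaluation code), consider axis-aligned boxes:

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes, each (xmin, ymin, zmin, xmax, ymax, zmax)."""
    def vol(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        inter *= overlap
    return inter / (vol(a) + vol(b) - inter)

def acc_at(preds, gts, thresh):
    """Fraction of predicted boxes whose IoU with ground truth meets thresh."""
    return sum(iou_3d(p, g) >= thresh for p, g in zip(preds, gts)) / len(gts)

# Acc@0.25 = acc_at(preds, gts, 0.25); Acc@0.5 = acc_at(preds, gts, 0.5)
```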

The results demonstrate that LLM-Grounder achieves state-of-the-art zero-shot grounding accuracy. Specifically, when integrated with LERF it improved grounding accuracy on ScanRefer from 4.4% to 6.9% (Acc@0.25) and from 0.3% to 1.6% (Acc@0.5). When used with OpenScene, LLM-Grounder increased accuracy from 13.0% to 17.1% (Acc@0.25), with smaller improvements at the higher IoU threshold.

An important observation from the ablation studies is that the LLM agent's effectiveness increases with the complexity of the language query. However, its performance gains diminish in scenes with high visual complexity where instance disambiguation becomes challenging. The authors attribute this to the limitations of current LLMs in interpreting intricate visual cues.

Implications

From a practical standpoint, LLM-Grounder significantly extends the applicability of 3D visual grounding in real-world scenarios, particularly for robotic systems operating in diverse environments. By enabling zero-shot generalization, this approach circumvents the need for extensive labeled datasets, which are often costly and time-consuming to procure.

Theoretically, this framework illustrates the synergistic potential of combining advanced LLMs with visual grounding tools, enriching both domains. It highlights the value of using LLMs not merely as passive text processors but as active reasoning agents capable of complex task decomposition and tool orchestration.

Future Directions

Future research can explore enhancing the visual recognition capabilities to support more precise bounding box predictions, thus improving performance on higher IoU thresholds. Additionally, incorporating more sophisticated feedback loops and interactive learning paradigms between the LLM agent and visual tools could further refine spatial reasoning and instance disambiguation. Investigating the deployment of such systems in real-time robotics applications would also be a promising avenue, despite challenges related to computational cost and latency.

In conclusion, the paper "LLM-Grounder" presents a compelling strategy for open-vocabulary 3D visual grounding by effectively integrating LLMs with existing visual grounding techniques, setting a new standard for the field and opening multiple pathways for future advancements in AI-driven robotic systems.

Authors (7)
  1. Jianing Yang
  2. Xuweiyi Chen
  3. Shengyi Qian
  4. Nikhil Madaan
  5. Madhavan Iyengar
  6. David F. Fouhey
  7. Joyce Chai