Vision-Language Model-based Physical Reasoning for Robot Liquid Perception (2404.06904v1)

Published 10 Apr 2024 in cs.RO

Abstract: There is growing interest in applying LLMs to robotic tasks, owing to their remarkable reasoning ability and the extensive knowledge learned from vast training corpora. Grounding LLMs in the physical world remains an open challenge, as they can only process textual input. Recent advancements in large vision-language models (LVLMs) have enabled a more comprehensive understanding of the physical world by incorporating visual input, which provides richer contextual information than language alone. In this work, we propose a novel paradigm that leverages GPT-4V(ision), the state-of-the-art LVLM by OpenAI, to enable embodied agents to perceive liquid objects via image-based environmental feedback. Specifically, we exploit the physical understanding of GPT-4V to interpret the visual representation (e.g., a time-series plot) of non-visual feedback (e.g., F/T sensor data), indirectly enabling multimodal perception beyond vision and language by using images as proxies. We evaluate our method on 10 common household liquids in containers of various geometries and materials. Without any training or fine-tuning, we demonstrate that our method enables the robot to indirectly perceive the physical response of liquids and estimate their viscosity. We also show that, by jointly reasoning over the visual and physical attributes learned through interactions, our method can recognize liquid objects in the absence of strong visual cues (e.g., container labels with legible text or symbols), increasing accuracy from 69.0% (achieved by the best-performing vision-only variant) to 86.0%.
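The core mechanism described in the abstract, rendering non-visual F/T sensor feedback as a time-series plot and letting an LVLM reason over that image, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic sensor trace, the prompt wording, and the gpt-4o model name are placeholders (the paper uses GPT-4V(ision)).

```python
# Minimal sketch of the "image-as-proxy" idea: render F/T readings as a plot,
# then ask a vision-capable model to reason about the liquid's physical response.
# Illustrative only; data, prompt, and model name are assumptions.
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI


def plot_ft_series(timestamps, forces):
    """Render force/torque readings as a PNG and return it base64-encoded."""
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(timestamps, forces)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("force along z (N)")
    ax.set_title("F/T response while shaking the container")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


# Synthetic stand-in for a real F/T stream (a damped oscillation).
t = np.linspace(0.0, 2.0, 200)
fz = np.exp(-1.5 * t) * np.sin(12.0 * t)
image_b64 = plot_ft_series(t, fz)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model; the paper uses GPT-4V
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("This plot shows the force measured by a wrist-mounted "
                      "F/T sensor while a robot shakes a closed container of "
                      "liquid. Based on how quickly the oscillation decays, "
                      "is the liquid more likely low- or high-viscosity? "
                      "Explain briefly.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

This sketch covers only the physical-reasoning half; in the paper, the model's interpretation of such plots is combined with visual attributes observed from the scene to recognize the liquid.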

Authors (3)
  1. Wenqiang Lai (2 papers)
  2. Yuan Gao (335 papers)
  3. Tin Lun Lam (36 papers)
Citations (3)