
Octopi: Object Property Reasoning with Large Tactile-Language Models (2405.02794v2)

Published 5 May 2024 in cs.RO

Abstract: Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply commonsense reasoning. We contribute a new dataset PhysiCLeAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi is able to effectively use intermediate physical property predictions to improve its performance on various tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.

Object Property Reasoning with Large Tactile-Language Models: An Expert Review

The paper "Octopi: Object Property Reasoning with Large Tactile-LLMs" introduces a novel approach to enhancing robot manipulation capabilities by bridging the gap between tactile perception and language-based common-sense reasoning. This paper addresses the limitations of traditional modalities—vision and language—by integrating tactile information, which provides critical details about object properties that cannot be discerned through vision alone.

Contributions and Methodology

The core contribution of this research lies in the development of Octopi, a system that leverages a combination of tactile sensors and large vision-language models (LVLMs) for object property reasoning. The paper introduces the PhysiCLeAR dataset, which includes tactile video data annotated with physical properties such as hardness, roughness, and bumpiness. These annotations serve as the foundation for training Octopi to process and reason about tactile signals.
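
To make the dataset description concrete, the sketch below shows one way a PhysiCLeAR-style sample and its property annotations might be represented and turned into a training prompt; the field names, rating scales, and prompt wording are illustrative assumptions rather than the dataset's actual schema.

```python
# Hypothetical sketch of a PhysiCLeAR-style annotated sample; the field names,
# rating scales, and prompt wording are illustrative, not the dataset's schema.
from dataclasses import dataclass

@dataclass
class TactileSample:
    object_name: str   # e.g. "tennis ball"
    video_path: str    # path to the GelSight tactile video clip
    hardness: int      # ordinal rating, e.g. 0 (soft) to 2 (hard)
    roughness: int     # ordinal rating
    bumpiness: int     # ordinal rating
    description: str   # free-text physical description used for training

def property_prompt(sample: TactileSample) -> str:
    """Build a property-description training prompt from the annotations."""
    return (
        "Describe the object in the tactile video. "
        f"Target properties: hardness={sample.hardness}, "
        f"roughness={sample.roughness}, bumpiness={sample.bumpiness}."
    )
```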

For tactile representation learning, Octopi uses a GelSight tactile sensor to obtain high-resolution tactile images, which are encoded with a CLIP-based visual encoder so that tactile and language representations can be aligned. A LLaMA-family language model then performs higher-order reasoning over both language instructions and the encoded tactile inputs.
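
The following is a minimal sketch of such a tactile-to-language bridge, assuming a CLIP vision encoder and a learned linear projection into the language model's embedding space; the checkpoint name, pooling strategy, and hidden size are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of a tactile-to-language bridge, assuming a CLIP vision
# encoder and a learned linear projection into the LLM embedding space.
# The checkpoint name, pooling strategy, and hidden size are assumptions.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class TactileProjector(nn.Module):
    def __init__(self, clip_name: str = "openai/clip-vit-large-patch14",
                 llm_hidden: int = 4096):
        super().__init__()
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)
        # Map CLIP patch features into the language model's embedding space.
        self.proj = nn.Linear(self.encoder.config.hidden_size, llm_hidden)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 224, 224) preprocessed GelSight images
        feats = self.encoder(pixel_values=frames).last_hidden_state  # (F, P, D)
        tokens = self.proj(feats)      # (F, P, llm_hidden)
        # Average over frames to get a fixed-length "tactile token" sequence
        # that can be prepended to the LLM's text embeddings.
        return tokens.mean(dim=0)      # (P, llm_hidden)
```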

Experimental Results

The paper presents detailed experimental results showing Octopi's effectiveness. The system improves on physical reasoning tasks in both trained and zero-shot settings, with clear accuracy gains over baseline methods on object property description, property comparison, and scenario reasoning tasks.
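
For concreteness, per-task accuracy over these task types could be tallied along the following lines; the task labels and result format are illustrative assumptions, not the paper's evaluation code.

```python
# Simple sketch of per-task accuracy tallying for the three task types named
# above; the task labels and result format are assumptions for illustration.
from collections import defaultdict

def task_accuracy(results):
    """results: iterable of (task_name, predicted_label, gold_label) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, pred, gold in results:
        total[task] += 1
        correct[task] += int(pred == gold)
    return {task: correct[task] / total[task] for task in total}

print(task_accuracy([
    ("property_comparison", "harder", "harder"),
    ("property_comparison", "softer", "harder"),
    ("scenario_reasoning", "A", "A"),
]))  # {'property_comparison': 0.5, 'scenario_reasoning': 1.0}
```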

Moreover, the paper highlights Octopi's successful deployment in a real robotic system for an avocado ripeness classification task. This practical application underscores the model's ability to reason about real-world tactile properties and improve decision-making in scenarios where visual assessments are insufficient.
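
As an illustration of how intermediate property predictions can drive such a downstream decision, a minimal sketch might map a predicted hardness score to a ripeness verdict; the 0-to-1 score scale and threshold below are hypothetical, not values taken from the paper.

```python
# Illustrative mapping from an intermediate hardness prediction to a ripeness
# decision; the 0-to-1 score scale and threshold are hypothetical, not values
# taken from the paper.
def classify_ripeness(predicted_hardness: float, soft_threshold: float = 0.4) -> str:
    """Treat a sufficiently soft prediction (0 = very soft, 1 = very hard) as ripe."""
    return "ripe" if predicted_hardness < soft_threshold else "unripe"

print(classify_ripeness(0.25))  # -> ripe
```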

Implications and Future Directions

This research marks a step toward more autonomous and intelligent robotic systems. By equipping robots with tactile reasoning capabilities, Octopi opens new avenues for applications in manufacturing, healthcare, and service robotics, where understanding material properties through touch is valuable.

Future developments could focus on expanding the dataset to incorporate more diverse tactile interactions and further refining the tactile-language integration. The exploration of additional sensors and sensory modalities could also enhance the system's ability to capture and utilize complex object properties, thereby broadening the scope of tasks that robots can perform autonomously.

Conclusion

The integration of tactile sensing with LLMs represents a critical advancement in embodied AI. This paper provides a robust framework for leveraging tactile data in conjunction with LVLMs, enhancing a robot's ability to interact with and reason about the physical world. Through the development of Octopi and the PhysiCLeAR dataset, the paper sets a foundation for future research in tactile-guided reasoning, with promising implications for the evolution of robotic capabilities.

Authors (5)
  1. Samson Yu (11 papers)
  2. Kelvin Lin (7 papers)
  3. Anxing Xiao (14 papers)
  4. Jiafei Duan (26 papers)
  5. Harold Soh (54 papers)
Citations (12)