Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset (2403.09813v3)

Published 14 Mar 2024 in cs.CV and cs.RO

Abstract: Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research related to touch primarily focuses on the visual and tactile modalities, with limited exploration of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
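
The abstract states that STLV-Align reaches cross-modal semantic alignment while updating only about 1% of parameters. As a rough illustration (not the authors' implementation), the sketch below shows one common way such parameter-efficient alignment is realized: a frozen projection layer augmented with LoRA-style low-rank adapters, trained with a CLIP-style contrastive loss between touch and text embeddings. All module names, dimensions, and the choice of loss are illustrative assumptions.

```python
# Hedged sketch, not the paper's code: frozen projections with LoRA-style
# low-rank adapters so that only a small fraction of parameters is trainable,
# aligned with a CLIP-style symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (x @ A @ B)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale


def contrastive_alignment_loss(touch_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss pulling paired touch/text embeddings together."""
    touch_emb = F.normalize(touch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = touch_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy run with random features standing in for frozen encoder outputs.
    touch_proj = LoRALinear(nn.Linear(512, 256))
    text_proj = LoRALinear(nn.Linear(512, 256))
    touch_feat, text_feat = torch.randn(4, 512), torch.randn(4, 512)
    loss = contrastive_alignment_loss(touch_proj(touch_feat), text_proj(text_feat))
    trainable = sum(p.numel() for m in (touch_proj, text_proj)
                    for p in m.parameters() if p.requires_grad)
    total = sum(p.numel() for m in (touch_proj, text_proj) for p in m.parameters())
    print(f"loss={loss.item():.3f}, trainable fraction={trainable / total:.2%}")
```

In practice the trainable fraction depends on the adapter rank and on how many layers receive adapters; the toy run above only demonstrates that the frozen base weights are excluded from the gradient updates.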
