
A Touch, Vision, and Language Dataset for Multimodal Alignment (2402.13232v1)

Published 20 Feb 2024 in cs.CV and cs.RO

Abstract: Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative LLM. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-LLMs (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.

Authors (10)
  1. Letian Fu (13 papers)
  2. Gaurav Datta (5 papers)
  3. Huang Huang (64 papers)
  4. William Chung-Ho Panitch (4 papers)
  5. Jaimyn Drake (5 papers)
  6. Joseph Ortiz (15 papers)
  7. Mustafa Mukadam (43 papers)
  8. Mike Lambeta (14 papers)
  9. Roberto Calandra (60 papers)
  10. Ken Goldberg (162 papers)
Citations (17)

Summary

Enhancing Multimodal AI: A Dataset for Touch, Vision, and Language Alignment

Introduction to Multimodal AI and Tactile Sensing

AI research has made significant progress toward understanding and integrating multimodal sensory inputs, mimicking the human ability to perceive, reason about, and interact with the environment. Multimodal AI combines multiple types of data, such as visual (images), auditory (sound), and linguistic (text) inputs, to build systems that process and interpret the world in a manner closer to human cognition. One sensory modality that remains underrepresented in AI research, however, is touch. Tactile sensing is crucial for everyday human tasks: it conveys texture, hardness, and shape, properties that are invaluable for nuanced interaction with our surroundings.

The incorporation of touch into AI systems promises significant advancements in robotics and human-computer interactions, creating machines capable of more sensitive and intelligent responses to their environment. Despite its potential, the challenge lies in capturing touch sensations and aligning them with visual and linguistic data to construct comprehensive multimodal datasets. This paper introduces a novel dataset designed to bridge this gap by providing a rich collection of touch, vision, and language data for the development and training of AI models.

The Touch-Vision-Language (TVL) Dataset

The TVL dataset is designed to foster advances in touch perception within AI. It comprises 44,000 in-the-wild vision-touch pairs with English-language labels: roughly 10% are annotated by humans, and the remaining 90% carry textual pseudo-labels generated by GPT-4V. Pairing tactile readings with visual observations and text enables a deeper investigation of how these modalities can be integrated.
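
As a concrete (and purely illustrative) picture of what such a paired dataset looks like, the sketch below loads vision-touch-text triples with PyTorch. The file layout, field names, and CSV index are assumptions for exposition, not the released TVL format.

```python
# Minimal sketch of loading vision-touch-text triples (hypothetical layout).
import csv
from dataclasses import dataclass
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


@dataclass
class TVLSample:
    vision: Image.Image   # RGB close-up of the touched surface
    tactile: Image.Image  # DIGIT sensor image captured at the same moment
    caption: str          # human label or GPT-4V pseudo-label


class TVLPairs(Dataset):
    """Reads an index CSV with columns vision_path, tactile_path, caption (assumed)."""

    def __init__(self, root: str, index_csv: str = "index.csv"):
        self.root = Path(root)
        with open(self.root / index_csv) as f:
            self.rows = list(csv.DictReader(f))

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, i: int) -> TVLSample:
        row = self.rows[i]
        return TVLSample(
            vision=Image.open(self.root / row["vision_path"]).convert("RGB"),
            tactile=Image.open(self.root / row["tactile_path"]).convert("RGB"),
            caption=row["caption"],
        )
```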

Data Collection and Challenges

The creation of the TVL dataset faced two primary challenges: acquiring tactile data alongside visual data, and the subjective nature of tactile descriptions. To address the first, the researchers built a custom handheld data collection device equipped with a DIGIT tactile sensor and a camera, which captures synchronized tactile and visual data as the sensor is pressed and slid across surfaces and objects. Human annotation is costly, labor-intensive, and prone to subjectivity, so it was applied to only a small portion of the dataset. To scale labeling to the remainder, the team used GPT-4V, an off-the-shelf vision-language model, to generate textual pseudo-labels for the vast majority of the data. This both enriched the dataset with linguistic annotations and demonstrated a practical use of large multimodal models for automating the labeling of tactile data.
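
To make the pseudo-labeling step concrete, here is a minimal sketch of asking a vision-language model for a short tactile description of a frame, assuming the OpenAI Python SDK's chat-completions interface; the prompt wording and model name are illustrative and not the authors' exact pipeline.

```python
# Sketch of generating a tactile pseudo-label for one visual frame.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def pseudo_label(image_path: str, model: str = "gpt-4-vision-preview") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        max_tokens=64,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "In a short phrase, describe how the surface in this "
                         "image would feel to the touch (texture, hardness)."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```

Pseudo-labels of this kind are cheaper but noisier than human annotations, which is the usual trade-off in semi-supervised labeling.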

Key Contributions and Findings

The analysis of the TVL dataset led to several critical insights:

  • Multimodal Model Training: Leveraging the TVL dataset, the researchers trained a vision-language-aligned tactile encoder, improving touch-vision-language alignment (+29% classification accuracy) over models trained on any pair of these modalities alone; a training sketch follows this list.
  • Benchmark Performance: On a new touch-vision understanding benchmark, the TVL model outperformed GPT-4V (+12%) and open-source vision-language models (+32%), illustrating the benefit of incorporating tactile data into multimodal models.
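
The alignment itself can be understood as CLIP-style contrastive training: a tactile encoder is optimized so its embeddings land near the embeddings of the paired image and caption produced by frozen vision and text towers. The sketch below illustrates that idea in PyTorch; the encoder, loss weighting, and temperature are placeholders, not the paper's exact configuration.

```python
# Sketch of contrastively aligning a tactile encoder to a frozen
# vision/language embedding space (illustrative, not the paper's code).
import torch
import torch.nn.functional as F
from torch import nn


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)   # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def alignment_loss(tactile_encoder: nn.Module,
                   tactile_images: torch.Tensor,      # batch of DIGIT readings
                   frozen_vision_emb: torch.Tensor,   # from a frozen image tower
                   frozen_text_emb: torch.Tensor      # from a frozen text tower
                   ) -> torch.Tensor:
    z_touch = tactile_encoder(tactile_images)
    # Pull each tactile embedding toward its paired image and caption embeddings.
    return info_nce(z_touch, frozen_vision_emb) + info_nce(z_touch, frozen_text_emb)
```

Once the tactile encoder shares this embedding space, open-vocabulary classification reduces to embedding candidate label phrases with the frozen text tower and picking the phrase whose embedding has the highest cosine similarity to the tactile embedding.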

Implications and Future Directions

The introduction of the TVL dataset marks a significant step toward the comprehensive integration of touch with vision and language in AI systems. The alignment of these modalities opens up new avenues for research in embodied AI, where agents can perceive and interact with the world with a depth of understanding that closely mirrors human capabilities.

Future research can leverage the TVL dataset to explore various aspects:

  • Robotic Manipulation: Enhanced touch sensation models could significantly improve robotics applications, particularly in delicate manipulation tasks where understanding the tactile properties of objects is paramount.
  • Virtual and Augmented Reality: Incorporating touch into VR and AR systems could lead to more immersive and interactive experiences, blurring the lines between digital and physical realities further.
  • Language and Sensory Processing: The dataset offers a unique opportunity to study the intersection of language and sensory perception, potentially uncovering new insights into how tactile experiences are described and understood linguistically.

Conclusion

By aligning touch with vision and language, the TVL dataset lays the groundwork for future research in multimodal AI. While the paper presents a significant advancement in this direction, the challenges of accurate tactile data collection, labeling, and interpretation remain open areas of research. As the field progresses, the integration of touch alongside other sensory modalities is poised to enrich AI's understanding of the world, leading to more nuanced and capable models that can interact with their environment in ways previously unimaginable.
