A Touch, Vision, and Language Dataset for Multimodal Alignment (2402.13232v1)
Abstract: Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
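To make the alignment recipe concrete, below is a minimal PyTorch sketch of the kind of contrastive objective the abstract describes: a tactile encoder is trained so that its embeddings match frozen vision and text embeddings of the same touch-vision pair, where the text comes from human labels (~10% of pairs) or GPT-4V pseudo-labels (~90%). The encoder architecture, embedding dimension, and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed shared embedding dimension (e.g., a CLIP-sized space)


class TactileEncoder(nn.Module):
    """Toy tactile encoder; the paper trains a ViT-style encoder on tactile images."""

    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=8),  # 224x224 -> 28x28 patch grid
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global pooling over patches
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, tactile_img: torch.Tensor) -> torch.Tensor:
        return self.net(tactile_img)


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# One hypothetical training step. The vision and text embeddings stand in for
# outputs of frozen, pretrained CLIP towers; the text embeddings would encode
# either human labels or GPT-4V pseudo-labels of the tactile/visual scene.
tactile_encoder = TactileEncoder()
optimizer = torch.optim.AdamW(tactile_encoder.parameters(), lr=1e-4)

batch = 8
tactile = torch.randn(batch, 3, 224, 224)    # tactile sensor images (dummy data)
vision_emb = torch.randn(batch, EMBED_DIM)   # frozen image embeddings (dummy data)
text_emb = torch.randn(batch, EMBED_DIM)     # frozen text embeddings of (pseudo-)labels

optimizer.zero_grad()
tac_emb = tactile_encoder(tactile)
loss = info_nce(tac_emb, vision_emb) + info_nce(tac_emb, text_emb)
loss.backward()
optimizer.step()
print(f"toy alignment loss: {loss.item():.3f}")
```

Once the tactile encoder shares the vision-language embedding space, open-vocabulary classification reduces to nearest-neighbor matching between a tactile embedding and text embeddings of candidate labels, and the encoder can be plugged into an LLM for touch-conditioned text generation.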
Authors: Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg