Binding Touch to Everything: Learning Unified Multimodal Tactile Representations (2401.18084v1)
Abstract: The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens, allowing the model to learn from a set of heterogeneous tactile sensors, all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting, from robot grasping prediction to touch image question answering. To the best of our knowledge, UniTouch is the first to demonstrate such capabilities. Project page: https://cfeng16.github.io/UniTouch/
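The abstract's core recipe is to train a tactile encoder whose embeddings are contrastively aligned with a frozen, pretrained image embedding space, while learnable sensor-specific tokens let one model absorb data from heterogeneous touch sensors. Below is a minimal sketch of that idea, not the authors' released code: the module names, the additive token injection, the embedding dimension, and the symmetric InfoNCE-style loss with its temperature are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TactileEncoder(nn.Module):
    """Hypothetical tactile encoder aligned to a frozen image embedding space.

    Sketch only: the backbone choice, embed_dim, and the way sensor tokens
    are injected (simple addition here) are assumptions for illustration.
    """
    def __init__(self, backbone: nn.Module, embed_dim: int = 512,
                 num_sensors: int = 4):
        super().__init__()
        self.backbone = backbone  # e.g., a ViT over tactile images -> (B, embed_dim)
        # One learnable token per tactile sensor type (GelSight, DIGIT, ...),
        # letting a single model train on heterogeneous sensors at once.
        self.sensor_tokens = nn.Parameter(torch.randn(num_sensors, embed_dim))
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, touch: torch.Tensor, sensor_id: torch.Tensor):
        feat = self.backbone(touch)                  # (B, embed_dim)
        feat = feat + self.sensor_tokens[sensor_id]  # inject sensor identity
        return F.normalize(self.proj(feat), dim=-1)  # unit-norm embeddings

def alignment_loss(touch_emb: torch.Tensor, image_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss pulling each touch embedding toward its
    paired (frozen, pretrained) image embedding within the batch."""
    logits = touch_emb @ image_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In this setup only the tactile side would be optimized while the paired image encoder stays frozen, so the touch embeddings inherit the image space's existing associations with language and sound; that is what would enable the zero-shot touch tasks the abstract describes.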