Tactile-Augmented Radiance Fields (2405.04534v1)
Abstract: We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given 3D position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common vision-based touch sensors are built on ordinary cameras and thus can be registered to images using methods from multi-view geometry, and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to register touch signals to a captured visual scene, and to train a conditional diffusion model that, provided with an RGB-D image rendered from a neural radiance field, generates its corresponding tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This dataset contains more touch samples than previous real-world datasets, and it provides spatially aligned visual signals for each captured touch signal. We demonstrate the accuracy of our cross-modal generative model and the utility of the captured visual-tactile data on several downstream tasks. Project page: https://dou-yiming.github.io/TaRF
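The abstract's first insight — that vision-based touch sensors are built on ordinary cameras and can therefore be registered to scene images with standard multi-view geometry — can be illustrated with a minimal pinhole-projection sketch. Once the sensor camera's pose is recovered (e.g., via structure-from-motion), a touch probe's 3D location projects into the shared image space like any other point. The intrinsics and touch location below are hypothetical values for illustration, not parameters from the paper.

```python
# Minimal pinhole-projection sketch of insight (i): a vision-based touch
# sensor is an ordinary camera, so a 3D touch location expressed in the
# sensor camera's frame projects to a pixel with the usual camera model.
# All numeric values below are hypothetical.

def project(point, fx, fy, cx, cy):
    """Project a 3D point (camera coordinates, Z forward) to a pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical intrinsics and a touch location in the sensor camera's frame.
u, v = project((0.1, -0.05, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(u, v)  # → 345.0 227.5
```

In the full system this projection runs in the opposite direction as well: registering the sensor camera against the scene's reconstruction places each sparse touch sample at a known 3D position, which is what lets the TaRF pair every tactile signal with a spatially aligned visual signal.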