
A Case Study on Visual-Audio-Tactile Cross-Modal Retrieval (2407.20709v1)

Published 30 Jul 2024 in cs.RO

Abstract: Cross-Modal Retrieval (CMR), which retrieves relevant items from one modality (e.g., audio) given a query in another modality (e.g., visual), has advanced significantly in recent years. This capability is crucial for robots to integrate and interpret information across diverse sensory inputs. However, the retrieval space in existing robotic CMR approaches often consists of only one modality, which limits the robot's performance. In this paper, we propose a novel CMR model, named VAT-CMR, that incorporates three modalities, i.e., visual, audio and tactile, for enhanced multi-modal object retrieval. In this model, multi-modal representations are first fused to provide a holistic view of object features. To mitigate the semantic gaps between representations of different modalities, a dominant modality is then selected during the classification training phase to improve the distinctiveness of the representations and thereby the retrieval performance. To evaluate our proposed approach, we conducted a case study, and the results demonstrate that our VAT-CMR model surpasses competing approaches. Further, our proposed dominant modality selection significantly enhances cross-retrieval accuracy.
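
The paper itself does not include code, but the abstract describes two concrete mechanisms: fusing visual, audio and tactile embeddings into a holistic object representation, and up-weighting a chosen dominant modality during classification training before using per-modality embeddings for cross-modal retrieval. The PyTorch sketch below illustrates where such steps could sit in a classification-then-retrieval pipeline; the module names, feature dimensions, weighted-sum fusion and weighting scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of the ideas described in the VAT-CMR abstract.
# All names, dimensions and the fusion/weighting strategy are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Stand-in encoder mapping one modality's features to a shared embedding space."""

    def __init__(self, in_dim, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x):
        # Unit-norm embeddings so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)


class VATCMRSketch(nn.Module):
    """Fuses visual/audio/tactile embeddings and classifies objects,
    up-weighting a chosen dominant modality during training."""

    def __init__(self, dims=(512, 128, 256), embed_dim=128, num_classes=10):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(d, embed_dim) for d in dims)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, inputs, dominant=None, alpha=2.0):
        embeds = [enc(x) for enc, x in zip(self.encoders, inputs)]
        weights = torch.ones(len(embeds))
        if dominant is not None:
            weights[dominant] = alpha          # emphasise the dominant modality
        weights = weights / weights.sum()
        fused = sum(w * e for w, e in zip(weights, embeds))  # holistic representation
        return fused, self.classifier(fused)


# Toy classification-training step: visual (index 0) chosen as the dominant modality.
model = VATCMRSketch()
vis, aud, tac = torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 256)
labels = torch.randint(0, 10, (8,))
fused, logits = model((vis, aud, tac), dominant=0)
loss = F.cross_entropy(logits, labels)
loss.backward()

# Cross-modal retrieval after training: rank an audio gallery for a visual query
# by cosine similarity between the per-modality embeddings.
with torch.no_grad():
    query = model.encoders[0](vis[:1])     # one visual query
    gallery = model.encoders[1](aud)       # audio gallery
    scores = query @ gallery.T             # (1, 8) similarity scores
    ranked = scores.argsort(dim=-1, descending=True)
```

The actual VAT-CMR model may fuse modalities and select the dominant one differently (e.g., with attention or a learned selection criterion); the sketch only shows how a dominant-modality weighting could plug into joint classification training followed by embedding-based retrieval.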

Authors (4)
  1. Jagoda Wojcik (1 paper)
  2. Jiaqi Jiang (34 papers)
  3. Jiacheng Wu (16 papers)
  4. Shan Luo (74 papers)
