Prototypical Contrastive Transfer Learning for Multimodal Language Understanding (2307.05942v1)

Published 12 Jul 2023 in cs.RO, cs.CL, and cs.CV

Abstract: Although domestic service robots are expected to assist individuals who require support, they cannot currently interact smoothly with people through natural language. For example, given the instruction "Bring me a bottle from the kitchen," it is difficult for such robots to specify the bottle in an indoor environment. Most conventional models have been trained on real-world datasets that are labor-intensive to collect, and they have not fully leveraged simulation data through a transfer learning framework. In this study, we propose a novel transfer learning approach for multimodal language understanding called Prototypical Contrastive Transfer Learning (PCTL), which uses a new contrastive loss called Dual ProtoNCE. We introduce PCTL to the task of identifying target objects in domestic environments according to free-form natural language instructions. To validate PCTL, we built new real-world and simulation datasets. Our experiment demonstrated that PCTL outperformed existing methods. Specifically, PCTL achieved an accuracy of 78.1%, whereas simple fine-tuning achieved an accuracy of 73.4%.
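
The abstract names a new contrastive loss, Dual ProtoNCE, but does not define it. As illustrative background only, the sketch below shows a standard ProtoNCE-style objective (an InfoNCE term plus a cluster-prototype term) in PyTorch; how PCTL actually combines the simulation and real-world domains, and every function name, shape, and hyperparameter in the snippet, are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F


def info_nce(query, key, temperature=0.1):
    """InfoNCE term: row i of `key` is the positive for row i of `query`."""
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    logits = query @ key.t() / temperature               # (N, N) cosine similarities
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)


def proto_nce(features, prototypes, assignments, concentration):
    """Prototype term: pull each embedding toward its assigned cluster centroid.

    prototypes:    (K, D) centroids, e.g. from k-means over the embeddings
    assignments:   (N,)   cluster index of each sample
    concentration: (K,)   per-cluster temperature (tighter cluster -> smaller value)
    """
    features = F.normalize(features, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    logits = (features @ prototypes.t()) / concentration  # broadcast over the K columns
    return F.cross_entropy(logits, assignments)


# Toy usage with random multimodal embeddings; shapes and the "dual" combination
# over simulation and real-world features are assumptions for illustration.
N, D, K = 32, 128, 8
sim_feats, real_feats = torch.randn(N, D), torch.randn(N, D)
protos = torch.randn(K, D)                 # shared prototypes (e.g. k-means centroids)
assign = torch.randint(0, K, (N,))
phi = torch.full((K,), 0.1)

loss = info_nce(sim_feats, real_feats)
loss = loss + proto_nce(sim_feats, protos, assign, phi)
loss = loss + proto_nce(real_feats, protos, assign, phi)
```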
