Prototypical Contrastive Transfer Learning for Multimodal Language Understanding (2307.05942v1)
Abstract: Although domestic service robots are expected to assist individuals who require support, they cannot currently interact smoothly with people through natural language. For example, given the instruction "Bring me a bottle from the kitchen," it is difficult for such robots to identify the specified bottle in an indoor environment. Most conventional models have been trained on real-world datasets that are labor-intensive to collect, and they have not fully leveraged simulation data through a transfer learning framework. In this study, we propose a novel transfer learning approach for multimodal language understanding called Prototypical Contrastive Transfer Learning (PCTL), which uses a new contrastive loss called Dual ProtoNCE. We apply PCTL to the task of identifying target objects in domestic environments according to free-form natural language instructions. To validate PCTL, we built new real-world and simulation datasets. Our experiments demonstrated that PCTL outperformed existing methods: PCTL achieved an accuracy of 78.1%, whereas simple fine-tuning achieved 73.4%.
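The abstract names Dual ProtoNCE but does not give its form. As background only, the sketch below illustrates a generic ProtoNCE-style prototypical contrastive loss of the kind such methods build on: an instance-level InfoNCE term over paired views plus a prototype-level term over cluster centroids. This is a minimal sketch, not the paper's Dual ProtoNCE; all names (`proto_nce_loss`, `prototypes`, `concentrations`, `tau`) are illustrative assumptions.

```python
# Minimal sketch of a ProtoNCE-style prototypical contrastive loss.
# NOT the paper's Dual ProtoNCE (its exact form is not given in the abstract);
# it only illustrates the combination of an instance-level InfoNCE term and a
# prototype-level term used in prototypical contrastive learning.
import torch
import torch.nn.functional as F


def proto_nce_loss(z, z_pos, prototypes, assignments, concentrations, tau=0.1):
    """
    z:              (N, D) L2-normalized embeddings of the anchor views
    z_pos:          (N, D) L2-normalized embeddings of the positive views
    prototypes:     (K, D) L2-normalized cluster centroids
    assignments:    (N,)   cluster index of each sample, in [0, K)
    concentrations: (K,)   per-cluster temperature (concentration) estimates
    tau:            instance-level temperature
    """
    # Instance-level InfoNCE: the positive is the paired view of the same
    # sample, negatives are the other samples in the batch.
    logits_inst = z @ z_pos.t() / tau                       # (N, N)
    targets_inst = torch.arange(z.size(0), device=z.device)
    loss_inst = F.cross_entropy(logits_inst, targets_inst)

    # Prototype-level term: each embedding should score highest against its
    # own cluster centroid, with similarities scaled per cluster.
    logits_proto = (z @ prototypes.t()) / concentrations    # (N, K)
    loss_proto = F.cross_entropy(logits_proto, assignments)

    return loss_inst + loss_proto
```

In a transfer setting like the one described, one could imagine computing such a loss over features pooled from simulation and real-world batches, with prototypes obtained by clustering encoder outputs; how Dual ProtoNCE actually pairs the two domains is not specified in the abstract.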