
Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters (2305.07358v4)

Published 12 May 2023 in cs.CL

Abstract: Humans learn language via multi-modal knowledge. However, due to their text-only pre-training scheme, most existing pre-trained language models (PLMs) cannot benefit from such multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or the image encoder of vision-language models (VLMs) to encode visual information, and update all of the PLM's original parameters for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently inject it into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential of VLMs, X-adapters consist of two sub-modules, V-expert and T-expert, which fuse VLMs' image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
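The abstract describes X-adapters as plug-and-play modules inserted into a frozen PLM, with a V-expert and a T-expert that fuse a pre-trained VLM's image and text representations while only the adapter parameters are trained. The PyTorch sketch below illustrates that general idea under stated assumptions; the bottleneck size, the use of cross-attention for fusion, and the CLIP-style feature dimensions are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn


class XAdapter(nn.Module):
    """Illustrative sketch (not the paper's implementation): a bottleneck adapter
    with a V-expert and a T-expert that inject frozen VLM (e.g. CLIP-style)
    image/text features into one layer's hidden states of a frozen PLM."""

    def __init__(self, plm_dim=768, vlm_dim=512, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(plm_dim, bottleneck)   # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, plm_dim)     # up-projection
        # Experts: cross-attention from PLM tokens to VLM representations
        # (fusion mechanism is an assumption for illustration).
        self.v_expert = nn.MultiheadAttention(
            bottleneck, num_heads=4, kdim=vlm_dim, vdim=vlm_dim, batch_first=True)
        self.t_expert = nn.MultiheadAttention(
            bottleneck, num_heads=4, kdim=vlm_dim, vdim=vlm_dim, batch_first=True)
        self.norm = nn.LayerNorm(plm_dim)

    def forward(self, hidden, image_feats=None, text_feats=None):
        # hidden: (B, L, plm_dim); image_feats / text_feats: (B, N, vlm_dim)
        h = self.act(self.down(hidden))
        if image_feats is not None:                      # activate V-expert
            h = h + self.v_expert(h, image_feats, image_feats)[0]
        if text_feats is not None:                       # activate T-expert
            h = h + self.t_expert(h, text_feats, text_feats)[0]
        return self.norm(hidden + self.up(h))            # residual connection


# Toy usage: only the adapter's parameters would be passed to the optimizer,
# while the PLM and VLM stay frozen.
plm_hidden = torch.randn(2, 16, 768)      # dummy PLM hidden states
clip_image = torch.randn(2, 50, 512)      # dummy VLM image patch features
adapter = XAdapter()
out = adapter(plm_hidden, image_feats=clip_image)
print(out.shape)                          # torch.Size([2, 16, 768])
```

In a full setup, one such block would be inserted after selected transformer layers of the PLM, with the choice of activating the V-expert, the T-expert, or both made per downstream task, mirroring the flexibility described in the abstract.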

Authors (4)
  1. Xinyun Zhang (9 papers)
  2. Haochen Tan (13 papers)
  3. Han Wu (124 papers)
  4. Bei Yu (113 papers)
Citations (1)