Acquiring Linguistic Knowledge from Multimodal Input (2402.17936v1)

Published 27 Feb 2024 in cs.CL

Abstract: In contrast to children, language models (LMs) exhibit considerably inferior data efficiency when acquiring language. In this submission to the BabyLM Challenge (Warstadt et al., 2023), we test the hypothesis that this data efficiency gap is partly caused by a lack of multimodal input and grounding in the learning environment of typical language models. Although previous work looking into this question found that multimodal training can even harm language-only performance, we speculate that these findings can be attributed to catastrophic forgetting of complex language due to fine-tuning on captions data. To test our hypothesis, we perform an ablation study on FLAVA (Singh et al., 2022), a multimodal vision-and-language model, independently varying the volume of text and vision input to quantify how much text data (if any) can be offset by vision at different data scales. We aim to limit catastrophic forgetting through a multitask pretraining regime that includes unimodal text-only tasks and data sampled from WiT, the relatively diverse Wikipedia-based dataset (Srinivasan et al., 2021). Our results are largely negative: Multimodal pretraining does not harm our models' language performance but does not consistently help either. That said, our conclusions are limited by our having been able to conduct only a small number of runs. While we must leave open the possibility that multimodal input explains some of the gap in data efficiency between LMs and humans, positive evidence for this hypothesis will require better architectures and techniques for multimodal training.
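
The core of the described design is an ablation grid that varies the text budget and the vision budget independently, while a unimodal text objective is kept in every run to limit catastrophic forgetting. The sketch below is a minimal, hypothetical illustration of such a grid; the budget values, objective names, and helper function are assumptions made for illustration, not the authors' actual configuration or code.

```python
# A minimal, hypothetical sketch of the ablation design described in the abstract:
# text and vision budgets are varied independently so the contribution of each
# modality can be isolated. All names and budget values here are assumptions.
from itertools import product

TEXT_BUDGETS = [10_000_000, 100_000_000]    # words of text-only data (assumed scales)
VISION_BUDGETS = [0, 400_000, 4_000_000]    # image-caption pairs, e.g. sampled from WiT (assumed scales)

def make_run_config(n_text_words: int, n_image_text_pairs: int) -> dict:
    """Build one pretraining configuration for a FLAVA-style ablation grid."""
    objectives = ["masked_language_modeling"]   # unimodal text task, kept in every run
    if n_image_text_pairs > 0:
        # Multimodal objectives are only active when vision data is present;
        # keeping the unimodal task alongside them is the multitask strategy
        # intended to limit catastrophic forgetting of language-only skills.
        objectives += ["image_text_matching", "global_contrastive"]
    return {
        "model": "FLAVA",
        "text_words": n_text_words,
        "image_text_pairs": n_image_text_pairs,
        "objectives": objectives,
    }

if __name__ == "__main__":
    for n_text, n_pairs in product(TEXT_BUDGETS, VISION_BUDGETS):
        print(make_run_config(n_text, n_pairs))
```

Enumerating the grid this way makes the key comparison explicit: runs that differ only in the vision budget share the same text budget and the same unimodal objective, so any difference in language performance can be attributed to the multimodal input.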

References (58)
  1. Theodor Amariucai. 2023. Acquiring linguistic knowledge from multimodal input. Master's thesis, ETH Zürich, Zürich.
  2. Lukas Biewald. 2020. Experiment tracking with weights and biases. Software available from wandb.com.
  3. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics, 9:978–994.
  4. Microsoft COCO Captions: Data Collection and Evaluation Server. In European conference on computer vision, pages 740–755. Springer. ArXiv: 1504.00325.
  5. UNITER: UNiversal Image-TExt Representation Learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, page 104–120, Berlin, Heidelberg. Springer-Verlag.
  6. Learning language through pictures. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 112–118, Beijing, China. Association for Computational Linguistics.
  7. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  9. William Falcon and The PyTorch Lightning team. 2019. PyTorch Lightning.
  10. Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9847–9857, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  11. Word learning and the acquisition of syntactic–semantic overhypotheses. In Proceedings of the 40th annual meeting of the cognitive science society, Madison, Wisconsin.
  12. Mapping the Early Language Environment Using All-Day Recordings and Automated Analysis. American Journal of Speech-Language Pathology, 26(2):248–265.
  13. A Systematic Assessment of Syntactic Generalization in Neural Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744, Online. Association for Computational Linguistics.
  14. Taichi Iki and Akiko Aizawa. 2021. Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2189–2196, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  15. Cross-situational word learning is both implicit and strategic. Frontiers in Psychology, 5.
  16. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1):32–73.
  17. Tatsuki Kuribayashi. 2023. Does Vision Accelerate Hierarchical Generalization of Neural Language Learners? ArXiv:2302.00667 [cs].
  18. Multimodal Word Meaning Induction From Minimal Exposure to Natural Text. Cognitive Science, 41(S4):677–705.
  19. Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184. Association for Computational Linguistics.
  20. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 34:11336–11344.
  21. VisualBERT: A simple and performant baseline for vision and language.
  22. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  23. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  24. Avinash Madasu and Vasudev Lal. 2023. Is multimodal vision supervision beneficial to language?
  25. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision. In Proceedings of ICLR.
  26. Rebecca Marvin and Tal Linzen. 2018. Targeted Syntactic Evaluation of Language Models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.
  27. Michael McCloskey and Neal J. Cohen. 1989. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier.
  28. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 217–227, Online. Association for Computational Linguistics.
  29. Mitja Nikolaus and Abdellah Fourtassi. 2021. Evaluating the acquisition of semantic knowledge from cross-situational learning in artificial neural networks. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 200–210, Online. Association for Computational Linguistics.
  30. Learning the meanings of function words from grounded language using a visual question answering model. ArXiv:2308.08628 [cs].
  31. VoLTA: Vision-language transformer with weakly-supervised local-feature alignment. ArXiv, abs/2210.04135.
  32. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
  33. Deb K Roy and Alex P Pentland. 2002. Learning words from sights and sounds: a computational model. Cognitive Science, 26(1):113–146.
  34. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
  35. Language with vision: a study on grounded word and sentence embeddings.
  36. FLAVA: A foundational language and vision alignment model. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15617–15629, Los Alamitos, CA, USA. IEEE Computer Society.
  37. Andrew D. M. Smith and Kenny Smith. 2012. Cross-Situational Learning, pages 864–866. Springer US, Boston, MA.
  38. Cross-Situational Learning: An Experimental Study of Word-Learning Mechanisms. Cognitive Science, 35(3):480–498.
  39. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, page 2443–2449, New York, NY, USA. Association for Computing Machinery.
  40. SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded From the Infant’s Perspective. Open Mind, 5:20–29.
  41. Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.
  42. Wai Keen Vong and Brenden M. Lake. 2022. Cross-Situational Word Learning With Multimodal Neural Networks. Cognitive Science, 46(4):e13122.
  43. Alex Wang and Kyunghyun Cho. 2019. BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36, Minneapolis, Minnesota. Association for Computational Linguistics.
  44. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). arXiv:1905.00537.
  45. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
  46. Language-Mediated, Object-Centric Representation Learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2033–2046, Online. Association for Computational Linguistics.
  47. Finding Structure in One Child’s Linguistic Experience. Cognitive Science, 47(6):e13305.
  48. Alex Warstadt and Samuel R Bowman. 2022. What artificial neural networks can tell us about human language acquisition. In Shalom Lappin and Jean-Philippe Bernardy, editors, Algebraic Structures in Natural Language, pages 17–60. CRC Press.
  49. Findings of the 2023 BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the 2023 BabyLM Challenge. Association for Computational Linguistics (ACL).
  50. BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392. doi:10.1162/tacl_a_00321.
  51. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics.
  52. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  53. Yiqun Yao and Rada Mihalcea. 2022. Modality-specific learning rates for effective multimodal additive late-fusion. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1824–1834, Dublin, Ireland. Association for Computational Linguistics.
  54. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4):3208–3216.
  55. Does Vision-and-Language Pretraining Improve Lexical Grounding? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4357–4366, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  56. Chapter Two - Mechanisms of Cross-situational Learning: Behavioral and Computational Evidence. In Advances in Child Development and Behavior, volume 56, pages 37–63. JAI.
  57. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Online. Association for Computational Linguistics.
  58. Unified Vision-Language Pre-Training for Image Captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):13041–13049.
Authors (2)
  1. Theodor Amariucai (1 paper)
  2. Alex Warstadt (35 papers)
Citations (1)