Visual Grounding Helps Learn Word Meanings in Low-Data Regimes (2310.13257v2)

Published 20 Oct 2023 in cs.CL and cs.AI

Abstract: Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways - requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models' learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant -- models mainly driven by visual information yield qualitatively different word representations from those mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.
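
To make the evaluation setup concrete, one of the benchmarks named in the abstract, word similarity, is typically scored by correlating a model's embedding similarities with human similarity judgments (e.g., SimLex-999). The sketch below is illustrative only, not the paper's actual evaluation code; the function names, the toy vocabulary, and the random embeddings are assumptions standing in for real model representations.

```python
# Minimal sketch (not the authors' code) of a word-similarity evaluation:
# compare cosine similarities of model embeddings against human ratings
# such as SimLex-999 using Spearman rank correlation.

import numpy as np
from scipy.stats import spearmanr


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def word_similarity_score(embeddings: dict[str, np.ndarray],
                          pairs: list[tuple[str, str, float]]) -> float:
    """Spearman correlation between model and human similarity scores.

    `embeddings` maps words to vectors (e.g., extracted from a text-only
    or visually grounded LM); `pairs` holds (word1, word2, human_rating)
    triples. Pairs with out-of-vocabulary words are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation


if __name__ == "__main__":
    # Toy usage: random vectors stand in for real model embeddings.
    rng = np.random.default_rng(0)
    vocab = ["dog", "cat", "car", "truck"]
    emb = {w: rng.normal(size=64) for w in vocab}
    simlex_like = [("dog", "cat", 8.5), ("car", "truck", 9.0), ("dog", "car", 1.5)]
    print(word_similarity_score(emb, simlex_like))
```

A higher correlation under this kind of metric would indicate that a model's representational geometry better matches human judgments; the paper reports that gains from visual grounding on such measures appear mainly in the low-data regime.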

Authors (3)
  1. Chengxu Zhuang (15 papers)
  2. Evelina Fedorenko (19 papers)
  3. Jacob Andreas (116 papers)
Citations (7)