Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling (2403.14551v1)

Published 21 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
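
The abstract describes the training objective only at a high level. As a concrete illustration, the following is a minimal PyTorch sketch of a combined objective of this kind: a standard next-token prediction loss plus a per-token contrastive (InfoNCE) loss that aligns early-layer token representations with paired image embeddings. The grounding layer index, the shared embedding dimensionality, the uniform treatment of tokens (no padding mask or learned token weighting), the loss weight, the temperature, and the HuggingFace-style model interface are all illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch (not the paper's exact implementation) of a LexiContrastive
# Grounding-style objective: next-token prediction plus a contrastive loss
# that grounds early-layer (lexical) token representations in paired images.
# Assumes a HuggingFace-style causal LM and an image encoder whose output
# dimensionality matches the LM hidden size; padding masks and per-token
# weighting are omitted for brevity.
import torch
import torch.nn.functional as F


def lexicontrastive_loss(lm, image_encoder, token_ids, images,
                         ground_layer=1, alpha=1.0, temperature=0.07):
    out = lm(input_ids=token_ids, output_hidden_states=True)

    # Standard next-token prediction loss on the caption text.
    logits = out.logits[:, :-1, :]
    targets = token_ids[:, 1:]
    lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets.reshape(-1))

    # Contrastive grounding on an early layer, where lexical information is
    # assumed to be encoded. Each token acts as a query whose positive is the
    # image paired with its caption; other images in the batch are negatives.
    early = out.hidden_states[ground_layer]                    # (B, T, D)
    tok = F.normalize(early, dim=-1)
    img = F.normalize(image_encoder(images), dim=-1)           # (B, D)
    sim = torch.einsum('btd,kd->btk', tok, img) / temperature  # (B, T, B)

    B, T, _ = sim.shape
    labels = torch.arange(B, device=sim.device).unsqueeze(1).expand(B, T)
    grounding_loss = F.cross_entropy(sim.reshape(B * T, B),
                                     labels.reshape(B * T))

    return lm_loss + alpha * grounding_loss
```

In a training loop, this loss would be computed on batches of paired image-caption data and backpropagated jointly through the language model; the abstract does not specify the relative weighting of the two terms, so alpha above is purely a placeholder.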

References (54)
  1. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  2. Suhas Arehalli and Tal Linzen. 2020. Neural language models capture some, but not all, agreement attraction effects.
  3. A computational acquisition model for multimodal word categorization. arXiv preprint arXiv:2205.05974.
  4. Experience grounds language. arXiv preprint arXiv:2004.10151.
  5. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  6. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.
  7. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46:904–911.
  8. English semantic feature production norms: An extended database of 4436 concepts. Behavior Research Methods, 51:1849–1863.
  9. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660.
  10. Charlotte Caucheteux and Jean-Rémi King. 2022. Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1):134.
  11. Tyler A Chang and Benjamin K Bergen. 2022. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16.
  12. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
  13. A method for studying semantic construal in grammatical constructions with interpretable contextual embedding spaces. arXiv preprint arXiv:2305.18598.
  14. Real-world visual statistics and infants’ first-learned object names. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711):20160055.
  15. ImageNet: A large-scale hierarchical image database.
  16. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  17. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  18. Michael C Frank. 2023. Bridging the data gap between children and large language models. Trends in Cognitive Sciences.
  19. Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, 44(3):677–694.
  20. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
  21. Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3):369–380.
  22. BabyBERTa: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 624–646.
  23. Talia Konkle and George A Alvarez. 2022. A self-supervised domain-general learning framework for human ventral stream representation. Nature Communications, 13(1):491.
  24. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73.
  25. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44:978–990.
  26. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  27. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  28. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
  29. Brian MacWhinney. 2014. The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.
  30. Learning the meanings of function words from grounded language using a visual question answering model. arXiv preprint arXiv:2308.08628.
  31. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  32. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  33. The CogALex-V shared task on the corpus-based identification of semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 69–79.
  34. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118.
  35. Sara E Schroer and Chen Yu. 2023. Looking is not enough: Multimodal attention supports the real-time learning of new words. Developmental Science, 26(2):e13290.
  36. Touch to learn: Multisensory input supports word learning and processing. Developmental Science, page e13419.
  37. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650.
  38. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective.
  39. Hao Tan and Mohit Bansal. 2020. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. arXiv preprint arXiv:2010.06775.
  40. Distilling relation embeddings from pre-trained language models. arXiv preprint arXiv:2110.15705.
  41. Attention is all you need. Advances in Neural Information Processing Systems, 30.
  42. Grounded language acquisition through the eyes and ears of a single child. Science, 383(6682):504–511.
  43. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
  44. Alex Warstadt and Samuel R Bowman. 2022. What artificial neural networks can tell us about human language acquisition. Algebraic Structures in Natural Language, pages 17–60.
  45. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning.
  46. Kelsey L West and Jana M Iverson. 2017. Language learning is hands-on: Exploring links between infants’ object manipulation and verbal input. Cognitive Development, 43:190–200.
  47. On the predictive power of neural language models for human real-time comprehension behavior. arXiv preprint arXiv:2006.01912.
  48. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  49. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
  50. Robert Wolfe and Aylin Caliskan. 2022. Contrastive visual semantic pretraining magnifies the semantics of natural language representations. arXiv preprint arXiv:2203.07511.
  51. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.
  52. Visual grounding helps learn word meanings in low-data regimes. arXiv preprint arXiv:2310.13257.
  53. How well do unsupervised learning algorithms model human real-time and life-long learning? Advances in Neural Information Processing Systems, 35:22628–22642.
  54. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences, 118(3).
Authors (3)
  1. Chengxu Zhuang (15 papers)
  2. Evelina Fedorenko (19 papers)
  3. Jacob Andreas (116 papers)