Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling (2403.14551v1)
Abstract: Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
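The abstract only sketches the training objective, so the following is a minimal PyTorch sketch of how the two losses might be combined. Everything here is an illustrative assumption rather than the paper's exact recipe: the grounding layer index, temperature, and mixing weight `alpha` are placeholders, image embeddings are assumed to come from a frozen pretrained visual encoder, and mean-pooling the early-layer token states is a simplification of the paper's lexicon-level (per-token) grounding.

```python
# Minimal sketch of the LexiContrastive Grounding idea: a causal LM trained
# with (1) next-token prediction and (2) a CLIP-style contrastive loss that
# grounds EARLY-layer token representations in paired image embeddings.
# Hyperparameters and pooling are illustrative assumptions, not the paper's.
import torch
import torch.nn.functional as F
from torch import nn


class LCGSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_layers=6,
                 image_dim=512, ground_layer=1, temperature=0.07, alpha=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Projects early-layer token states into the image-embedding space.
        self.text_proj = nn.Linear(d_model, image_dim)
        self.ground_layer = ground_layer  # assumption: which layer to ground
        self.temperature = temperature    # assumption: contrastive temperature
        self.alpha = alpha                # assumption: loss mixing weight

    def forward(self, tokens, image_embs):
        # tokens: (B, T) ids for image captions; image_embs: (B, image_dim)
        # from a frozen pretrained visual encoder (an assumption here).
        B, T = tokens.shape
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.embed(tokens)
        grounded = h
        for i, block in enumerate(self.blocks):
            h = block(h, src_mask=causal)
            if i == self.ground_layer:
                grounded = h  # early-layer states carry lexical information

        # (1) Next-token prediction: logits at position t predict token t+1.
        logits = self.lm_head(h)
        lm_loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1))

        # (2) Contrastive grounding: pool the early-layer states, project,
        # and align each caption with its own image against in-batch
        # negatives (symmetric InfoNCE). Mean-pooling stands in for the
        # paper's lexicon-level (per-token) grounding.
        txt = F.normalize(self.text_proj(grounded.mean(dim=1)), dim=-1)
        img = F.normalize(image_embs, dim=-1)
        sim = txt @ img.t() / self.temperature  # (B, B) similarity matrix
        targets = torch.arange(B, device=tokens.device)
        ground_loss = 0.5 * (F.cross_entropy(sim, targets)
                             + F.cross_entropy(sim.t(), targets))

        return lm_loss + self.alpha * ground_loss


if __name__ == "__main__":
    model = LCGSketch()
    tokens = torch.randint(0, 32000, (4, 16))   # toy caption batch
    image_embs = torch.randn(4, 512)            # toy image features
    loss = model(tokens, image_embs)
    loss.backward()
```

The design choice the abstract emphasizes is where the grounding signal is applied: attaching the contrastive head to an early layer targets the representations that encode lexical content, while later layers remain specialized for next-token prediction.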
- Jean-Baptiste Alayrac et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
- Suhas Arehalli and Tal Linzen. 2020. Neural language models capture some, but not all, agreement attraction effects.
- Uri Berger et al. 2022. A computational acquisition model for multimodal word categorization. arXiv preprint arXiv:2205.05974.
- Yonatan Bisk et al. 2020. Experience grounds language. arXiv preprint arXiv:2004.10151.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Elia Bruni et al. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.
- Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46:904–911.
- Erin M. Buchanan et al. 2019. English semantic feature production norms: An extended database of 4436 concepts. Behavior Research Methods, 51:1849–1863.
- Mathilde Caron et al. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660.
- Charlotte Caucheteux and Jean-Rémi King. 2022. Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1):134.
- Tyler A Chang and Benjamin K Bergen. 2022. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16.
- Soravit Changpinyo et al. 2021. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
- Gabriella Chronis et al. 2023. A method for studying semantic construal in grammatical constructions with interpretable contextual embedding spaces. arXiv preprint arXiv:2305.18598.
- Elizabeth M. Clerkin et al. 2017. Real-world visual statistics and infants' first-learned object names. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711):20160055.
- Jia Deng et al. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Alexey Dosovitskiy et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Michael C Frank. 2023. Bridging the data gap between children and large language models. Trends in Cognitive Sciences.
- Michael C. Frank, Mika Braginsky, Daniel Yurovsky, and Virginia A. Marchman. 2017. Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, 44(3):677–694.
- Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
- Ariel Goldstein et al. 2022. Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3):369–380.
- Philip A. Huebner, Elior Sulem, Cynthia Fisher, and Dan Roth. 2021. BabyBERTa: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 624–646.
- Talia Konkle and George A. Alvarez. 2022. A self-supervised domain-general learning framework for human ventral stream representation. Nature Communications, 13(1):491.
- Ranjay Krishna et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73.
- Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44:978–990.
- Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Jiasen Lu et al. 2022. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
- Brian MacWhinney. 2014. The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.
- Eva Portelance et al. 2023. Learning the meanings of function words from grounded language using a visual question answering model. arXiv preprint arXiv:2308.08628.
- Alec Radford et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
- Enrico Santus et al. 2016. The CogALex-V shared task on the corpus-based identification of semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 69–79.
- Martin Schrimpf et al. 2021. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118.
- Sara E Schroer and Chen Yu. 2023. Looking is not enough: Multimodal attention supports the real-time learning of new words. Developmental Science, 26(2):e13290.
- Amanda Seidl et al. 2023. Touch to learn: Multisensory input supports word learning and processing. Developmental Science, page e13419.
- Amanpreet Singh et al. 2022. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650.
- Jessica Sullivan, Michelle Mei, Andrew Perfors, Erica Wojcik, and Michael C. Frank. 2021. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant's perspective. Open Mind, 5:20–29.
- Hao Tan and Mohit Bansal. 2020. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. arXiv preprint arXiv:2010.06775.
- Asahi Ushio, Jose Camacho-Collados, and Steven Schockaert. 2021. Distilling relation embeddings from pre-trained language models. arXiv preprint arXiv:2110.15705.
- Ashish Vaswani et al. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Wai Keen Vong, Wentao Wang, A. Emin Orhan, and Brenden M. Lake. 2024. Grounded language acquisition through the eyes and ears of a single child. Science, 383(6682):504–511.
- Jianfeng Wang et al. 2022. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
- Alex Warstadt and Samuel R Bowman. 2022. What artificial neural networks can tell us about human language acquisition. Algebraic Structures in Natural Language, pages 17–60.
- Alex Warstadt et al. 2023. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning.
- Kelsey L West and Jana M Iverson. 2017. Language learning is hands-on: Exploring links between infants’ object manipulation and verbal input. Cognitive Development, 43:190–200.
- Ethan Gotlieb Wilcox et al. 2020. On the predictive power of neural language models for human real-time comprehension behavior. arXiv preprint arXiv:2006.01912.
- Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
- Thomas Wolf et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Robert Wolfe and Aylin Caliskan. 2022. Contrastive visual semantic pretraining magnifies the semantics of natural language representations. arXiv preprint arXiv:2203.07511.
- Yian Zhang et al. 2020. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.
- Chengxu Zhuang, Evelina Fedorenko, and Jacob Andreas. 2023. Visual grounding helps learn word meanings in low-data regimes. arXiv preprint arXiv:2310.13257.
- Chengxu Zhuang et al. 2022. How well do unsupervised learning algorithms model human real-time and life-long learning? Advances in Neural Information Processing Systems, 35:22628–22642.
- Chengxu Zhuang et al. 2021. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences, 118(3).
Authors: Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas