Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling (2403.14551v1)
Abstract: Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive - but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
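The abstract only sketches the training objective, so the following is a minimal PyTorch sketch of how the two losses might be combined. Everything here is an illustrative assumption rather than the paper's exact recipe: the grounding layer index, temperature, and mixing weight `alpha` are placeholders, image embeddings are assumed to come from a frozen pretrained visual encoder, and mean-pooling the early-layer token states is a simplification of the paper's lexicon-level (per-token) grounding.

```python
# Minimal sketch of the LexiContrastive Grounding idea: a causal LM trained
# with (1) next-token prediction and (2) a CLIP-style contrastive loss that
# grounds EARLY-layer token representations in paired image embeddings.
# Hyperparameters and pooling are illustrative assumptions, not the paper's.
import torch
import torch.nn.functional as F
from torch import nn


class LCGSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, n_layers=6,
                 image_dim=512, ground_layer=1, temperature=0.07, alpha=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Projects early-layer token states into the image-embedding space.
        self.text_proj = nn.Linear(d_model, image_dim)
        self.ground_layer = ground_layer  # assumption: which layer to ground
        self.temperature = temperature    # assumption: contrastive temperature
        self.alpha = alpha                # assumption: loss mixing weight

    def forward(self, tokens, image_embs):
        # tokens: (B, T) ids for image captions; image_embs: (B, image_dim)
        # from a frozen pretrained visual encoder (an assumption here).
        B, T = tokens.shape
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.embed(tokens)
        grounded = h
        for i, block in enumerate(self.blocks):
            h = block(h, src_mask=causal)
            if i == self.ground_layer:
                grounded = h  # early-layer states carry lexical information

        # (1) Next-token prediction: logits at position t predict token t+1.
        logits = self.lm_head(h)
        lm_loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1))

        # (2) Contrastive grounding: pool the early-layer states, project,
        # and align each caption with its own image against in-batch
        # negatives (symmetric InfoNCE). Mean-pooling stands in for the
        # paper's lexicon-level (per-token) grounding.
        txt = F.normalize(self.text_proj(grounded.mean(dim=1)), dim=-1)
        img = F.normalize(image_embs, dim=-1)
        sim = txt @ img.t() / self.temperature  # (B, B) similarity matrix
        targets = torch.arange(B, device=tokens.device)
        ground_loss = 0.5 * (F.cross_entropy(sim, targets)
                             + F.cross_entropy(sim.t(), targets))

        return lm_loss + self.alpha * ground_loss


if __name__ == "__main__":
    model = LCGSketch()
    tokens = torch.randint(0, 32000, (4, 16))   # toy caption batch
    image_embs = torch.randn(4, 512)            # toy image features
    loss = model(tokens, image_embs)
    loss.backward()
```

The design choice the abstract emphasizes is where the grounding signal is applied: attaching the contrastive head to an early layer targets the representations that encode lexical content, while later layers remain specialized for next-token prediction.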
- Jean-Baptiste Alayrac et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
- Suhas Arehalli and Tal Linzen. 2020. Neural language models capture some, but not all, agreement attraction effects.
- Uri Berger et al. 2022. A computational acquisition model for multimodal word categorization. arXiv preprint arXiv:2205.05974.
- Yonatan Bisk et al. 2020. Experience grounds language. arXiv preprint arXiv:2004.10151.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Elia Bruni et al. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.
- Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46:904–911.
- Erin M. Buchanan et al. 2019. English semantic feature production norms: An extended database of 4436 concepts. Behavior Research Methods, 51:1849–1863.
- Mathilde Caron et al. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660.
- Charlotte Caucheteux and Jean-Rémi King. 2022. Brains and algorithms partially converge in natural language processing. Communications Biology, 5(1):134.
- Tyler A Chang and Benjamin K Bergen. 2022. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16.
- Soravit Changpinyo et al. 2021. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
- Gabriella Chronis et al. 2023. A method for studying semantic construal in grammatical constructions with interpretable contextual embedding spaces. arXiv preprint arXiv:2305.18598.
- Elizabeth M. Clerkin et al. 2017. Real-world visual statistics and infants' first-learned object names. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711):20160055.
- Jia Deng et al. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Alexey Dosovitskiy et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Michael C Frank. 2023. Bridging the data gap between children and large language models. Trends in Cognitive Sciences.
- Michael C. Frank, Mika Braginsky, Daniel Yurovsky, and Virginia A. Marchman. 2017. Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, 44(3):677–694.
- Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
- Ariel Goldstein et al. 2022. Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3):369–380.
- Philip A. Huebner, Elior Sulem, Cynthia Fisher, and Dan Roth. 2021. BabyBERTa: Learning more grammar with small-scale child-directed language. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 624–646.
- Talia Konkle and George A. Alvarez. 2022. A self-supervised domain-general learning framework for human ventral stream representation. Nature Communications, 13(1):491.
- Ranjay Krishna et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73.
- Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44:978–990.
- Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Jiasen Lu et al. 2022. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
- Brian MacWhinney. 2014. The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.
- Eva Portelance et al. 2023. Learning the meanings of function words from grounded language using a visual question answering model. arXiv preprint arXiv:2308.08628.
- Alec Radford et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Alec Radford et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
- Enrico Santus et al. 2016. The CogALex-V shared task on the corpus-based identification of semantic relations. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 69–79.
- Martin Schrimpf et al. 2021. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45):e2105646118.
- Sara E Schroer and Chen Yu. 2023. Looking is not enough: Multimodal attention supports the real-time learning of new words. Developmental Science, 26(2):e13290.
- Amanda Seidl et al. 2023. Touch to learn: Multisensory input supports word learning and processing. Developmental Science, page e13419.
- Amanpreet Singh et al. 2022. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650.
- Jessica Sullivan, Michelle Mei, Andrew Perfors, Erica Wojcik, and Michael C. Frank. 2021. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant's perspective. Open Mind, 5:20–29.
- Hao Tan and Mohit Bansal. 2020. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. arXiv preprint arXiv:2010.06775.
- Asahi Ushio, Jose Camacho-Collados, and Steven Schockaert. 2021. Distilling relation embeddings from pre-trained language models. arXiv preprint arXiv:2110.15705.
- Ashish Vaswani et al. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Wai Keen Vong, Wentao Wang, A. Emin Orhan, and Brenden M. Lake. 2024. Grounded language acquisition through the eyes and ears of a single child. Science, 383(6682):504–511.
- Jianfeng Wang et al. 2022. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
- Alex Warstadt and Samuel R Bowman. 2022. What artificial neural networks can tell us about human language acquisition. Algebraic Structures in Natural Language, pages 17–60.
- Alex Warstadt et al. 2023. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning.
- Kelsey L West and Jana M Iverson. 2017. Language learning is hands-on: Exploring links between infants’ object manipulation and verbal input. Cognitive Development, 43:190–200.
- Ethan Gotlieb Wilcox et al. 2020. On the predictive power of neural language models for human real-time comprehension behavior. arXiv preprint arXiv:2006.01912.
- Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
- Thomas Wolf et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Robert Wolfe and Aylin Caliskan. 2022. Contrastive visual semantic pretraining magnifies the semantics of natural language representations. arXiv preprint arXiv:2203.07511.
- Yian Zhang et al. 2020. When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.
- Chengxu Zhuang, Evelina Fedorenko, and Jacob Andreas. 2023. Visual grounding helps learn word meanings in low-data regimes. arXiv preprint arXiv:2310.13257.
- Chengxu Zhuang et al. 2022. How well do unsupervised learning algorithms model human real-time and life-long learning? Advances in Neural Information Processing Systems, 35:22628–22642.
- Chengxu Zhuang et al. 2021. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences, 118(3).
Authors: Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas