
Probing Contextual Language Models for Common Ground with Visual Representations (2005.00619v5)

Published 1 May 2020 in cs.CL and cs.CV

Abstract: The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
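
The abstract describes the probing setup only at a high level: a model scores how well a text-only representation of a noun picks out the matching image patch among candidates. Below is a minimal, hypothetical sketch of one way such a probe could look, using a learned linear projection from text-embedding space into visual-embedding space and cosine-similarity scoring. The class name `VisualProbe`, the dimensions, and the scoring rule are illustrative assumptions, not the authors' implementation; the training loop (for example, a contrastive loss over matching versus non-matching patches) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualProbe(nn.Module):
    """Illustrative probe: maps a contextual text embedding into a
    visual embedding space and scores candidate image patches.
    Dimensions are placeholders, not the paper's actual settings."""
    def __init__(self, text_dim=768, visual_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_emb, patch_embs):
        # Project the (contextual) noun representation into visual space,
        # then score each candidate patch embedding by cosine similarity.
        q = F.normalize(self.proj(text_emb), dim=-1)  # (visual_dim,)
        k = F.normalize(patch_embs, dim=-1)           # (num_patches, visual_dim)
        return k @ q                                  # (num_patches,) similarity scores

# Toy usage with random embeddings: one noun representation against five
# candidate patches. After training, the matching patch should score highest.
probe = VisualProbe()
text_emb = torch.randn(768)
patch_embs = torch.randn(5, 512)
scores = probe(text_emb, patch_embs)
print(scores.argmax().item())  # index of the retrieved patch
```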

Authors (4)
  1. Gabriel Ilharco (26 papers)
  2. Rowan Zellers (25 papers)
  3. Ali Farhadi (138 papers)
  4. Hannaneh Hajishirzi (176 papers)
Citations (14)
