
Probing Contextual Language Models for Common Ground with Visual Representations (2005.00619v5)

Published 1 May 2020 in cs.CL and cs.CV

Abstract: The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent are contextual representations of concrete nouns aligned with corresponding visual representations? We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations. Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories. Moreover, they are effective in retrieving specific instances of image patches; textual context plays an important role in this process. Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans. We hope our analyses inspire future research in understanding and improving the visual capabilities of language models.
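
The abstract describes the probing setup only at a high level: a model scores how well a text-only representation of a noun picks out the matching image patch among candidates. Below is a minimal, hypothetical sketch of one way such a probe could look, using a learned linear projection from text-embedding space into visual-embedding space and cosine-similarity scoring. The class name `VisualProbe`, the dimensions, and the scoring rule are illustrative assumptions, not the authors' implementation; the training loop (for example, a contrastive loss over matching versus non-matching patches) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualProbe(nn.Module):
    """Illustrative probe: maps a contextual text embedding into a
    visual embedding space and scores candidate image patches.
    Dimensions are placeholders, not the paper's actual settings."""
    def __init__(self, text_dim=768, visual_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_emb, patch_embs):
        # Project the (contextual) noun representation into visual space,
        # then score each candidate patch embedding by cosine similarity.
        q = F.normalize(self.proj(text_emb), dim=-1)  # (visual_dim,)
        k = F.normalize(patch_embs, dim=-1)           # (num_patches, visual_dim)
        return k @ q                                  # (num_patches,) similarity scores

# Toy usage with random embeddings: one noun representation against five
# candidate patches. After training, the matching patch should score highest.
probe = VisualProbe()
text_emb = torch.randn(768)
patch_embs = torch.randn(5, 512)
scores = probe(text_emb, patch_embs)
print(scores.argmax().item())  # index of the retrieved patch
```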

Authors (4)
  1. Gabriel Ilharco (26 papers)
  2. Rowan Zellers (25 papers)
  3. Ali Farhadi (138 papers)
  4. Hannaneh Hajishirzi (176 papers)
Citations (14)
