Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? (2210.12079v1)

Published 21 Oct 2022 in cs.CL and cs.CV

Abstract: Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much previous work has focused on studying their ability to learn meaning at the word level, their ability to track syntactic dependencies between words has received less attention. We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer quantity) of pretraining data is essential. Additionally, the best performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
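The controlled setup described in the abstract amounts to a forced choice between a caption that matches the image and one with a minimally different predicate-noun combination. The sketch below illustrates how such an evaluation can be scored with an off-the-shelf image-text matching model; the model (CLIP), image path, and caption pair are illustrative placeholders, not the paper's actual benchmark or the specific vision-and-language Transformers it evaluates.

```python
# Minimal sketch of a forced-choice image-text matching probe.
# Assumption: CLIP stands in for a generic image-text model; the paper
# evaluates other vision-and-language Transformers on its own dataset.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = [
    "A woman is eating an apple.",   # matching predicate-noun dependency
    "A woman is reading an apple.",  # minimally different, mismatching predicate
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores, shape (1, num_captions).
scores = outputs.logits_per_image.squeeze(0)
correct = scores.argmax().item() == 0  # the probe counts a hit if the matching caption scores higher
print(scores.tolist(), "correct:", correct)
```

Accuracy over many such minimal pairs, compared against the 50% chance level, gives the kind of targeted measurement the paper argues for.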

Authors (5)
  1. Mitja Nikolaus (7 papers)
  2. Emmanuelle Salin (2 papers)
  3. Abdellah Fourtassi (3 papers)
  4. Benoit Favre (9 papers)
  5. Stephane Ayache (8 papers)
Citations (12)
