
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization (2205.11686v2)

Published 24 May 2022 in cs.CL and cs.CV

Abstract: Combining the visual modality with pretrained LLMs has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e. conditioning on both text and images? Are multimodal models simply visually adapted LLMs, or do they reason jointly over both modalities? We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) on three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in e-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of LLMs, do not consistently improve self-rationalization in multimodal tasks. We find that no single model type works universally best across tasks, datasets, and finetuning data sizes. Our findings motivate the need for novel general backbone approaches that move text generation from images and text beyond image captioning.

Authors (6)
  1. Shruti Palaskar (14 papers)
  2. Akshita Bhagia (12 papers)
  3. Yonatan Bisk (91 papers)
  4. Florian Metze (79 papers)
  5. Alan W Black (83 papers)
  6. Ana Marasović (27 papers)
Citations (3)