Towards Models that Can See and Read (2301.07389v2)

Published 18 Jan 2023 in cs.CV and cs.LG

Abstract: Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
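The abstract's core mechanism, treating scene-text (OCR output) as an additional modality and fusing it into a pretrained encoder-decoder through designated modules, can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption, not the paper's actual UniTNT implementation: the `SceneTextFusion` class name, the cross-attention fusion design, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SceneTextFusion(nn.Module):
    """Hypothetical sketch: fuse scene-text (OCR) features into a
    pretrained encoder's visual features via cross-attention. The
    paper's designated fusion modules are not specified here."""

    def __init__(self, d_model: int = 768, n_heads: int = 8, ocr_vocab: int = 30522):
        super().__init__()
        # Embed OCR token ids as an additional input modality.
        self.ocr_embed = nn.Embedding(ocr_vocab, d_model)
        # Visual features (queries) attend to scene-text features (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_feats: torch.Tensor, ocr_tokens: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_patches, d_model) from a pretrained encoder
        # ocr_tokens:   (batch, n_ocr) integer ids from an off-the-shelf OCR system
        ocr_feats = self.ocr_embed(ocr_tokens)
        fused, _ = self.cross_attn(query=visual_feats, key=ocr_feats, value=ocr_feats)
        # Residual connection preserves the pretrained features when OCR adds little.
        return self.norm(visual_feats + fused)

# Minimal usage example with random tensors.
if __name__ == "__main__":
    fusion = SceneTextFusion()
    visual = torch.randn(2, 196, 768)        # e.g. ViT patch features
    ocr = torch.randint(0, 30522, (2, 32))   # OCR token ids
    print(fusion(visual, ocr).shape)         # torch.Size([2, 196, 768])
```

The output keeps the shape of the original visual features, so a module like this could in principle be slotted between an existing encoder and decoder without changing either, which is in the spirit of the paper's claim that UniTNT grants existing architectures scene-text understanding.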

Authors (6)
  1. Roy Ganz
  2. Oren Nuriel
  3. Aviad Aberdam
  4. Yair Kittenplon
  5. Shai Mazor
  6. Ron Litman
Citations (12)