Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense (2211.05895v2)

Published 10 Nov 2022 in cs.CV, cs.AI, cs.CL, cs.HC, and cs.MM

Abstract: Visual commonsense understanding requires Vision-Language (VL) models not only to understand images and text but also to cross-reference between them to fully integrate and comprehend the described visual scene. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, due to limited evaluation data resources, it is unclear whether these models truly understand the visual scene and the underlying commonsense knowledge. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, the text, and related knowledge. We then take a step further and show that training with the ME data boosts the model's performance on standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not the reverse; (2) visual information is generally underutilized compared with text.
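As a rough illustration of what automatic QA-pair generation for such an evaluation pipeline might look like, the sketch below produces templated questions at two semantic levels (low-level object/attribute recognition vs. higher-level relations) from scene-graph-style annotations. The `Annotation` structure, the templates, and `make_qa_pairs` are hypothetical simplifications for illustration, not the authors' actual ME implementation.

```python
# Hypothetical sketch of templated QA generation from scene-graph-style
# annotations. All names and templates here are illustrative assumptions,
# not the ME pipeline's real code.

from dataclasses import dataclass

@dataclass
class Annotation:
    objects: list[str]                     # e.g. ["person", "horse"]
    attributes: dict[str, str]             # e.g. {"horse": "brown"}
    relations: list[tuple[str, str, str]]  # e.g. [("person", "riding", "horse")]

def make_qa_pairs(ann: Annotation) -> list[tuple[str, str]]:
    qa = []
    # Low-level probes: object presence and attribute recognition.
    for obj in ann.objects:
        qa.append((f"Is there a {obj} in the image?", "yes"))
    for obj, attr in ann.attributes.items():
        qa.append((f"Is the {obj} {attr}?", "yes"))
    # Higher-level probes: relations between entities.
    for subj, pred, obj in ann.relations:
        qa.append((f"What is the {subj} doing to the {obj}?", pred))
    return qa

if __name__ == "__main__":
    ann = Annotation(
        objects=["person", "horse"],
        attributes={"horse": "brown"},
        relations=[("person", "riding", "horse")],
    )
    for q, a in make_qa_pairs(ann):
        print(f"Q: {q}  A: {a}")
```

Templated generation of this kind is one plausible way to probe the low-level vs. high-level distinction the abstract draws, since each question is tied to a known semantic level of the annotation it was derived from.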

Authors (6)
  1. Zhecan Wang (18 papers)
  2. Haoxuan You (33 papers)
  3. Yicheng He (8 papers)
  4. Wenhao Li (135 papers)
  5. Kai-Wei Chang (292 papers)
  6. Shih-Fu Chang (131 papers)
Citations (5)