
RCA: Region Conditioned Adaptation for Visual Abductive Reasoning (2303.10428v5)

Published 18 Mar 2023 in cs.CV

Abstract: Visual abductive reasoning aims to infer likely explanations for visual observations. We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips a frozen CLIP model with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into visual prompts of the CLIP model separately, at fine- and coarse-grained levels. Adapters are commonly used to fine-tune CLIP for downstream tasks; we design a new attention adapter that directly steers the focus of the attention map through trainable query and key projections added to the frozen CLIP model. Finally, we train the model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of the literal description and the plausible explanation, enabling CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous state-of-the-art methods, ranking 1st on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58; higher is better). We also validate that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.
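To make the two main ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released code: an attention adapter that adds small trainable query/key projections on top of a frozen attention block, and a contrastive loss that pulls each image feature toward both its literal description ("clue") and its plausible explanation ("inference") embeddings. Class and function names, the bottleneck size, and the equal weighting of the two loss terms are illustrative assumptions.

```python
# Sketch (not the authors' code) of (1) an attention adapter with trainable
# query/key projections over a frozen backbone and (2) a dual-target contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAdapter(nn.Module):
    """Re-weights a frozen attention map via small trainable query/key projections."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Low-rank trainable projections; the frozen backbone weights stay untouched.
        self.q_adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim))
        self.k_adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim))

    def forward(self, q_frozen: torch.Tensor, k_frozen: torch.Tensor) -> torch.Tensor:
        # q_frozen, k_frozen: (batch, tokens, dim) produced by a frozen CLIP layer.
        q = q_frozen + self.q_adapter(q_frozen)  # steer queries
        k = k_frozen + self.k_adapter(k_frozen)  # steer keys
        scale = q.size(-1) ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return attn  # re-focused attention map


def dual_target_contrastive_loss(img: torch.Tensor,
                                 clue_txt: torch.Tensor,
                                 expl_txt: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each image feature toward its literal description
    and its plausible explanation simultaneously (equal weighting assumed here)."""
    img = F.normalize(img, dim=-1)
    clue_txt = F.normalize(clue_txt, dim=-1)
    expl_txt = F.normalize(expl_txt, dim=-1)
    labels = torch.arange(img.size(0), device=img.device)
    loss_clue = F.cross_entropy(img @ clue_txt.t() / temperature, labels)
    loss_expl = F.cross_entropy(img @ expl_txt.t() / temperature, labels)
    return 0.5 * (loss_clue + loss_expl)


if __name__ == "__main__":
    # Toy shapes only; real features would come from a frozen CLIP backbone.
    adapter = AttentionAdapter(dim=512)
    q = torch.randn(2, 50, 512)
    k = torch.randn(2, 50, 512)
    print(adapter(q, k).shape)  # (2, 50, 50)
    img, clue, expl = (torch.randn(4, 512) for _ in range(3))
    print(dual_target_contrastive_loss(img, clue, expl).item())
```

The residual form (frozen output plus a small trainable correction) keeps the adapter parameter-efficient while still letting gradients shift where the frozen model attends; the paper's exact adapter placement and loss formulation should be taken from the linked repository.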

Authors (3)
  1. Hao Zhang (947 papers)
  2. Yeo Keat Ee (1 paper)
  3. Basura Fernando (60 papers)
Citations (1)