ScanFormer: Referring Expression Comprehension by Iteratively Scanning (2406.18048v1)

Published 26 Jun 2024 in cs.CV

Abstract: Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our designed informativeness prediction. Furthermore, we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which can strike a balance between accuracy and efficiency.

View on arXiv

Authors (4)

Wei Su (121 papers)
Peihan Miao (7 papers)
Huanzhang Dou (16 papers)
Xi Li (198 papers)

Citations (4)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Tweets

https://twitter.com/CSVisionPapers/status/1806311043888447704

ScanFormer: Referring Expression Comprehension by Iteratively Scanning (2406.18048v1)

Summary

Related Papers

Tweets