
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation (2407.08268v1)

Published 11 Jul 2024 in cs.CV

Abstract: CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods. The code is made publicly available at: https://github.com/leaves162/CLIPtrase.
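The core mechanism the abstract describes is recalibrating the self-correlation among CLIP patch features so that each patch attends to semantically similar patches rather than being dominated by a few "global" ones. The sketch below is a minimal, hypothetical illustration of that idea with NumPy (cosine self-correlation followed by a softmax-weighted feature pooling); it is not the authors' exact CLIPtrase implementation, and the function names and temperature parameter are assumptions.

```python
import numpy as np

def patch_self_correlation(patch_feats):
    """Cosine self-correlation among patch features.

    patch_feats: (N, D) array of patch embeddings (e.g. from a CLIP
    vision encoder). Returns an (N, N) similarity matrix.
    """
    norms = np.linalg.norm(patch_feats, axis=1, keepdims=True)
    unit = patch_feats / norms
    return unit @ unit.T

def recalibrate(patch_feats, temperature=0.1):
    """Reweight each patch feature by a softmax over its own
    self-correlation row, pooling information from similar patches.
    A conceptual sketch only; the temperature value is illustrative.
    """
    corr = patch_self_correlation(patch_feats)
    weights = np.exp(corr / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ patch_feats

# Toy example: 4 patches with 8-dimensional features.
feats = np.random.default_rng(0).normal(size=(4, 8))
out = recalibrate(feats)
print(out.shape)  # (4, 8)
```

In a full training-free pipeline, such recalibrated patch features would then be compared against CLIP text embeddings of class names to produce per-patch segmentation labels.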

Authors (4)
  1. Tong Shao (3 papers)
  2. Zhuotao Tian (38 papers)
  3. Hang Zhao (156 papers)
  4. Jingyong Su (16 papers)
Citations (9)