DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
Overview
The paper presents DenseCLIP, a framework for transferring the knowledge gained from large-scale vision-language pre-training, specifically Contrastive Language-Image Pre-training (CLIP), to dense prediction. The key idea is to improve transfer to tasks such as semantic segmentation, object detection, and instance segmentation by converting the image-text matching problem solved in CLIP into a pixel-text matching paradigm.
Methodology
DenseCLIP is a model-agnostic framework that exploits pre-trained CLIP knowledge both implicitly and explicitly. It converts CLIP's original image-text matching objective into pixel-text score maps that guide dense prediction, and it uses context-aware prompting, in which image context is passed through a Transformer module to refine the prompts fed to CLIP's text encoder, so the class embeddings adapt to each input image. Because these language-guided priors are injected at the feature level, a wide range of dense prediction models can benefit without depending on a specific architecture.
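To make the pixel-text matching idea concrete, below is a minimal PyTorch-style sketch of how such a score map could be computed from dense image features and per-class text embeddings; the function name, tensor shapes, and temperature value are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_map(visual_feats, text_embeds, tau=0.07):
    """Cosine-similarity score map between every pixel and every class prompt.

    visual_feats: [B, C, H, W] dense features from the CLIP image encoder.
    text_embeds:  [K, C] class embeddings from the CLIP text encoder
                  (obtained with learnable prompt context).
    Returns:      [B, K, H, W] score map, usable as auxiliary segmentation
                  logits and as extra guidance channels for the decoder.
    """
    # L2-normalize so the dot product is a cosine similarity, as in CLIP.
    v = F.normalize(visual_feats, dim=1)   # normalize the channel dim per pixel
    t = F.normalize(text_embeds, dim=1)    # normalize each class embedding
    # Per-pixel similarity between every location and every class prompt,
    # scaled by a temperature (value here is an assumption).
    return torch.einsum('bchw,kc->bkhw', v, t) / tau
```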
Experimental Results
Extensive experiments on challenging benchmarks such as ADE20K demonstrate DenseCLIP's advantage over existing methods, with gains of up to +4.9% mIoU over comparable ImageNet-pre-trained models. Both ResNet and ViT backbones trained within DenseCLIP considerably outperform their ImageNet-pre-trained counterparts, showing that the framework is effective across different network architectures.
DenseCLIP also delivers clear improvements on object detection and instance segmentation, with consistent gains in AP. Instance segmentation in particular benefits from the pixel-text matching formulation, underscoring that the adaptation carries over to dense prediction tasks beyond semantic segmentation.
Implications and Future Directions
The research suggests several implications for theoretical advances and practical applications in dense prediction. First, by aligning dense prediction tasks with language supervision from vision-language models, DenseCLIP bridges the granularity gap between image-level language supervision and pixel-level prediction, proposing a new trajectory for text-based knowledge transfer. Second, the proposed post-model prompting improves performance with little additional computation, which suggests potential for widespread industrial use where model efficiency is crucial.
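To illustrate why post-model prompting is computationally light, the sketch below refines the class text embeddings with a single cross-attention over visual tokens, so the text encoder runs only once per class set; the module name, number of heads, and residual-scale initialization are assumptions made for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class PostModelPrompting(nn.Module):
    """Refine class text embeddings with image context via one cross-attention."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Small learnable scale keeps the refined prompts close to the CLIP prior.
        self.gamma = nn.Parameter(1e-4 * torch.ones(dim))

    def forward(self, text_embeds, visual_tokens):
        # text_embeds:   [B, K, C] class embeddings acting as queries
        # visual_tokens: [B, N, C] flattened image features as keys/values
        ctx, _ = self.cross_attn(text_embeds, visual_tokens, visual_tokens)
        # Residual update: context-aware prompts for pixel-text matching.
        return text_embeds + self.gamma * ctx
```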
Future studies may explore how DenseCLIP adapts to other Transformer-based and CNN architectures to verify its applicability across broader classes of visual tasks. Additionally, integrating dense supervision during pre-training may help refine pixel-level predictions and further improve downstream performance.
Conclusion
DenseCLIP extends the CLIP framework to dense prediction tasks through context-aware prompting, moving from instance-level image-text matching to pixel-level prediction. The research underlines the benefits of incorporating language priors into visual models and suggests a promising direction for future developments in vision-language model integration. The findings position DenseCLIP as a flexible paradigm for enhancing dense prediction tasks while inviting further exploration of model optimization techniques informed by language-guided cues.