Insightful Overview of "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP"
This paper introduces FC-CLIP, a single-stage framework for open-vocabulary segmentation built on a frozen convolutional CLIP backbone. The authors aim to address the inefficiencies of existing two-stage approaches by integrating mask generation and classification within a unified architecture. This design preserves the image-text alignment that gives CLIP its open-vocabulary classification ability, while also improving accuracy and computational efficiency across diverse benchmarks.
The core contribution of this work is the consolidation of previous pipelines into a single-stage model that shares one frozen convolutional backbone. By adopting a convolutional CLIP encoder (specifically ConvNeXt-Large CLIP), the model generalizes well to input resolutions larger than those seen during pretraining, a setting in which ViT-based CLIP encoders degrade. Because mask generation and classification consume the same backbone features, FC-CLIP also avoids the redundant feature extraction of two-stage frameworks, making both training and inference more efficient.
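To make the shared-backbone idea concrete, the sketch below shows, in schematic PyTorch, how a single frozen convolutional encoder can feed both mask generation and open-vocabulary classification. It is a minimal stand-in rather than FC-CLIP's actual implementation: the tiny CNN replaces the pretrained ConvNeXt-Large CLIP image encoder, and names such as SingleStageOpenVocabSegmenter, pixel_proj, and class_proj are illustrative choices, not identifiers from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStageOpenVocabSegmenter(nn.Module):
    """Simplified sketch of a single-stage open-vocabulary segmenter.

    A frozen convolutional backbone (a stand-in for ConvNeXt-Large CLIP)
    produces one feature map that is shared by the mask generator and the
    classification head, so image features are extracted only once.
    """

    def __init__(self, feat_dim=256, embed_dim=512, num_queries=100):
        super().__init__()
        # Stand-in for the frozen convolutional CLIP image encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen: keeps the CLIP alignment intact

        # Trainable parts: per-pixel projection, object queries, and a head
        # mapping queries into the CLIP text-embedding space.
        self.pixel_proj = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)
        self.queries = nn.Embedding(num_queries, embed_dim)
        self.class_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, image, text_embeddings):
        """image: (B, 3, H, W); text_embeddings: (K, embed_dim), L2-normalized."""
        feats = self.backbone(image)                  # (B, C, h, w), frozen features
        pixels = self.pixel_proj(feats)               # (B, D, h, w)
        q = self.queries.weight.unsqueeze(0).expand(image.size(0), -1, -1)  # (B, N, D)

        # Mask logits: similarity between each query and each pixel embedding.
        mask_logits = torch.einsum("bnd,bdhw->bnhw", q, pixels)

        # In-vocabulary class logits: queries projected into CLIP text space.
        q_embed = F.normalize(self.class_proj(q), dim=-1)       # (B, N, D)
        class_logits = q_embed @ text_embeddings.t()            # (B, N, K)
        return mask_logits, class_logits


# Minimal shape check with random tensors (no pretrained weights involved).
model = SingleStageOpenVocabSegmenter()
image = torch.randn(2, 3, 128, 128)
text = F.normalize(torch.randn(10, 512), dim=-1)  # 10 hypothetical class prompts
masks, classes = model(image, text)
print(masks.shape, classes.shape)  # (2, 100, 32, 32) and (2, 100, 10)
```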
Numerical results substantiate the robustness of the approach, with FC-CLIP outperforming existing methods on multiple open-vocabulary panoptic and semantic segmentation benchmarks, including COCO, ADE20K, Mapillary Vistas, and Cityscapes. The gain is particularly notable on ADE20K, where FC-CLIP reaches 26.8 PQ, an improvement of +4.2 PQ over the previous state of the art. Efficiency also improves markedly: training is 7.5× faster and inference 6.6× faster than ODISE, with 5.9× fewer parameters.
The paper provides insights into how a shared frozen backbone behaves in segmentation, emphasizing that freezing the CLIP backbone retains its image-text alignment and avoids the degradation observed when fine-tuning on additional datasets or tasks. An ensemble of in-vocabulary and out-of-vocabulary classifiers, combining the learned task-specific classifier with CLIP's zero-shot predictions, further underlines the flexibility and robustness of the model.
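The classifier ensemble can likewise be illustrated with a short sketch. Assuming per-mask probabilities p_in from the learned in-vocabulary classifier and p_out from the frozen CLIP path (e.g., obtained by mask-pooling frozen CLIP features), a geometric interpolation that leans more heavily on the frozen CLIP prediction for novel categories captures the idea; the weights alpha and beta below are illustrative values, not necessarily the paper's exact settings.

```python
import torch

def ensemble_class_probs(p_in, p_out, seen_mask, alpha=0.4, beta=0.8):
    """Geometric ensemble of in-vocabulary and out-of-vocabulary classifiers.

    p_in:  (N, K) probabilities from the learned (in-vocabulary) classifier.
    p_out: (N, K) probabilities from the frozen CLIP (out-of-vocabulary) path.
    seen_mask: (K,) boolean, True for categories present in the training set.
    alpha / beta: ensemble weights for seen / novel categories (illustrative).
    """
    # Geometric interpolation p_in^(1-w) * p_out^w, with a larger weight on
    # the frozen CLIP prediction for categories never seen during training.
    w = torch.where(seen_mask, torch.tensor(alpha), torch.tensor(beta))  # (K,)
    fused = p_in.clamp_min(1e-8) ** (1.0 - w) * p_out.clamp_min(1e-8) ** w
    return fused / fused.sum(dim=-1, keepdim=True)  # renormalize over classes


# Example: 100 mask proposals over 10 categories, the first 6 seen in training.
p_in = torch.softmax(torch.randn(100, 10), dim=-1)
p_out = torch.softmax(torch.randn(100, 10), dim=-1)
seen = torch.tensor([True] * 6 + [False] * 4)
print(ensemble_class_probs(p_in, p_out, seen).shape)  # torch.Size([100, 10])
```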
Practically, the efficiency gains make FC-CLIP well suited to real-time applications that require dynamic segmentation across an expansive set of categories, such as autonomous vehicles or augmented reality systems, where models must be both efficient and adaptable. Conceptually, the work charts a path for future single-stage designs that maximize efficiency without sacrificing vocabulary coverage.
By comparing different mask-generation frameworks (e.g., Mask2Former, kMaX-DeepLab), the paper also demonstrates that convolution-based CLIP models are broadly applicable to segmentation. These findings may encourage future studies to explore improved fusion techniques or alternative backbone architectures for open-vocabulary segmentation.
Overall, the paper is an efficient realignment of existing segmentation paradigms, showing how targeted architectural choices can deliver substantial gains in both accuracy and computation. These insights, coupled with strong empirical results, suggest that FC-CLIP can serve as a foundation for ongoing work on open-vocabulary segmentation systems that combine broad recognition capabilities with practical efficiency.