kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies (2404.09447v3)
Abstract: Continual segmentation has not yet tackled the challenge of improving open-vocabulary segmentation models with training data for accurate segmentation across large, continually expanding vocabularies. We discover that traditional continual training results in severe catastrophic forgetting, failing to outperform a zero-shot segmentation baseline. We introduce a novel training-free strategy, kNN-CLIP, which augments the model with a database of instance embeddings for semantic and panoptic segmentation that achieves zero forgetting. We demonstrate that kNN-CLIP can adapt to continually growing vocabularies without the need for retraining or large memory costs. kNN-CLIP enables open-vocabulary segmentation methods to expand their vocabularies on any domain with a single pass through the data, while only storing compact embeddings. This approach minimizes both compute and memory costs. kNN-CLIP achieves state-of-the-art performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a significant step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.
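The retrieval idea described in the abstract (a database of embeddings that grows in a single pass and is queried without retraining) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `EmbeddingDatabase`, the cosine-similarity metric, the similarity-weighted vote, and the toy 2-D vectors are all assumptions chosen for illustration. In kNN-CLIP the stored vectors would be instance embeddings from a pretrained encoder, and a real system would likely use an approximate nearest-neighbour index rather than a brute-force search.

```python
import numpy as np


class EmbeddingDatabase:
    """Hypothetical append-only store of L2-normalised embeddings and labels."""

    def __init__(self, dim):
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.labels = []

    def add(self, embeddings, labels):
        # Normalise rows so that a dot product equals cosine similarity.
        embeddings = np.asarray(embeddings, dtype=np.float32)
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        embeddings = embeddings / np.clip(norms, 1e-12, None)
        # Appending never modifies earlier entries, so old classes stay retrievable.
        self.embeddings = np.vstack([self.embeddings, embeddings])
        self.labels.extend(labels)

    def predict(self, query, k=3):
        """Return the label with the largest summed similarity among the k nearest entries."""
        if not self.labels:
            raise ValueError("database is empty")
        query = np.asarray(query, dtype=np.float32)
        query = query / max(float(np.linalg.norm(query)), 1e-12)
        sims = self.embeddings @ query          # brute-force cosine similarity
        top = np.argsort(-sims)[: min(k, len(self.labels))]
        votes = {}
        for i in top:
            votes[self.labels[i]] = votes.get(self.labels[i], 0.0) + float(sims[i])
        return max(votes, key=votes.get)


# Toy usage: the vocabulary grows by appending, with no retraining step.
db = EmbeddingDatabase(dim=2)
db.add([[1.0, 0.0], [0.9, 0.1]], ["cat", "cat"])
db.add([[0.0, 1.0], [0.1, 0.9]], ["zebra", "zebra"])  # classes added later
print(db.predict([0.2, 0.8], k=3))
print(db.predict([1.0, 0.05], k=3))
```

Because the store is append-only and no model weights are updated, adding the second batch cannot change what was stored for the first, which is the sense in which a retrieval-based design avoids forgetting in this sketch.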
Authors: Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr