kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies (2404.09447v3)

Published 15 Apr 2024 in cs.CV and cs.LG

Abstract: Continual segmentation has not yet tackled the challenge of improving open-vocabulary segmentation models with training data for accurate segmentation across large, continually expanding vocabularies. We discover that traditional continual training results in severe catastrophic forgetting, failing to outperform a zero-shot segmentation baseline. We introduce a novel training-free strategy, kNN-CLIP, which augments the model with a database of instance embeddings for semantic and panoptic segmentation that achieves zero forgetting. We demonstrate that kNN-CLIP can adapt to continually growing vocabularies without the need for retraining or large memory costs. kNN-CLIP enables open-vocabulary segmentation methods to expand their vocabularies on any domain with a single pass through the data, while only storing compact embeddings. This approach minimizes both compute and memory costs. kNN-CLIP achieves state-of-the-art performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a significant step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.
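
To make the retrieval idea in the abstract concrete, below is a minimal NumPy sketch (not the authors' released code) of how a database of labelled instance embeddings can be queried with a mask/region embedding and blended with zero-shot CLIP text scores. The function names, the similarity-weighted voting, and the blending weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming L2-normalisable embeddings and integer class labels.
# All names here (build_database, knn_clip_scores, alpha) are hypothetical.
import numpy as np

def build_database(instance_embeddings: np.ndarray, labels: np.ndarray):
    """Store L2-normalised instance embeddings with their integer class labels."""
    db = instance_embeddings / np.linalg.norm(instance_embeddings, axis=1, keepdims=True)
    return db, labels

def knn_clip_scores(query_emb, text_embs, db, db_labels, num_classes, k=16, alpha=0.5):
    """Blend zero-shot CLIP text similarity with votes from the k nearest stored instances."""
    q = query_emb / np.linalg.norm(query_emb)

    # Zero-shot branch: cosine similarity to the (normalised) class text embeddings.
    zero_shot = text_embs @ q                      # shape: (num_classes,)

    # Retrieval branch: similarity-weighted votes from the k nearest database entries.
    sims = db @ q                                  # shape: (num_entries,)
    top = np.argsort(-sims)[:k]
    knn_scores = np.zeros(num_classes)
    for idx in top:
        knn_scores[db_labels[idx]] += sims[idx]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Interpolate the two branches; growing the vocabulary only means appending
    # new rows to `db` / `db_labels`, with no retraining of the segmentation model.
    return alpha * softmax(zero_shot) + (1 - alpha) * softmax(knn_scores)
```

In this reading, expanding to new classes is a single pass over new data to append compact embeddings to the database, which matches the training-free, zero-forgetting property the abstract describes.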

Authors (8)
  1. Zhongrui Gui (2 papers)
  2. Shuyang Sun (25 papers)
  3. Runjia Li (16 papers)
  4. Jianhao Yuan (10 papers)
  5. Zhaochong An (11 papers)
  6. Karsten Roth (36 papers)
  7. Ameya Prabhu (37 papers)
  8. Philip Torr (172 papers)
Citations (3)
