SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference (2312.01597v4)

Published 4 Dec 2023 in cs.CV

Abstract: Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module and reuse its pretrained projection matrices of query, key, and value, leading to a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments show the advantage of CSA: we obtain a 38.2% average zero-shot mIoU across eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing SoTA's 33.9% and the vanilla CLIP's 14.1%.

Enhancing CLIP for Zero-Shot Semantic Segmentation Via Correlative Self-Attention

Introduction to the Research Gap

Contrastive Language-Image Pretraining (CLIP) models achieve strong zero-shot classification results by comparing a single image-level representation with target text embeddings, which works well for whole-image classification. However, when applied to the more granular task of semantic segmentation, CLIP struggles to localize visual features at the pixel level, limiting its usefulness for dense prediction. This paper introduces a Correlative Self-Attention (CSA) mechanism to address this limitation, extending CLIP to semantic segmentation with minimal modification to the pretrained model and no additional training.
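To make the gap concrete, the following is a minimal PyTorch-style sketch, using dummy tensors rather than the paper's code, that contrasts CLIP's image-level zero-shot classification with the per-patch comparison that dense prediction requires; all shapes and variable names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: B images, N patch tokens, C class prompts, d embedding dim.
B, N, C, d = 2, 196, 21, 512

image_embed = F.normalize(torch.randn(B, d), dim=-1)      # global image embedding
patch_embeds = F.normalize(torch.randn(B, N, d), dim=-1)  # per-patch visual tokens
text_embeds = F.normalize(torch.randn(C, d), dim=-1)      # one embedding per class prompt

# Image-level zero-shot classification: one similarity score per class.
cls_logits = image_embed @ text_embeds.t()                # (B, C)
pred_class = cls_logits.argmax(dim=-1)

# Dense zero-shot inference: compare every patch token with every class prompt.
seg_logits = patch_embeds @ text_embeds.t()               # (B, N, C)
pred_mask = seg_logits.argmax(dim=-1)                     # (B, N) patch-level labels
```

The dense route only yields sensible masks if the patch tokens are spatially discriminative, which is exactly where vanilla CLIP falls short and where this paper intervenes.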

Revisiting Self-Attention in Vision-Language Models

The core issue with applying CLIP to semantic segmentation stems from the spatial invariance of its learned visual features. Dense prediction instead calls for spatially covariant features: local representations should vary with the semantic content at each position in the image. The authors trace this mismatch to CLIP's self-attention and propose replacing it, in the vision encoder's last layer, with Correlative Self-Attention (CSA). Under CSA, each visual token attends to the tokens that are semantically similar to it across the image, yielding features that localize much more accurately.
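As a concrete illustration of the mechanism, below is a minimal PyTorch sketch of correlative attention applied to the last-layer visual tokens. It assumes the CSA scores are built from query-query and key-key correlations rather than the usual query-key product, with the pretrained projections reused as the abstract describes; multi-head handling, scaling choices, and layer wiring are simplified, so the authors' released code should be treated as the reference.

```python
import math
import torch
import torch.nn.functional as F


def correlative_self_attention(x, w_q, w_k, w_v, w_out):
    """Sketch of a CSA block applied to the last-layer visual tokens.

    x: (B, N, D) patch tokens entering the last transformer block.
    w_q, w_k, w_v, w_out: pretrained projection weights reused from CLIP,
    each assumed to be of shape (D, D). This is an illustrative sketch,
    not the authors' reference implementation.
    """
    q = x @ w_q                      # (B, N, D)
    k = x @ w_k
    v = x @ w_v
    tau = math.sqrt(q.shape[-1])

    # Correlative scores: each token attends to tokens with similar query
    # (and key) features, instead of the usual q @ k^T pairing.
    attn = F.softmax(q @ q.transpose(-2, -1) / tau, dim=-1) \
         + F.softmax(k @ k.transpose(-2, -1) / tau, dim=-1)

    return (attn @ v) @ w_out        # (B, N, D) spatially covariant features
```

The resulting tokens can then be projected into CLIP's joint embedding space and compared with class-prompt text embeddings patch by patch, as in the earlier classification-versus-segmentation sketch.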

Empirical Validation of CSA

The proposed SCLIP model, incorporating CSA, was evaluated across eight semantic segmentation benchmarks. With an average zero-shot mIoU of 38.2%, SCLIP significantly outperforms both the previous state of the art (33.9%) and vanilla CLIP (14.1%). This improvement highlights how minimal, focused modifications can steer a pretrained model like CLIP toward more specialized tasks without any retraining.

Theoretical and Practical Implications

SCLIP carries both theoretical and practical implications. Theoretically, it demonstrates a pathway for adapting large, general-purpose models to more specific tasks through targeted architectural changes rather than dataset-specific retraining. Practically, the findings point to a more efficient use of existing resources: pretrained models such as CLIP can serve a broader range of applications, including dense prediction, with little performance trade-off. Furthermore, CSA's reported insensitivity to the particular projection matrices used indicates a robustness that could simplify similar adaptations for other tasks or models.
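As a rough illustration of that projection-insensitivity claim, the snippet below shows how one might probe it by building correlative attention maps from different projection choices (identity versus random); this is a hypothetical check on dummy tokens, not the paper's ablation protocol, and the `csa` helper is an assumption that mirrors the sketch above.

```python
import math
import torch
import torch.nn.functional as F

D, N = 768, 196                                  # hypothetical ViT token dimensions
x = F.normalize(torch.randn(1, N, D), dim=-1)    # dummy patch tokens


def csa(tokens, w_r):
    """Correlative attention scores from a single projection w_r (illustrative)."""
    r = tokens @ w_r
    return F.softmax(r @ r.transpose(-2, -1) / math.sqrt(D), dim=-1)


# Swap in different projection matrices and compare the attention maps they induce.
attn_identity = csa(x, torch.eye(D))
attn_random = csa(x, torch.randn(D, D) / math.sqrt(D))

# Here we only measure how far the two attention maps differ on dummy inputs;
# the meaningful comparison in the paper is downstream segmentation quality
# with real CLIP tokens, which is reported to change only mildly.
print((attn_identity - attn_random).abs().mean().item())
```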

Future Directions in Zero-Shot Learning

While SCLIP represents a significant step forward, it also opens avenues for future research. Exploring additional architectural modifications that further enhance the zero-shot capabilities of CLIP and similar models is one promising direction. Investigating whether such approaches scale to a wider range of dense prediction tasks beyond semantic segmentation could extend the utility of pretrained models even further. The findings also invite a closer study of how the language and vision components interact in zero-shot settings, which may suggest new ways to strengthen their interplay.

Conclusion

By introducing the novel Correlative Self-Attention mechanism, this research significantly enhances the zero-shot semantic segmentation performance of the CLIP model. The methodology and results underscore the viability of adapting large, general-purpose models for specific tasks through targeted modifications, expanding their applicability and efficiency. As AI research continues to evolve, such approaches open new horizons for leveraging pre-trained models across a wider array of tasks, pushing the boundaries of zero-shot learning and model generalizability.

Authors (3)
  1. Feng Wang
  2. Jieru Mei
  3. Alan Yuille
Citations (35)