Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation (2307.03407v1)
Abstract: We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. The model learns to perform classification and segmentation without pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contain ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, in particular when few or no pixel-level labels are available.
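The idea of turning token correlations into pixel-level pseudo-labels can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine`, `pseudo_mask`), the mean aggregation over support tokens, the min-max normalization, and the 0.5 threshold are all assumptions made for illustration, and toy 2-D vectors stand in for real ViT patch tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pseudo_mask(query_tokens, support_tokens, grid_hw, threshold=0.5):
    """Build a binary pseudo-label grid from token correlations.

    Hypothetical helper: each query patch token is scored by its mean
    cosine similarity to the support tokens, the scores are min-max
    normalized, and patches at or above `threshold` are marked foreground.
    """
    h, w = grid_hw
    assert len(query_tokens) == h * w, "one token per patch is expected"

    # Attention-like score: mean correlation of a query patch to the support.
    scores = [
        sum(cosine(q, s) for s in support_tokens) / len(support_tokens)
        for q in query_tokens
    ]

    # Min-max normalize so the threshold is relative to this image.
    lo, hi = min(scores), max(scores)
    span = hi - lo
    normed = [(x - lo) / span if span > 0 else 0.0 for x in scores]

    # Threshold and reshape the flat token list back into the patch grid.
    flat = [1 if x >= threshold else 0 for x in normed]
    return [flat[r * w:(r + 1) * w] for r in range(h)]


# Toy example: a 2x2 patch grid whose top row resembles the support tokens.
query = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support = [[1.0, 0.0], [0.8, 0.2]]
mask = pseudo_mask(query, support, grid_hw=(2, 2))
print(mask)
```

In a weakly-supervised setting, a grid like `mask` would stand in for the missing ground-truth segmentation when training the segmentation head. A real pipeline would take the tokens from a self-supervised ViT and operate on batched tensors.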