Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning (2306.13337v1)
Abstract: We propose ADCLR: Accurate and Dense Contrastive Representation Learning, a novel self-supervised learning framework for learning accurate and dense vision representations. To extract spatially sensitive information, ADCLR introduces query patches for contrasting, in addition to global contrasting. Compared with previous dense contrasting methods, ADCLR enjoys three main merits: i) it achieves both globally discriminative and spatially sensitive representations; ii) it is model-efficient (no extra parameters beyond the global contrasting baseline); and iii) it is correspondence-free and thus simpler to implement. Our approach achieves new state-of-the-art performance among contrastive methods. On classification tasks with ViT-S, ADCLR achieves 77.5% top-1 accuracy on ImageNet with linear probing, outperforming our baseline (DINO, i.e., the same model without our devised techniques as a plug-in) by 0.5%. With ViT-B, ADCLR achieves 79.8% and 84.0% accuracy on ImageNet with linear probing and fine-tuning, outperforming iBOT by 0.3% and 0.2%, respectively. For dense tasks on MS-COCO, ADCLR achieves significant improvements of 44.3% AP on object detection and 39.7% AP on instance segmentation, outperforming the previous SOTA method SelfPatch by 2.2% and 1.2%, respectively. On ADE20K, ADCLR outperforms SelfPatch by 1.0% mIoU and 1.2% mAcc on the segmentation task.
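Since the abstract leans on two properties, correspondence-freeness and adding no parameters, a brief sketch helps make concrete how query-patch contrasting could deliver both. The following is a minimal, hypothetical reading in a DINO-style teacher-student setup, not the authors' code: `TinyViT`, `adclr_style_step`, the temperatures, and all sizes are illustrative assumptions. The structural point is that the same small query crop is embedded with the backbone's existing patch-embedding layer and appended to both views' token sequences, so query embeddings from the two views refer to the same region by construction (no patch matching), and the only loss machinery needed is the baseline's self-distillation loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViT(nn.Module):
    """Minimal stand-in for a ViT backbone: patch embedding, a [CLS] token,
    a transformer encoder, and a projection head (all sizes illustrative)."""
    def __init__(self, dim=64, patch=16, out_dim=128):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, out_dim)

    def embed(self, images):                     # (B, 3, H, W) -> (B, N, dim)
        return self.patch_embed(images).flatten(2).transpose(1, 2)

    def forward(self, tokens):                   # (B, 1+N+Nq, dim) -> logits
        return self.head(self.encoder(tokens))

def self_distillation_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """DINO-style cross-entropy between the sharpened teacher distribution
    and the student distribution (teacher receives no gradients)."""
    t = F.softmax(teacher_out / tau_t, dim=-1).detach()
    s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def adclr_style_step(student, teacher, view_s, view_t, query_crop):
    """The same query crop is embedded with each network's own existing
    patch-embedding layer (no new parameters) and appended to both views'
    token sequences, so queries align across views by construction."""
    def tokens(net, view):
        p = net.embed(view)
        q = net.embed(query_crop)                # reuses patch_embed
        c = net.cls.expand(p.size(0), -1, -1)
        return torch.cat([c, p, q], dim=1), q.size(1)

    ts, nq = tokens(student, view_s)
    tt, _ = tokens(teacher, view_t)
    with torch.no_grad():
        out_t = teacher(tt)                      # teacher is frozen per step
    out_s = student(ts)
    loss_global = self_distillation_loss(out_s[:, 0], out_t[:, 0])        # [CLS]
    loss_query = self_distillation_loss(out_s[:, -nq:], out_t[:, -nq:])   # queries
    return loss_global + loss_query

# Smoke test with random data (shapes illustrative):
if __name__ == "__main__":
    s, t = TinyViT(), TinyViT()
    v1, v2 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
    crop = torch.randn(2, 3, 32, 32)
    print(adclr_style_step(s, t, v1, v2, crop))
```

In a real run the teacher would be an EMA copy of the student (as in DINO) and the query outputs would share the projection head with [CLS]; the sketch keeps only the structural point that query tokens are contrasted exactly like the global token.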
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
- VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.
- Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020.
- Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- A simple framework for contrastive learning of visual representations. In ICML, 2020.
- Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.
- Exploring simple siamese representation learning. In CVPR, 2021.
- Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
- An empirical study of training self-supervised vision transformers. In ICCV, 2021.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Corrupted image modeling for self-supervised visual pre-training. arXiv preprint arXiv:2202.03382, 2022.
- Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
- Generative adversarial nets. NeurIPS, 2014.
- Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
- Deep residual learning for image recognition. In CVPR, 2016.
- Mask R-CNN. In ICCV, 2017.
- Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
- Object discovery and representation networks. ECCV, 2022.
- Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
- Acquisition of localization confidence for accurate object detection. In ECCV, 2018.
- Expectation-maximization contrastive learning for compact video-and-language representations. In NeurIPS, 2022.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- Feature pyramid networks for object detection. In CVPR, 2017.
- Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Fixing weight decay regularization in Adam. 2018.
- DALL-E: Creating images from text. UGC Care Group I Journal, 2021.
- The iNaturalist species classification and detection dataset. In CVPR, 2018.
- Attention is all you need. NeurIPS, 2017.
- Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
- Dense contrastive learning for self-supervised visual pre-training. In CVPR, 2021.
- Exploring set similarity for dense self-supervised representation learning. In CVPR, 2022.
- Aligning pretraining for detection via object-level contrastive learning. NeurIPS, 2021.
- Unified perceptual parsing for scene understanding. In ECCV, 2018.
- Region similarity representation learning. In ICCV, 2021.
- DetCo: Unsupervised contrastive learning for object detection. In ICCV, 2021.
- Self-supervised learning with Swin Transformers. arXiv preprint arXiv:2105.04553, 2021.
- Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, 2021.
- SimMIM: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.
- Masked image modeling with denoising contrast. arXiv preprint arXiv:2205.09616, 2022.
- Patch-level representation learning for self-supervised vision transformers. In CVPR, 2022.
- Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
- Colorful image colorization. In ECCV, 2016.
- Zero-CL: Instance and feature decorrelation for negative-free symmetric contrastive learning. In ICLR, 2022.
- Align representations with base: A new approach to self-supervised learning. In CVPR, 2022.
- Scene parsing through ADE20K dataset. In CVPR, 2017.
- iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022.