Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning (2310.07510v1)
Abstract: To mimic the way human vision recognizes the diverse and open world, foundation vision models are critical. While recent self-supervised learning techniques show promise for this mission, we argue that signals from labelled data are also important for common-sense recognition, and that properly chosen pre-text tasks can improve the efficiency of visual representation learning. To this end, we propose a novel pre-training framework that adopts both self-supervised and supervised visual pre-text tasks in a multi-task manner. Specifically, given an image, we take a heuristic approach to basic visual understanding by considering its intrinsic style properties, the objects it contains together with their locations and correlations, and how it looks in 3D space. However, large-scale object bounding boxes and correlations are usually hard to obtain. As an alternative, we develop a hybrid method that leverages both multi-label classification and self-supervised learning. On the one hand, under multi-label supervision, the pre-trained model can explore detailed information about an image, e.g., image types, objects, and part of the semantic relations. On the other hand, self-supervised learning tasks, namely Masked Image Modeling (MIM) and contrastive learning, help the model learn pixel details and patch correlations. Results show that our pre-trained models deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection with Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation with UperNet. This performance demonstrates the ability of our vision foundation model to serve general-purpose vision tasks.
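The abstract describes combining a supervised multi-label objective with two self-supervised objectives (MIM and contrastive learning) into one multi-task training signal. The following is a minimal NumPy sketch of such a combined loss, not the authors' implementation: the function names, the use of plain binary cross-entropy for the multi-label head (the paper's references suggest an asymmetric loss may be used instead), the L2 reconstruction for MIM, InfoNCE for the contrastive term, and the loss weights are all illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's code): a multi-task
# objective summing three pre-text losses named in the abstract.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_label_bce(logits, targets):
    """Binary cross-entropy over independent labels (multi-label head)."""
    p = sigmoid(logits)
    eps = 1e-8
    return float(-np.mean(targets * np.log(p + eps)
                          + (1 - targets) * np.log(1 - p + eps)))

def mim_loss(pred_patches, true_patches, mask):
    """L2 reconstruction computed on masked patches only, as in typical
    masked-image-modeling setups; mask is 1 for masked patches."""
    diff = (pred_patches - true_patches) ** 2
    return float(np.sum(diff * mask[:, None]) / max(mask.sum(), 1.0))

def info_nce(z1, z2, tau=0.1):
    """InfoNCE contrastive loss between two views; rows are samples,
    matching rows (the diagonal) are the positive pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # cosine similarities / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def total_loss(logits, targets, pred_p, true_p, mask, z1, z2,
               w_cls=1.0, w_mim=1.0, w_con=1.0):
    """Weighted multi-task sum; the weights w_* are assumed, not reported."""
    return (w_cls * multi_label_bce(logits, targets)
            + w_mim * mim_loss(pred_p, true_p, mask)
            + w_con * info_nce(z1, z2))
```

In practice the three heads would share the backbone (e.g., a Swin transformer) and the weighted losses would be summed before a single backward pass; the sketch only shows the scalar objective.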
- Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
- A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607. PMLR, 2020.
- Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
- Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
- M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 26:2292–2300, 2013.
- Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
- Multi-label image recognition with joint class-aware map disentangling and label correlation embedding. In ICME, pages 622–627. IEEE, 2019.
- Learning semantic-specific graph representation for multi-label image recognition. In ICCV, pages 522–531, 2019.
- An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.
- Randaugment: Practical automated data augmentation with a reduced search space. In CVPRW, pages 702–703, 2020.
- Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Peco: Perceptual codebook for bert pre-training of vision transformers. arXiv preprint arXiv:2111.12710, 2021.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
- Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740, 2021.
- Do self-supervised and supervised methods learn similar visual representations? arXiv preprint arXiv:2110.00528, 2021.
- Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
- Mask r-cnn. In CVPR, pages 2961–2969, 2017.
- Openimages: A public dataset for large-scale multi-label and multi-class image classification. https://github.com/openimages, 2(3):18, 2017.
- Panoptic feature pyramid networks. In CVPR, pages 6399–6408, 2019.
- Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
- I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Swin transformer v2: Scaling up capacity and resolution. arXiv preprint arXiv:2111.09883, 2021.
- Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
- Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
- Align before fuse: Vision and language representation learning with momentum distillation. NeurIPS, 34, 2021.
- Query2label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834, 2021.
- Multitask learning over graphs: An approach for distributed, streaming machine learning. IEEE Signal Processing Magazine, 37(3):14–25, 2020.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
- Asymmetric loss for multi-label classification. In ICCV, pages 82–91, 2021.
- Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
- Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843–852, 2017.
- Tencent ml-images: A large-scale multi-label image database for visual representation learning. IEEE Access, 7, 2019.
- Distribution-balanced loss for multi-label classification in long-tailed datasets. In ECCV, pages 162–178. Springer, 2020.
- Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
- Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Attention-driven dynamic graph convolutional network for multi-label image recognition. In ECCV, pages 649–665. Springer, 2020.
- Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
Author: Zhiming Qian