
Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning (2310.07510v1)

Published 11 Oct 2023 in cs.CV

Abstract: Foundation vision models are critical for mimicking how human vision recognizes a diverse and open world. While recent self-supervised learning techniques show promise for this mission, we argue that signals from labelled data are also important for common-sense recognition, and that properly chosen pre-text tasks can improve the efficiency of visual representation learning. To this end, we propose a novel pre-training framework that adopts both self-supervised and supervised visual pre-text tasks in a multi-task manner. Specifically, given an image, we take a heuristic approach to basic visual understanding by considering its intrinsic style properties, the objects it contains together with their locations and correlations, and how it appears in 3D space. However, large-scale object bounding boxes and correlations are usually hard to obtain, so we develop a hybrid method that leverages both multi-label classification and self-supervised learning. On the one hand, under multi-label supervision, the pre-trained model can explore detailed information in an image, e.g., image types, objects, and some semantic relations. On the other hand, the self-supervised tasks, namely Masked Image Modeling (MIM) and contrastive learning, help the model learn pixel details and patch correlations. Results show that our pre-trained models deliver results on par with or better than state-of-the-art (SOTA) methods on multiple visual tasks. For example, with a vanilla Swin-B backbone, we achieve 85.3% top-1 accuracy on ImageNet-1K classification, 47.9 box AP on COCO object detection with Mask R-CNN, and 50.6 mIoU on ADE-20K semantic segmentation with UperNet. This performance demonstrates the ability of our vision foundation model to serve general-purpose vision tasks.
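The abstract describes combining three objectives: multi-label classification, Masked Image Modeling, and contrastive learning. The following is a minimal numpy sketch of how such a multi-task loss might be assembled; the exact loss formulations (the paper uses an asymmetric loss for multi-label classification, for instance) and the weights `w` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def multi_label_bce(logits, targets):
    # Multi-label supervision: independent binary cross-entropy per label.
    # (The paper uses an asymmetric multi-label loss; BCE is a stand-in.)
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(targets * np.log(p + 1e-8)
                    + (1.0 - targets) * np.log(1.0 - p + 1e-8))

def mim_loss(pred_pix, true_pix, mask):
    # MIM: reconstruction error computed only on masked patches,
    # in the spirit of SimMIM-style L1 on pixels.
    diff = np.abs(pred_pix - true_pix) * mask[..., None]
    return diff.sum() / (mask.sum() * pred_pix.shape[-1] + 1e-8)

def info_nce(z1, z2, temperature=0.1):
    # Contrastive learning: InfoNCE between two views of the same batch;
    # matching (diagonal) pairs are the positives.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def heuristic_pretrain_loss(cls_logits, cls_targets,
                            pred_pix, true_pix, mask,
                            z1, z2, w=(1.0, 1.0, 1.0)):
    # Weighted sum of the three pre-text objectives; the weights here
    # are placeholders, not tuned values from the paper.
    return (w[0] * multi_label_bce(cls_logits, cls_targets)
            + w[1] * mim_loss(pred_pix, true_pix, mask)
            + w[2] * info_nce(z1, z2))
```

In practice each term would be computed from shared backbone features (e.g. a Swin encoder) with separate lightweight heads, and the scalar losses summed before a single backward pass.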

Authors (1)
  1. Zhiming Qian (1 paper)
