
GLID: Pre-training a Generalist Encoder-Decoder Vision Model (2404.07603v1)

Published 11 Apr 2024 in cs.CV

Abstract: This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches such as Masked Autoencoder have shown success in transfer learning, task-specific sub-architectures must still be appended for different downstream tasks, and these components cannot benefit from large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, both the pre-training pretext task and the downstream tasks are modeled as "query-to-answer" problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID keeps the pre-trained encoder-decoder and queries, replacing only the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.
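The head-swap scheme described in the abstract can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the paper's actual architecture or dimensions: the class name, layer counts, and `adapt_for_task` helper are assumptions. It shows the key idea that the encoder, decoder, and learnable queries are shared across pre-training and fine-tuning, while only the topmost linear layer is replaced per downstream task.

```python
import torch
import torch.nn as nn

class GLIDSketch(nn.Module):
    """Toy 'query-to-answer' encoder-decoder in the spirit of GLID."""

    def __init__(self, dim=64, num_queries=16, out_dim=32):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        # Task-agnostic learnable queries, kept across pre-training and fine-tuning.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Topmost linear layer: the only part swapped for each downstream task.
        self.head = nn.Linear(dim, out_dim)

    def forward(self, tokens):
        # tokens: (batch, num_patches, dim) patch embeddings
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        answers = self.decoder(q, memory)          # queries attend to encoded image
        return self.head(answers)                  # (batch, num_queries, out_dim)

def adapt_for_task(model, task_out_dim):
    """Fine-tuning keeps encoder, decoder, and queries; only the head is replaced."""
    model.head = nn.Linear(model.head.in_features, task_out_dim)
    return model
```

Under this sketch, switching from the pre-training pretext task to, say, a detection-style task means calling `adapt_for_task` with the new output dimensionality and fine-tuning end to end, which keeps the pretrain-finetune architecture gap to a single linear layer.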

References (43)
  1. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv:2202.03555, 2022.
  2. BEiT: BERT pre-training of image transformers. In ICLR, 2021.
  3. AdaBins: Depth estimation using adaptive bins. In CVPR, 2021.
  4. Denoising pretraining for semantic segmentation. In CVPR, 2022.
  5. End-to-end object detection with transformers. In ECCV, 2020.
  6. Pix2Seq: A language modeling framework for object detection. arXiv:2109.10852, 2021.
  7. A unified sequence interface for vision tasks. arXiv:2206.07669, 2022.
  8. Context autoencoder for self-supervised representation learning. arXiv:2202.03026, 2022.
  9. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.
  10. UP-DETR: Unsupervised pre-training for object detection with transformers. In CVPR, 2021.
  11. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  12. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  13. PeCo: Perceptual codebook for BERT pre-training of vision transformers. arXiv:2111.12710, 2021.
  14. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
  15. EVA: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.
  16. Ross Girshick. Fast R-CNN. In ICCV, 2015.
  17. Mask R-CNN. In ICCV, 2017.
  18. Masked autoencoders are scalable vision learners. In CVPR, 2021.
  19. UniT: Multimodal multitask learning with a unified transformer. In ICCV, 2021.
  20. OneFormer: One transformer to rule universal image segmentation. arXiv:2211.06220, 2022.
  21. DETRs with hybrid matching. arXiv:2207.13080, 2022.
  22. Panoptic segmentation. In CVPR, 2019.
  23. UViM: A unified modeling approach for vision with learned guiding codes. arXiv:2205.10337, 2022.
  24. Efficient self-supervised vision transformers for representation learning. In ICLR, 2022.
  25. BinsFormer: Revisiting adaptive bins for monocular depth estimation. arXiv:2204.00987, 2022.
  26. Microsoft COCO: Common objects in context. In ECCV, 2014.
  27. Feature pyramid networks for object detection. In CVPR, 2017.
  28. MixMIM: Mixed and masked image modeling for efficient visual representation learning. arXiv:2205.13137, 2022.
  29. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  30. Decoupled weight decay regularization. In ICLR, 2017.
  31. Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv:2206.08916, 2022.
  32. Unsupervised learning of dense visual representations. In NeurIPS, 2020.
  33. Jason Tyler Rolfe. Discrete variational autoencoders. In ICLR, 2016.
  34. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  35. Attention is all you need. In NeurIPS, 2017.
  36. Dense contrastive learning for self-supervised visual pre-training. In CVPR, 2021.
  37. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023.
  38. SimMIM: A simple framework for masked image modeling. In CVPR, 2021.
  39. ViTPose: Simple vision transformer baselines for human pose estimation. arXiv:2204.12484, 2022.
  40. Scene parsing through ADE20K dataset. In CVPR, 2017.
  41. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2021.
  42. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv:2010.04159, 2020.
  43. Generalized decoding for pixel, image, and language. arXiv:2212.11270, 2022.