An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training (2306.17165v1)
Abstract: We present a model that performs multiple vision tasks and adapts efficiently to other downstream tasks. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image set annotated with labels for multiple tasks. Such multi-label datasets are rare, small, and expensive. We use heterogeneous to refer to image sets with different task labels, or to combinations of single-task datasets; few efforts have explored training on such heterogeneous data. General-purpose vision models are still dominated by single-task pretraining, and it remains unclear how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes. The challenges lie in managing the large intrinsic differences among vision tasks, including data distributions, architectures, task-specific modules, dataset scales, and sampling strategies. To address these challenges, we propose to modify and scale up mixture-of-experts (MoE) vision transformers so that they can simultaneously learn classification, detection, and segmentation on diverse mainstream vision datasets, including ImageNet, COCO, and ADE20K. Our approach achieves results comparable to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks. Owing to its emergent modularity, this general-purpose model decomposes into high-performing components that adapt efficiently to downstream tasks: it can be fine-tuned with fewer trainable parameters, fewer model parameters, and less computation. Its modularity also allows easy expansion in continual-learning-without-forgetting scenarios. Finally, these capabilities can be controlled and combined to meet the varied demands of downstream tasks.
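The abstract's central mechanism is sparse expert routing: each token is sent through only a few expert sub-networks chosen by a learned gate, which is what lets the trained model decompose into task-specific components. The sketch below is a minimal, illustrative top-k gated MoE feed-forward layer in NumPy; all sizes, names, and the single-linear gate are assumptions for demonstration, not the paper's actual architecture or configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseMoELayer:
    """Toy sparsely-gated mixture-of-experts feed-forward layer.

    Each expert is a small two-layer MLP; a learned gate routes every
    token to its top-k experts. Hypothetical sizes, for illustration only.
    """

    def __init__(self, dim, hidden, num_experts, top_k=2):
        self.top_k = top_k
        # Expert weights: per-expert MLP going dim -> hidden -> dim.
        self.w1 = rng.standard_normal((num_experts, dim, hidden)) * 0.02
        self.w2 = rng.standard_normal((num_experts, hidden, dim)) * 0.02
        # Gating network: one linear map producing one logit per expert.
        self.gate = rng.standard_normal((dim, num_experts)) * 0.02

    def __call__(self, x):
        # x: (tokens, dim) -> expert logits: (tokens, num_experts)
        logits = x @ self.gate
        # Indices of the top-k experts for each token (highest logits).
        topk = np.argsort(logits, axis=-1)[:, -self.top_k:]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            sel = topk[t]
            # Softmax over the selected experts only (renormalized gate).
            w = np.exp(logits[t, sel] - logits[t, sel].max())
            w /= w.sum()
            for weight, e in zip(w, sel):
                h = np.maximum(x[t] @ self.w1[e], 0.0)  # ReLU hidden layer
                out[t] += weight * (h @ self.w2[e])
        return out

layer = SparseMoELayer(dim=16, hidden=32, num_experts=8, top_k=2)
tokens = rng.standard_normal((4, 16))
y = layer(tokens)
print(y.shape)  # (4, 16)
```

Because each token activates only `top_k` of the experts, experts can specialize per task during heterogeneous training, and a downstream deployment can keep just the experts a task actually routes to, which is the intuition behind fine-tuning with fewer trainable parameters and less computation.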
- Zitian Chen
- Mingyu Ding
- Yikang Shen
- Wei Zhan
- Masayoshi Tomizuka
- Erik Learned-Miller
- Chuang Gan