Merging Vision Transformers from Different Tasks and Domains (2312.16240v1)
Abstract: This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into a single unified model that still performs well on each task or domain. Previous model merging work has focused on either CNNs or NLP models, leaving ViT merging unexplored. To fill this gap, we first find that existing model merging methods cannot handle merging entire ViT models well and leave room for improvement. To enable merging of the whole ViT, we propose a simple but effective gating network that both merges all kinds of layers (e.g., Embedding, Norm, Attention, and MLP) and selects the suitable classifier. Specifically, the gating network is trained on unlabeled data from all tasks (domains) and predicts the probability that an input belongs to each task (domain), which is used to merge the models at inference time. To further boost the performance of the merged model, especially as the difficulty of merging increases, we design a novel metric of model weight similarity and use it to realize controllable, combined weight merging. Comprehensive experiments on several newly established benchmarks validate the superiority of the proposed ViT merging framework across different tasks and domains. Our method can merge more than 10 ViT models from different vision tasks with a negligible effect on the performance of each task.
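The core mechanism in the abstract, a gating network that predicts per-task probabilities and uses them to combine per-task weights, can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the names `softmax` and `merge_weights` are hypothetical, and real ViT layers would be tensors rather than flat lists.

```python
import math

def softmax(logits):
    """Convert gate logits into task-membership probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def merge_weights(task_weights, probs):
    """Probability-weighted (convex) combination of per-task weight vectors.

    task_weights: one flat weight list per task-specific model
    probs: gating probabilities, one per task (summing to 1)
    """
    n = len(task_weights[0])
    merged = [0.0] * n
    for weights, p in zip(task_weights, probs):
        for i in range(n):
            merged[i] += p * weights[i]
    return merged

# Example: two task models; the gate strongly favors task 0,
# so the merged weights stay close to model 0's weights.
probs = softmax([2.0, 0.0])
merged = merge_weights([[1.0, 1.0], [0.0, 0.0]], probs)
```

In the paper's setting the gate is itself a small network trained on unlabeled data from all tasks, and the same probabilities also select which task-specific classifier head to apply.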