Plug-and-Play Transformer Modules for Test-Time Adaptation (2401.04130v3)
Abstract: Parameter-efficient tuning (PET) methods such as LoRA, Adapter, and Visual Prompt Tuning (VPT) have found success in enabling adaptation to new domains by tuning small modules within a transformer model. However, the number of domains encountered at test time can be very large, and the data is usually unlabeled. Adaptation to new domains is therefore challenging, and generating a customized tuned module for each such domain is impractical. Toward addressing these challenges, this work introduces PLUTO: a Plug-and-pLay modUlar Test-time domain adaptatiOn strategy. We pre-train a large set of modules, each specialized for a different source domain, effectively creating a ``module store''. Given a target domain with few-shot unlabeled data, we introduce an unsupervised test-time adaptation (TTA) method to (1) select a sparse subset of relevant modules from this store and (2) create a weighted combination of the selected modules without tuning their weights. This plug-and-play nature enables us to harness multiple of the most relevant source domains in a single inference call. Comprehensive evaluations demonstrate that PLUTO uniformly outperforms alternative TTA methods and that selecting $\leq$5 modules suffices to extract most of the benefit. At a high level, our method equips pre-trained transformers with the capability to dynamically adapt to new domains, motivating a new paradigm for efficient and scalable domain adaptation.
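As a reading aid, here is a minimal, hypothetical sketch of the selection-and-combination flow the abstract describes: score each stored module on a few unlabeled target samples, keep a sparse top-k subset, and mix the selected modules with fixed weights in a single inference call. The entropy-based relevance score, the toy LoRA-style module, and all names below (`LoRAModule`, `select_and_combine`, `adapted_forward`) are illustrative assumptions, not the paper's actual criterion or implementation.

```python
# Hypothetical sketch of the PLUTO-style "module store" idea described in the abstract.
# The relevance score (negative prediction entropy on unlabeled target data) and all
# module/function names are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAModule(nn.Module):
    """Toy low-rank adapter standing in for one entry of the module store."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


def select_and_combine(backbone: nn.Module,
                       head: nn.Module,
                       store: list[nn.Module],
                       target_x: torch.Tensor,
                       k: int = 5) -> tuple[list[int], torch.Tensor]:
    """Score every stored module on unlabeled target data, keep the top-k,
    and return softmax mixing weights; no gradient updates to any module."""
    scores = []
    with torch.no_grad():
        feats = backbone(target_x)
        for module in store:
            logits = head(feats + module(feats))        # plug one module in
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
            scores.append(-entropy)                     # lower entropy => higher relevance
    scores = torch.stack(scores)
    top_scores, top_idx = scores.topk(k)
    weights = F.softmax(top_scores, dim=0)              # sparse, weighted combination
    return top_idx.tolist(), weights


def adapted_forward(backbone, head, store, weights, idx, x):
    """Single inference call mixing the selected modules' residual outputs."""
    feats = backbone(x)
    delta = sum(w * store[i](feats) for w, i in zip(weights, idx))
    return head(feats + delta)


if __name__ == "__main__":
    dim, n_classes, n_modules = 32, 10, 20
    backbone, head = nn.Linear(64, dim), nn.Linear(dim, n_classes)
    store = [LoRAModule(dim) for _ in range(n_modules)]
    few_shot_unlabeled = torch.randn(16, 64)            # few-shot unlabeled target batch
    idx, w = select_and_combine(backbone, head, store, few_shot_unlabeled, k=5)
    out = adapted_forward(backbone, head, store, w, idx, torch.randn(4, 64))
    print(idx, out.shape)
```

Note that in this sketch the stored modules stay frozen throughout; only the mixing coefficients are derived from the unlabeled target batch, which mirrors the abstract's claim that the combination is formed without tuning module weights.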
- Unsupervised multi-source domain adaptation without access to source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10103–10112, 2021.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- On the effectiveness of LayerNorm tuning for continual learning in vision transformers. arXiv preprint arXiv:2308.09610, 2023.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Adversarial removal of demographic attributes from text data. arXiv preprint arXiv:1808.06640, 2018.
- Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
- A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
- Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
- Visual prompt tuning for test-time domain adaptation. arXiv preprint arXiv:2210.04831, 2022.
- PPT: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332, 2021.
- Multi-source domain adaptation with mixture of experts. arXiv preprint arXiv:1809.02256, 2018.
- Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, 1990.
- Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 749–757, 2020.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021a.
- Fully test-time adaptation for image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, pages 251–260. Springer, 2021b.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
- Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
- How to adapt your large-scale vision-and-language model. 2021.
- Domain attention with an ensemble of experts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 643–653, 2017.
- Learning multiple layers of features from tiny images. 2009.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.
- GPT understands, too. AI Open, 2023.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.
- The norm must go on: Dynamic unsupervised domain adaptation by normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14765–14775, 2022.
- Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning, pages 16888–16905. PMLR, 2022.
- Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400, 2023.
- Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning. arXiv preprint arXiv:2210.12587, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Cross-domain imitation from observations. In International Conference on Machine Learning, pages 8902–8912. PMLR, 2021.
- Learning to select data for transfer learning with Bayesian optimization. arXiv preprint arXiv:1707.05246, 2017.
- Robert E Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
- Improving robustness against common corruptions by covariate shift adaptation. Advances in Neural Information Processing Systems, 33:11539–11551, 2020.
- MM-TTA: Multi-modal test-time adaptation for 3D semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16928–16937, 2022.
- Avoiding the hypothesis-only bias in natural language inference via ensemble adversarial training. arXiv preprint arXiv:2004.07790, 2020.
- A survey of multi-source domain adaptation. Information Fusion, 24:84–92, 2015.
- Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7472–7481, 2018.
- Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
- On-the-fly test-time adaptation for medical image segmentation. arXiv preprint arXiv:2203.05574, 2022.
- Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
- SPoT: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.
- Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
- Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022.
- David H Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.
- Multi-source test-time adaptation as dueling bandits for extractive question answering. arXiv preprint arXiv:2306.06779, 2023.
- MEMO: Test time robustness via adaptation and augmentation. Advances in Neural Information Processing Systems, 35:38629–38642, 2022.