
Plug-and-Play Transformer Modules for Test-Time Adaptation (2401.04130v3)

Published 6 Jan 2024 in cs.LG and cs.AI

Abstract: Parameter-efficient tuning (PET) methods such as LoRA, Adapter, and Visual Prompt Tuning (VPT) have found success in enabling adaptation to new domains by tuning small modules within a transformer model. However, the number of domains encountered during test time can be very large, and the data is usually unlabeled. Thus, adaptation to new domains is challenging; it is also impractical to generate customized tuned modules for each such domain. Toward addressing these challenges, this work introduces PLUTO: a Plug-and-pLay modUlar Test-time domain adaptatiOn strategy. We pre-train a large set of modules, each specialized for different source domains, effectively creating a "module store". Given a target domain with few-shot unlabeled data, we introduce an unsupervised test-time adaptation (TTA) method to (1) select a sparse subset of relevant modules from this store and (2) create a weighted combination of selected modules without tuning their weights. This plug-and-play nature enables us to harness multiple of the most relevant source domains in a single inference call. Comprehensive evaluations demonstrate that PLUTO uniformly outperforms alternative TTA methods and that selecting ≤5 modules suffices to extract most of the benefit. At a high level, our method equips pre-trained transformers with the capability to dynamically adapt to new domains, motivating a new paradigm for efficient and scalable domain adaptation.
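The abstract describes combining a sparse subset of frozen, pre-trained modules with weights at test time. Below is a minimal PyTorch sketch of that plug-and-play idea; the LoRA-style module, the relevance scores, and names such as `module_store` and `combine_modules` are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: apply a weighted combination of the top-k most relevant frozen
# LoRA-style modules to a feature tensor, without tuning module weights.
import torch
import torch.nn as nn


class LoRAModule(nn.Module):
    """A frozen low-rank adapter pre-trained on one source domain (assumed form)."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        for p in self.parameters():
            p.requires_grad = False  # modules stay frozen at test time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


def combine_modules(x, module_store, weights, top_k=5):
    """Add a weighted residual from the top-k modules in the store.

    `weights` is assumed to come from an unsupervised relevance score
    computed on few-shot unlabeled target data (placeholder here).
    """
    topk = torch.topk(weights, k=min(top_k, len(module_store)))
    delta = torch.zeros_like(x)
    for w, idx in zip(topk.values, topk.indices):
        delta = delta + w * module_store[int(idx)](x)
    return x + delta  # residual update from the selected source modules


if __name__ == "__main__":
    dim = 768
    store = nn.ModuleList([LoRAModule(dim) for _ in range(20)])  # the "module store"
    scores = torch.rand(20)                                       # placeholder relevance scores
    weights = torch.softmax(scores, dim=0)
    features = torch.randn(2, 16, dim)                            # (batch, tokens, dim)
    adapted = combine_modules(features, store, weights, top_k=5)
    print(adapted.shape)  # torch.Size([2, 16, 768])
```

In the paper's setting the relevance weights would be estimated from few-shot unlabeled target data rather than drawn at random; the sketch only shows how selected modules can be composed in a single inference call without updating their parameters.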

