How Useful is Continued Pre-Training for Generative Unsupervised Domain Adaptation? (2401.17514v2)

Published 31 Jan 2024 in cs.CL

Abstract: Recent breakthroughs in scale have enabled the emergence of powerful generative LLMs, and the ability to fine-tune these models on various tasks by casting them into prompts or instructions. In this landscape, the problem of Unsupervised Domain Adaptation (UDA), or the problem of leveraging knowledge from a labeled source domain to an unlabeled target domain, has been left behind, with recent UDA methods still addressing discriminative classification. In particular, two popular UDA approaches, involving Continued Pre-Training (CPT) and learning domain invariant representations, have been under-explored in the generative setting, signaling a gap. In this work, we evaluate the utility of CPT for generative UDA. We first perform an empirical evaluation to measure the trade-offs between CPT and strong methods promoting domain invariance. We further evaluate how well the benefits of CPT extend to different architectures, tuning methods and data regimes. We then motivate the use of CPT by studying to what degree it benefits classification performance on the target domain. Finally, we attempt to understand the mechanism by which CPT improves classification performance on the unlabeled target domain. Our findings suggest that the model implicitly learns the downstream task while predicting masked words informative to that task. Our work connects the body of UDA research with that of instruction tuning, enabling an initial step towards a wider applicability of modern LLMs.

Introduction

To address the challenge of domain adaptation in language models (LMs), a new paradigm within unsupervised domain adaptation (UDA) has emerged: prompt-based UDA. This approach uses prompt templates to recast discriminative predictions as generative tasks, enabling adaptation to the target domain without relying on domain-invariant representations or extended pre-training.
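To make the prompt-based setup concrete, the sketch below shows how a discriminative sentiment label can be recast as text generation through a template and a verbalizer. This is a minimal illustration; the template wording, label verbalizer, and the `generate` callable are assumptions, not the paper's exact prompts.

```python
# Minimal sketch of prompt-based classification: a discriminative label
# prediction is recast as conditional text generation. The template wording
# and label verbalizer are illustrative, not the paper's exact prompts.

def build_prompt(review: str) -> str:
    """Wrap an input text in an instruction-style template."""
    return (
        "Review: " + review + "\n"
        "Question: Is the sentiment of this review positive or negative?\n"
        "Answer:"
    )

VERBALIZER = {"positive": 1, "negative": 0}

def predict_label(generate, review: str) -> int:
    """`generate` is any text-generation callable (e.g., an instruction-tuned LM)."""
    answer = generate(build_prompt(review)).strip().lower()
    # Map the generated verbalizer token back to a class id; -1 means no match.
    return VERBALIZER.get(answer, -1)
```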

Methodology

The paper introduces FEUDA (Frustratingly Easy UDA), a method comprising two instruction-tuning tasks. The first task performs masked language modeling (MLM) on unlabeled data from both the source and target domains. The second task applies supervised instruction-tuning with labeled source data for classification. Together, these tasks bridge the gap between pre-training and adaptation, improving the LM's performance on the target domain.
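The sketch below illustrates how the two instruction-tuning tasks could be constructed as input-output pairs for a generative LM. The instruction wording, the `<mask>` sentinel, and the 15% masking rate are assumptions for illustration; they follow the paper's setup only in spirit.

```python
import random

# Minimal sketch of FEUDA's two instruction-tuning tasks. Instruction wording,
# mask sentinel, and masking rate are illustrative assumptions.

MASK_RATE = 0.15

def make_mlm_example(text: str) -> dict:
    """Task 1: masked-word prediction on unlabeled source/target text."""
    tokens = text.split()
    masked, targets = [], []
    for tok in tokens:
        if random.random() < MASK_RATE:
            masked.append("<mask>")
            targets.append(tok)
        else:
            masked.append(tok)
    return {
        "input": "Fill in the masked words: " + " ".join(masked),
        "output": " ".join(targets),
    }

def make_classification_example(text: str, label: str) -> dict:
    """Task 2: supervised classification instruction on labeled source text."""
    return {
        "input": "Classify the sentiment of this review as positive or negative: " + text,
        "output": label,
    }
```

Both kinds of examples are cast in the same instruction format, so a single generative LM can be tuned on the unlabeled MLM task and the labeled classification task without any architectural changes.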

Results

Extensive experiments on 24 real-world domain pairs demonstrate FEUDA's superiority over traditional domain-invariant methods. A noteworthy finding is that MLM within FEUDA augments the model's semantic and background knowledge of a domain, contributing positively to downstream classification tasks. The research reveals significant improvements in target-domain classification performance, even in few-shot learning scenarios and across various models and adaptation techniques.

Analysis and Extensions

The authors examine how MLM affects UDA by analyzing the selection of masked words and varying the masking rate. They find that masking both informative and uninformative words, identified through pointwise mutual information (PMI), is crucial for achieving high classification accuracy. They also explore the impact of different masking rates, finding that performance peaks at a 15% masking rate, while higher rates degrade target-domain classification.
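As a rough illustration of the word-selection analysis, the sketch below scores words by PMI with the class label, a common way to operationalize "informative" words. The document-level counting and add-one smoothing are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import math
from collections import Counter

# Illustrative PMI scoring: how strongly a word co-occurs with a class label,
# relative to what independence would predict. Add-one smoothing and
# document-level counts are assumptions made for this sketch.

def pmi_scores(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns a dict mapping (word, label) -> PMI score."""
    word_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for tokens, y in zip(docs, labels):
        label_counts[y] += 1
        for w in set(tokens):          # count each word once per document
            word_counts[w] += 1
            joint_counts[(w, y)] += 1
    n = len(docs)
    scores = {}
    for (w, y), c_wy in joint_counts.items():
        p_wy = (c_wy + 1) / (n + 1)
        p_w = (word_counts[w] + 1) / (n + 1)
        p_y = (label_counts[y] + 1) / (n + 1)
        scores[(w, y)] = math.log(p_wy / (p_w * p_y))
    return scores
```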

Conclusion

The paper concludes that domain invariance is not a necessity for prompt-based UDA, an insight that sets the stage for future exploration. FEUDA is a robust, competitive method that offers a simple yet effective solution to UDA challenges in LMs, and a promising direction for researchers and practitioners seeking better adaptability in real-world applications.

Authors
  1. Rheeya Uppaal
  2. Yixuan Li
  3. Junjie Hu