BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification (2311.16083v1)
Abstract: While performance on many text classification tasks has recently improved thanks to pre-trained language models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on political topics often fails when tested on documents about sport or medicine. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. We thereby verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at https://github.com/dminus1/genre
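The on-topic versus off-topic gap described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code (that lives in the linked repository). The function names `macro_f1` and `topic_transfer_gap`, the record layout, and the choice of macro-averaged F1 are all assumptions made for illustration, since the abstract does not say which F1 variant is reported.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def topic_transfer_gap(records, train_topic):
    """Compare F1 on the training topic with F1 on every other topic.

    records: iterable of (test_topic, gold_genre, predicted_genre) tuples
             produced by a classifier trained only on `train_topic`
             (hypothetical layout, assumed for this sketch).
    Returns (on_topic_f1, {off_topic: on_topic_f1 - off_topic_f1}).
    """
    by_topic = defaultdict(lambda: ([], []))
    for topic, gold, pred in records:
        by_topic[topic][0].append(gold)
        by_topic[topic][1].append(pred)
    f1 = {t: macro_f1(g, p) for t, (g, p) in by_topic.items()}
    on_topic = f1[train_topic]
    gaps = {t: on_topic - s for t, s in f1.items() if t != train_topic}
    return on_topic, gaps


# Toy predictions from an imagined genre classifier trained on "politics".
records = [
    ("politics", "news", "news"),
    ("politics", "review", "review"),
    ("sport", "news", "news"),
    ("sport", "review", "news"),  # off-topic error
]
on_topic, gaps = topic_transfer_gap(records, train_topic="politics")
```

A positive value in `gaps` means the classifier does worse on that topic than on the topic it was trained on. The same comparison could be repeated after augmenting the training set to see how much of the gap closes.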