Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data
Abstract: Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining the browsing histories of millions of webpages. Automated, scalable methods are necessary because manual labeling is impractical at this scale. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero- and few-shot approaches, and investigate the impact of negative sampling strategies and of combining URL- and content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier, and that fine-tuning encoder-based models yields better results than in-context learning. Classifiers that use both URL- and content-based features perform best, while URLs alone provide adequate results when page content is unavailable.
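The setup described in the abstract — binary topic classification from a combination of URL- and content-based features — can be sketched with a lightweight stand-in. The paper fine-tunes pre-trained encoder models; the sketch below instead uses a TF-IDF plus logistic-regression pipeline (a deliberate simplification) purely to illustrate how URL tokens and page text can be joined into a single input, and how URL-only inference works when content is unavailable. All example URLs, texts, and labels are invented for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled examples: (URL, page content, topic label) — invented for illustration.
# Label 1 = related to the policy topic (here: minimum wage), 0 = unrelated.
pages = [
    ("example.de/politik/mindestlohn-erhoehung",
     "Der Mindestlohn steigt auf 12 Euro pro Stunde.", 1),
    ("example.de/sport/bundesliga-ergebnisse",
     "Die Bundesliga-Ergebnisse vom Wochenende.", 0),
    ("news.de/wirtschaft/mindestlohn-debatte",
     "Debatte um die Erhöhung des Mindestlohns.", 1),
    ("news.de/unterhaltung/tv-tipps",
     "Die besten TV-Tipps für heute Abend.", 0),
]

# Combine URL and content into one text input; a separator token keeps the
# two fields distinct, mirroring the idea of URL- plus content-based features.
X = [f"{url} [SEP] {content}" for url, content, _ in pages]
y = [label for _, _, label in pages]

# Character n-grams handle URL tokens such as "mindestlohn-erhoehung"
# without needing word segmentation.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(),
)
clf.fit(X, y)

# URL-only inference, for the case where page content is unavailable:
# the content field is simply left empty.
print(clf.predict(["blog.de/politik/mindestlohn [SEP] "]))
```

In the paper's actual setting the same input construction (URL, separator, content) would be fed to a fine-tuned encoder such as a German or multilingual BERT variant rather than to a bag-of-n-grams model.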