Small-Text: Active Learning for Text Classification in Python (2107.10314v7)
Abstract: We introduce small-text, an easy-to-use active learning library that offers pool-based active learning for single- and multi-label text classification in Python. It features numerous pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow a variety of classifiers, query strategies, and stopping criteria to be combined, facilitating quick mixing and matching and enabling rapid, convenient development of both active learning experiments and applications. To make various classifiers and query strategies accessible for active learning, small-text integrates several well-known machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face transformers. The latter integrations are optionally installable extensions, so GPUs can be used but are not required. Using this new library, we investigate the performance of the recently published SetFit training paradigm, which we compare to vanilla transformer fine-tuning, finding that it matches the latter in classification accuracy while outperforming it in area under the curve. The library is available under the MIT License at https://github.com/webis-de/small-text, in version 1.3.0 at the time of writing.
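The pool-based loop the abstract describes — train a model on the labeled set, let a query strategy pick the most informative unlabeled examples, have an annotator label them, and repeat — can be sketched in plain Python. The following is a stdlib-only illustration using a toy nearest-centroid classifier and a least-confidence query strategy; all names are illustrative stand-ins and none of this is small-text's actual API.

```python
import math
import random

def train(labeled):
    """Fit per-class centroids from (x, y) pairs (x is a float, y in {0, 1})."""
    centroids = {}
    for label in {y for _, y in labeled}:
        xs = [x for x, y in labeled if y == label]
        centroids[label] = sum(xs) / len(xs)
    return centroids

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    scores = {y: -abs(x - c) for y, c in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def least_confidence_query(centroids, pool, batch_size):
    """Select the pool items whose top predicted class probability is lowest."""
    return sorted(pool, key=lambda x: max(predict_proba(centroids, x).values()))[:batch_size]

random.seed(0)
# Unlabeled pool: two overlapping one-dimensional clusters.
pool = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(4, 1) for _ in range(50)]
oracle = lambda x: 0 if x < 2 else 1   # simulated human annotator
labeled = [(-1.0, 0), (5.0, 1)]        # initial seed labels, one per class

for _ in range(5):                     # five active learning iterations
    model = train(labeled)
    queried = least_confidence_query(model, pool, batch_size=5)
    for x in queried:
        pool.remove(x)                 # move queried items from the pool ...
        labeled.append((x, oracle(x))) # ... into the labeled set
```

In small-text itself, the classifier, the query strategy, and an optional stopping criterion are separate pluggable components behind standardized interfaces, which is what makes the mix-and-match mentioned in the abstract possible.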
- Michael Altschuler and Michael Bloodgood. 2019. Stopping active learning based on predicted change of F measure for text classification. In 2019 IEEE 13th International Conference on Semantic Computing (ICSC), pages 47–54. IEEE.
- Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2020. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations (ICLR). OpenReview.net.
- Parmida Atighehchian, Frédéric Branchaud-Charron, and Alexandre Lacoste. 2020. Bayesian active learning for production, a systematic study and a reusable library. arXiv preprint arXiv:2006.09916.
- Olivier Bachem, Mario Lucic, and Andreas Krause. 2018. Scalable k-Means Clustering via Lightweight Coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1119–1127.
- Michael Bloodgood and K. Vijay-Shanker. 2009. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 39–47, Boulder, Colorado. Association for Computational Linguistics.
- Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. 2020. Debiased contrastive learning. In Advances in Neural Information Processing Systems 33 (NeurIPS), volume 33, pages 8765–8775. Curran Associates, Inc.
- Similarity search for efficient active learning and search of rare concepts. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 36(6):6402–6410.
- Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), volume 2, pages 746–751.
- Tivadar Danka and Peter Horvath. 2018. modAL: A modular active learning framework for Python. arXiv preprint arXiv:1805.00979.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186. Association for Computational Linguistics.
- Active Learning for BERT: An Empirical Study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.
- Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software, 1st edition. Addison-Wesley Longman Publishing Co., Inc., USA.
- Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning. arXiv preprint arXiv:1907.06347.
- To softmax, or not to softmax: that is the question when applying active learning for transformer models. arXiv preprint arXiv:2210.03005.
- Olivier J. Hénaff. 2020. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 4182–4192. PMLR.
- Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
- Wei-Ning Hsu and Hsuan-Tien Lin. 2015. Active learning by learning. Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 29(1).
- Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 168–177, New York, NY, USA. Association for Computing Machinery.
- Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings, volume 1398 of Lecture Notes in Computer Science, pages 137–142. Springer.
- Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
- Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pages 52–61, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Introducing geometry in active learning for image segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2974–2982.
- scikit-activeml: A Library and Toolbox for Active Learning Algorithms. Preprints.org.
- Florian Laws and Hinrich Schütze. 2008. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 465–472.
- David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3–12.
- Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), volume 1 of COLING ’02, pages 1–7, USA. Association for Computational Linguistics.
- Active Learning to Recognize Multiple Types of Plankton. Journal of Machine Learning Research (JMLR), 6:589–613.
- Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 650–663.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR).
- Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2021. On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021). OpenReview.net.
- A reverse-engineering approach to subsystem structure identification. Journal of Software Maintenance: Research and Practice, 5(4):181–204.
- Glenford J. Myers. 1975. Reliable Software through Composite Design. Petrocelli/Charter.
- Fredrik Olsson and Katrin Tomanek. 2009. An intrinsic stopping criterion for committee-based active learning. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), pages 138–146.
- Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 271–278, Barcelona, Spain.
- Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 115–124, Ann Arbor, Michigan. Association for Computational Linguistics.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035.
- Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12(85):2825–2830.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Effective active learning strategy for multi-label learning. Neurocomputing, 273:494–508.
- JCLAL: A Java Framework for Active Learning. Journal of Machine Learning Research (JMLR), 17(95):1–5.
- Julia Romberg and Tobias Escher. 2022. Automated topic categorisation of citizens’ contributions: Reducing manual labelling efforts through active learning. In Electronic Government, pages 369–385, Cham. Springer International Publishing.
- Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 441–448.
- Christopher Schröder, Andreas Niekler, and Martin Potthast. 2022. Revisiting uncertainty-based query strategies for active learning with transformers. In Findings of the Association for Computational Linguistics: ACL 2022 (Findings of ACL 2022), pages 2194–2203.
- Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Burr Settles, Mark Craven, and Soumya Ray. 2008. Multiple-instance active learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS), pages 1289–1296.
- The Need for Open Source Software in Machine Learning. Journal of Machine Learning Research (JMLR), 8(81):2443–2466.
- Ying-Peng Tang, Guo-Xiang Li, and Sheng-Jun Huang. 2019. ALiPy: Active learning in Python. arXiv preprint arXiv:1901.03802.
- Paolo Tonella. 2001. Concept analysis for module restructuring. IEEE Transactions on Software Engineering, 27(4):351–363.
- ALToolbox: A set of tools for active learning annotation of natural language texts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 406–434, Abu Dhabi, UAE. Association for Computational Linguistics.
- Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. 2022. Efficient few-shot learning without prompts. arXiv preprint arXiv:2209.11055.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS), pages 5998–6008.
- Andreas Vlachos. 2008. A stopping criterion for active learning. Computer Speech & Language, 22(3):295–312.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), pages 38–45.
- Yao-Yuan Yang, Shao-Chuan Lee, Yu-An Chung, Tung-En Wu, Si-An Chen, and Hsuan-Tien Lin. 2017. libact: Pool-based active learning in Python. arXiv preprint arXiv:1710.00379.
- AcTune: Uncertainty-based active self-training for active fine-tuning of pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1422–1436, Seattle, United States. Association for Computational Linguistics.
- Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7935–7948. Association for Computational Linguistics.
- Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28 (NIPS), pages 649–657. Curran Associates, Inc., Montreal, Quebec, Canada.
- Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 3386–3392.
- Jingbo Zhu, Huizhen Wang, and Eduard Hovy. 2008. Multi-criteria-based strategy to stop active learning for data annotation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1129–1136, Manchester, UK. Coling 2008 Organizing Committee.