Small-Text: Active Learning for Text Classification in Python (2107.10314v7)
Abstract: We introduce small-text, an easy-to-use active learning library that offers pool-based active learning for single- and multi-label text classification in Python. It features numerous pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow a variety of classifiers, query strategies, and stopping criteria to be combined, facilitating quick mixing and matching and enabling rapid, convenient development of both active learning experiments and applications. To make various classifiers and query strategies accessible for active learning, small-text integrates several well-known machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face transformers. The latter integrations are optionally installable extensions, so GPUs can be used but are not required. Using this new library, we investigate the performance of the recently published SetFit training paradigm, which we compare to vanilla transformer fine-tuning, finding that it matches the latter in classification accuracy while outperforming it in area under the curve. The library is available under the MIT License at https://github.com/webis-de/small-text, in version 1.3.0 at the time of writing.
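The pool-based loop the abstract describes — train a model on the labeled set, let a query strategy pick the most informative unlabeled examples, have an annotator label them, and repeat — can be sketched in plain Python. The following is a stdlib-only illustration using a toy nearest-centroid classifier and a least-confidence query strategy; all names are illustrative stand-ins and none of this is small-text's actual API.

```python
import math
import random

def train(labeled):
    """Fit per-class centroids from (x, y) pairs (x is a float, y in {0, 1})."""
    centroids = {}
    for label in {y for _, y in labeled}:
        xs = [x for x, y in labeled if y == label]
        centroids[label] = sum(xs) / len(xs)
    return centroids

def predict_proba(centroids, x):
    """Softmax over negative distances to each class centroid."""
    scores = {y: -abs(x - c) for y, c in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def least_confidence_query(centroids, pool, batch_size):
    """Select the pool items whose top predicted class probability is lowest."""
    return sorted(pool, key=lambda x: max(predict_proba(centroids, x).values()))[:batch_size]

random.seed(0)
# Unlabeled pool: two overlapping one-dimensional clusters.
pool = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(4, 1) for _ in range(50)]
oracle = lambda x: 0 if x < 2 else 1   # simulated human annotator
labeled = [(-1.0, 0), (5.0, 1)]        # initial seed labels, one per class

for _ in range(5):                     # five active learning iterations
    model = train(labeled)
    queried = least_confidence_query(model, pool, batch_size=5)
    for x in queried:
        pool.remove(x)                 # move queried items from the pool ...
        labeled.append((x, oracle(x))) # ... into the labeled set
```

In small-text itself, the classifier, the query strategy, and an optional stopping criterion are separate pluggable components behind standardized interfaces, which is what makes the mix-and-match mentioned in the abstract possible.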
- Michael Altschuler and Michael Bloodgood. 2019. Stopping active learning based on predicted change of F measure for text classification. In 2019 IEEE 13th International Conference on Semantic Computing (ICSC), pages 47–54. IEEE.
- Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2020. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations (ICLR). OpenReview.net.
- Parmida Atighehchian, Frédéric Branchaud-Charron, and Alexandre Lacoste. 2020. Bayesian active learning for production, a systematic study and a reusable library. arXiv preprint arXiv:2006.09916.
- Olivier Bachem, Mario Lucic, and Andreas Krause. 2018. Scalable k-Means Clustering via Lightweight Coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1119–1127.
- Michael Bloodgood and K. Vijay-Shanker. 2009. A method for stopping active learning based on stabilizing predictions and the need for user-adjustable stopping. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 39–47, Boulder, Colorado. Association for Computational Linguistics.
- Ching-Yao Chuang, Joshua Robinson, Yen-Chen Lin, Antonio Torralba, and Stefanie Jegelka. 2020. Debiased contrastive learning. In Advances in Neural Information Processing Systems 33 (NeurIPS), volume 33, pages 8765–8775. Curran Associates, Inc.
- Similarity search for efficient active learning and search of rare concepts. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 36(6):6402–6410.
- Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), volume 2, pages 746–751.
- Tivadar Danka and Peter Horvath. 2018. modAL: A modular active learning framework for Python. arXiv preprint arXiv:1805.00979.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186. Association for Computational Linguistics.
- Active Learning for BERT: An Empirical Study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.
- Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software, 1st edition. Addison-Wesley Longman Publishing Co., Inc., USA.
- Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning. arXiv preprint arXiv:1907.06347.
- To softmax, or not to softmax: that is the question when applying active learning for transformer models. arXiv preprint arXiv:2210.03005.
- Olivier J. Hénaff. 2020. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 4182–4192. PMLR.
- Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. 2011. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
- Wei-Ning Hsu and Hsuan-Tien Lin. 2015. Active learning by learning. Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 29(1).
- Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 168–177, New York, NY, USA. Association for Computing Machinery.
- Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, April 21-23, 1998, Proceedings, volume 1398 of Lecture Notes in Computer Science, pages 137–142. Springer.
- Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
- Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pages 52–61, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Introducing geometry in active learning for image segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2974–2982.
- scikit-activeml: A Library and Toolbox for Active Learning Algorithms. Preprints.org.
- Florian Laws and Hinrich Schütze. 2008. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 465–472.
- David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3–12.
- Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), volume 1 of COLING ’02, pages 1–7, USA. Association for Computational Linguistics.
- Active Learning to Recognize Multiple Types of Plankton. Journal of Machine Learning Research (JMLR), 6:589–613.
- Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 650–663.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR).
- Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2021. On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021). OpenReview.net.
- A reverse-engineering approach to subsystem structure identification. Journal of Software Maintenance: Research and Practice, 5(4):181–204.
- Glenford J. Myers. 1975. Reliable Software through Composite Design. Petrocelli/Charter.
- Fredrik Olsson and Katrin Tomanek. 2009. An intrinsic stopping criterion for committee-based active learning. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), pages 138–146.
- Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 271–278, Barcelona, Spain.
- Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 115–124, Ann Arbor, Michigan. Association for Computational Linguistics.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035.
- Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12(85):2825–2830.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Effective active learning strategy for multi-label learning. Neurocomputing, 273:494–508.
- JCLAL: A Java Framework for Active Learning. Journal of Machine Learning Research (JMLR), 17(95):1–5.
- Julia Romberg and Tobias Escher. 2022. Automated topic categorisation of citizens’ contributions: Reducing manual labelling efforts through active learning. In Electronic Government, pages 369–385, Cham. Springer International Publishing.
- Nicholas Roy and Andrew McCallum. 2001. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 441–448.
- Christopher Schröder, Andreas Niekler, and Martin Potthast. 2022. Revisiting uncertainty-based query strategies for active learning with transformers. In Findings of the Association for Computational Linguistics: ACL 2022 (Findings of ACL 2022), pages 2194–2203.
- Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Burr Settles, Mark Craven, and Soumya Ray. 2008. Multiple-instance active learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS), pages 1289–1296.
- The Need for Open Source Software in Machine Learning. Journal of Machine Learning Research (JMLR), 8(81):2443–2466.
- Ying-Peng Tang, Guo-Xiang Li, and Sheng-Jun Huang. 2019. ALiPy: Active learning in Python. arXiv preprint arXiv:1901.03802.
- Paolo Tonella. 2001. Concept analysis for module restructuring. IEEE Transactions on Software Engineering, 27(4):351–363.
- ALToolbox: A set of tools for active learning annotation of natural language texts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 406–434, Abu Dhabi, UAE. Association for Computational Linguistics.
- Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. 2022. Efficient few-shot learning without prompts. arXiv preprint arXiv:2209.11055.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS), pages 5998–6008.
- Andreas Vlachos. 2008. A stopping criterion for active learning. Computer Speech & Language, 22(3):295–312.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), pages 38–45.
- Yao-Yuan Yang, Shao-Chuan Lee, Yu-An Chung, Tung-En Wu, Si-An Chen, and Hsuan-Tien Lin. 2017. libact: Pool-based active learning in Python. arXiv preprint arXiv:1710.00379.
- AcTune: Uncertainty-based active self-training for active fine-tuning of pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1422–1436, Seattle, United States. Association for Computational Linguistics.
- Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7935–7948. Association for Computational Linguistics.
- Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28 (NIPS), pages 649–657. Curran Associates, Inc., Montreal, Quebec, Canada.
- Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 3386–3392.
- Jingbo Zhu, Huizhen Wang, and Eduard Hovy. 2008. Multi-criteria-based strategy to stop active learning for data annotation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1129–1136, Manchester, UK. Coling 2008 Organizing Committee.