Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
149 tokens/sec
GPT-4o
7 tokens/sec
Gemini 2.5 Pro Pro
45 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media (2202.00540v1)

Published 28 Jan 2022 in cs.CL

Abstract: Recent advances in NLP in online social media are evidently owed to large-scale datasets. However, labeling, storing, and processing a large number of textual data points, e.g., tweets, has remained challenging. On top of that, in applications such as hate speech detection, labeling a sufficiently large dataset containing offensive content can be mentally and emotionally taxing for human annotators. Thus, NLP methods that can make the best use of significantly less labeled data points are of great interest. In this paper, we present a novel pool-based active learning method that can be used for the training of large unlabeled corpus with minimum annotation cost. For that, we propose to find the dominant sets of local clusters in the feature space. These sets represent maximally cohesive structures in the data. Then, the samples that do not belong to any of the dominant sets are selected to be used to train the model, as they represent the boundaries of the local clusters and are more challenging to classify. Our proposed method does not have any parameters to be tuned, making it dataset-independent, and it can approximately achieve the same classification accuracy as full training data, with significantly fewer data points. Additionally, our method achieves a higher performance in comparison to the state-of-the-art active learning strategies. Furthermore, our proposed algorithm is able to incorporate conventional active learning scores, such as uncertainty-based scores, into its selection criteria. We show the effectiveness of our method on different datasets and using different neural network architectures.

Summary

We haven't generated a summary for this paper yet.