Weakly-Supervised Neural Text Classification (1809.01478v2)

Published 2 Sep 2018 in cs.IR, cs.CL, cs.LG, and stat.ML

Abstract: Deep neural networks are gaining increasing popularity for the classic text classification task, due to their strong expressive power and less requirement for feature engineering. Despite such attractiveness, neural text classification models suffer from the lack of training data in many real-world applications. Although many semi-supervised and weakly-supervised text classification models exist, they cannot be easily applied to deep neural models and meanwhile support limited supervision types. In this paper, we propose a weakly-supervised method that addresses the lack of training data in neural text classification. Our method consists of two modules: (1) a pseudo-document generator that leverages seed information to generate pseudo-labeled documents for model pre-training, and (2) a self-training module that bootstraps on real unlabeled data for model refinement. Our method has the flexibility to handle different types of weak supervision and can be easily integrated into existing deep neural models for text classification. We have performed extensive experiments on three real-world datasets from different domains. The results demonstrate that our proposed method achieves inspiring performance without requiring excessive training data and outperforms baseline methods significantly.

An Analysis of "Weakly-Supervised Neural Text Classification"

The paper "Weakly-Supervised Neural Text Classification" addresses the critical issue of label scarcity in the domain of neural text classification. Traditional techniques for text classification rely heavily on large labeled datasets for effective training. However, obtaining such labeled data can be an arduous and costly process. This work introduces a novel weakly-supervised methodology that enhances the utility of neural models while minimizing reliance on expansive labeled datasets.

At the core of the proposed method are two principal modules: a pseudo-document generator and a self-training module. The pseudo-document generator leverages weak supervision in the form of minimal seed information, such as label surface names, class-related keywords, or a small number of labeled documents. The generator uses this seed information to create synthetic labeled documents, known as pseudo-documents, which serve as surrogate training data.
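
To make the three supervision types concrete, the following Python sketch shows one way to normalize each of them into per-class seed keyword sets. The helper names and the frequency-based ranking for labeled documents are illustrative assumptions, not the paper's actual interface.

    # A minimal sketch (hypothetical helper names) of normalizing the three
    # weak-supervision types into per-class seed keyword sets.
    from collections import Counter
    import re

    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}

    def seeds_from_label_names(label_names):
        # Label surface names, e.g. ["sports", "politics"], act as their own seeds.
        return {name: [name.lower()] for name in label_names}

    def seeds_from_keywords(class_keywords):
        # Already in the desired form: {"sports": ["basketball", "soccer"], ...}.
        return {cls: [w.lower() for w in words] for cls, words in class_keywords.items()}

    def seeds_from_labeled_docs(labeled_docs, top_k=10):
        # Rank words by frequency within the few labeled documents of each class;
        # the paper uses a more careful ranking, so this is a crude stand-in.
        seeds = {}
        for cls, docs in labeled_docs.items():
            tokens = (w for d in docs for w in re.findall(r"[a-z]+", d.lower()))
            counts = Counter(w for w in tokens if w not in STOPWORDS)
            seeds[cls] = [w for w, _ in counts.most_common(top_k)]
        return seeds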

Class distributions are formulated by representing class semantics with a von Mises-Fisher (vMF) distribution over a shared semantic space of word and document embeddings. This allows the generation of pseudo-documents that maintain semantic fidelity to real-world topics. The process is flexible enough to accommodate various forms of weak supervision, a strength not commonly found in comparable methods.
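
The sketch below illustrates this generation step, assuming pre-trained word embeddings shared by words and documents (e.g. from word2vec) and SciPy >= 1.11 for vMF sampling. The fixed concentration parameter, the softmax temperature, and the omission of the paper's background-word mixing are simplifications.

    # A minimal sketch of vMF-based pseudo-document generation, not the paper's
    # exact procedure: kappa is fixed rather than estimated, and background-word
    # mixing is omitted.
    import numpy as np
    from scipy.stats import vonmises_fisher  # requires SciPy >= 1.11

    def unit(v):
        return v / np.linalg.norm(v)

    def class_vmf(seed_words, emb, kappa=50.0):
        # Mean direction = normalized average of the seed-word embeddings.
        mu = unit(np.mean([unit(emb[w]) for w in seed_words], axis=0))
        return vonmises_fisher(mu=mu, kappa=kappa)

    def generate_pseudo_doc(vmf, vocab, emb, doc_len=100):
        # Sample a "document vector" from the class vMF, then fill the document
        # with vocabulary words close to it in the shared embedding space.
        doc_vec = vmf.rvs(1)[0]
        sims = np.array([unit(emb[w]) @ doc_vec for w in vocab])
        probs = np.exp(10 * sims) / np.exp(10 * sims).sum()  # softmax over similarity
        return list(np.random.choice(vocab, size=doc_len, p=probs))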

After pseudo-document generation, the method employs a self-training module that iteratively refines the classification model by incorporating high-confidence predictions on unlabeled real data. This bootstrapping enhances the model's accuracy without requiring additional labeled data.
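
A minimal sketch of such a refinement loop is given below, assuming a Keras-style classifier compiled with a KL-divergence (or cross-entropy) loss and an already vectorized unlabeled corpus X. Batching, the update schedule, and the exact convergence criterion are simplified relative to the paper.

    # A minimal sketch of self-training refinement on unlabeled data.
    import numpy as np

    def target_distribution(p):
        # Square the current predictions and normalize by class frequency,
        # which emphasizes high-confidence assignments.
        weight = p ** 2 / p.sum(axis=0)
        return weight / weight.sum(axis=1, keepdims=True)

    def self_train(model, X, max_iters=50, update_every=5, tol=0.001):
        prev = None
        for it in range(max_iters):
            if it % update_every == 0:
                p = model.predict(X)           # current soft predictions
                q = target_distribution(p)     # sharpened pseudo-targets
                labels = p.argmax(axis=1)
                # Stop when almost no document changes its assigned class.
                if prev is not None and np.mean(labels != prev) < tol:
                    break
                prev = labels
            model.train_on_batch(X, q)         # refine toward the sharpened targets
        return model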

The paper extensively tests its methodology across three diverse real-world datasets: The New York Times, AG's News, and Yelp Reviews. Numerical results presented in the paper indicate significant performance improvements over various established baselines, including IR with tf-idf, latent variable models such as LDA, and explicit semantic-based models like Dataless classification. Notably, WeSTClass, when applied to convolutional neural networks (CNN) and hierarchical attention networks (HAN), consistently outperforms these baselines, showcasing the robustness and efficacy of the proposed approach.

One intriguing aspect is the demonstrated resilience of WeSTClass to different types of seed input, highlighting its adaptability and potential for widespread application. Moreover, the paper underscores the efficiency of the self-training module in boosting classification accuracy, especially when pretrained with pseudo-documents.

The implications of this research are substantial, both practically and theoretically. Practically, it provides a framework for deploying neural classification models in scenarios where large labeled datasets are infeasible to obtain. Theoretically, it represents a significant step toward understanding how to harness minimal supervision effectively in deep learning frameworks.

Future work could explore more sophisticated ways to incorporate diverse forms of seed information concurrently, potentially improving classification performance across more varied domains. Additionally, investigating the hybrid use of multiple deep learning architectures through this weakly-supervised lens might yield further insights into optimizing model training under constrained label availability.

Overall, this paper contributes a significant advancement in neural text classification, offering a flexible and powerful solution for mitigating the challenge of training data scarcity.

Authors (4)
  1. Yu Meng (92 papers)
  2. Jiaming Shen (56 papers)
  3. Chao Zhang (907 papers)
  4. Jiawei Han (263 papers)
Citations (183)