
Text Classification Using Label Names Only: A Language Model Self-Training Approach (2010.07245v1)

Published 14 Oct 2020 in cs.CL and cs.LG

Abstract: Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples but only based on a small set of words describing the categories to be classified. In this paper, we explore the potential of only using the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification without using any labeled documents but learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.
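The core of step (1) is to query a pre-trained masked language model for in-context substitutes of each label name, which then seed a category vocabulary. Below is a minimal sketch of that idea using Hugging Face transformers; the model choice, example sentence, and top-k value are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of step (1): mask an occurrence of a label name and ask a
# pre-trained MLM for its top-k in-context substitutes, which serve as
# candidate category-indicative words. Illustrative only; the model,
# example sentence, and k below are assumptions, not the paper's setup.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def related_words(masked_sentence: str, k: int = 10) -> list[str]:
    """Return the MLM's top-k predictions at the [MASK] position."""
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    # Locate the [MASK] token in the tokenized input.
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos[0]].topk(k).indices
    return [tokenizer.convert_ids_to_tokens(int(i)) for i in top_ids]

# An occurrence of the label name "sports" has been replaced with [MASK];
# the LM's suggestions expand the category vocabulary for that class.
print(related_words("The [MASK] team won the championship game last night."))
```

Aggregating such predictions over many occurrences of a label name in the unlabeled corpus yields the set of semantically related words that later steps use to find category-indicative documents for self-training.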

Authors (7)
  1. Yu Meng (92 papers)
  2. Yunyi Zhang (39 papers)
  3. Jiaxin Huang (48 papers)
  4. Chenyan Xiong (95 papers)
  5. Heng Ji (267 papers)
  6. Chao Zhang (909 papers)
  7. Jiawei Han (263 papers)
Citations (69)
