X-Class: Text Classification with Extremely Weak Supervision (2010.12794v2)

Published 24 Oct 2020 in cs.CL, cs.IR, and cs.LG

Abstract: In this paper, we explore text classification with extremely weak supervision, i.e., only relying on the surface text of class names. This is a more challenging setting than the seed-driven weak supervision, which allows a few seed words per class. We opt to attack this problem from a representation learning perspective -- ideal document representations should lead to nearly the same results between clustering and the desired classification. In particular, one can classify the same corpus differently (e.g., based on topics and locations), so document representations should be adaptive to the given class names. We propose a novel framework X-Class to realize the adaptive representations. Specifically, we first estimate class representations by incrementally adding the most similar word to each class until inconsistency arises. Following a tailored mixture of class attention mechanisms, we obtain the document representation via a weighted average of contextualized word representations. With the prior of each document assigned to its nearest class, we then cluster and align the documents to classes. Finally, we pick the most confident documents from each cluster to train a text classifier. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets. Our dataset and code are released at https://github.com/ZihanWangKi/XClass/ .

Authors (3)

Zihan Wang
Dheeraj Mekala
Jingbo Shang

Summary

Insights on Text Classification with Extremely Weak Supervision

The paper "X-Class: Text Classification with Extremely Weak Supervision" by Wang, Mekala, and Shang introduces an approach to text classification, a core NLP task, under an almost-unsupervised setting: the only supervision is the surface text of the class names. This is a harder setting than seed-driven weak supervision, which allows a few seed words per class and therefore requires more input from someone who knows the domain.

Core Contributions

The paper puts forth a novel framework named X-Class. The core idea is to leverage representation learning to enable document classification without conventional supervision methods. X-Class aims to establish document representations that align clustering results closely with desired classification outcomes, emphasizing adaptability to class names specified by users.

Class-Oriented Document Representation: X-Class first estimates a representation for each class by incrementally adding the word most similar to that class, stopping as soon as an inconsistency arises. It then applies a tailored mixture of class attention mechanisms to obtain each document representation as a weighted average of contextualized word representations, so that the representation is sensitive to the user-provided class names.
Document-Class Alignment: By employing clustering methods such as Gaussian Mixture Models (GMM), X-Class clusters the documents and aligns the clusters to classes based on the newly computed representations. The clustering uses as its prior the assignment of each document to its nearest class representation.
Text Classifier Training: The framework subsequently selects the most confident document samples from the clusters to train a conventional text classifier, fine-tuning with pseudo-labeled data.
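
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, the rank-weighted average, the "top-ranked words must stay the same" reading of inconsistency, and the softmax attention are all hypothetical stand-ins for the paper's tailored mechanisms; static toy vectors replace contextualized word representations; and cosine similarity to the nearest class stands in for a GMM posterior.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_weighted_average(vectors):
    # Assumption: the i-th keyword (1-based) gets weight 1/i.
    weights = [1.0 / (i + 1) for i in range(len(vectors))]
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dim)]

def expand_class(class_name, vocab, max_words=5):
    """Grow a keyword list for one class until adding a word is inconsistent."""
    keywords = [class_name]
    class_vec = list(vocab[class_name])
    while len(keywords) < min(max_words, len(vocab)):
        ranked = sorted(vocab, key=lambda w: cosine(vocab[w], class_vec),
                        reverse=True)
        candidate = next(w for w in ranked if w not in keywords)
        trial = keywords + [candidate]
        trial_vec = rank_weighted_average([vocab[w] for w in trial])
        reranked = sorted(vocab, key=lambda w: cosine(vocab[w], trial_vec),
                          reverse=True)
        # Assumed inconsistency test: the keywords, in order, must still be
        # the top-ranked words under the updated class vector.
        if reranked[:len(trial)] != trial:
            break
        keywords, class_vec = trial, trial_vec
    return keywords, class_vec

def document_representation(word_vecs, class_vecs, temperature=0.1):
    # Simplified attention: softmax over each word's best class similarity.
    scores = [max(cosine(w, c) for c in class_vecs) / temperature
              for w in word_vecs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(word_vecs[0])
    return [sum(e / total * w[d] for e, w in zip(exps, word_vecs))
            for d in range(dim)]

def nearest_class(doc_vec, class_vecs):
    sims = [cosine(doc_vec, c) for c in class_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

def select_confident(assignments, keep_fraction=0.5):
    """assignments: iterable of (doc_id, class_index, confidence)."""
    by_class = {}
    for doc_id, cls, conf in assignments:
        by_class.setdefault(cls, []).append((conf, doc_id))
    selected = {}
    for cls, items in by_class.items():
        items.sort(reverse=True)
        k = max(1, math.ceil(len(items) * keep_fraction))
        selected[cls] = [doc_id for _, doc_id in items[:k]]
    return selected

# Toy usage with 2-dimensional vectors (illustrative values only).
vocab = {"sports": [1.0, 0.0], "soccer": [0.9, 0.1], "politics": [0.0, 1.0]}
keywords, sports_vec = expand_class("sports", vocab)
class_vecs = [sports_vec, vocab["politics"]]
doc_vec = document_representation([[0.9, 0.1], [0.5, 0.5]], class_vecs)
label, confidence = nearest_class(doc_vec, class_vecs)
```

In this toy run the expansion for "sports" accepts "soccer" and stops before "politics", because adding it would reorder the top-ranked words. The documents kept by `select_confident` would then serve as pseudo-labeled training data for an ordinary supervised classifier.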

Experimental Validation

The X-Class framework is evaluated through extensive experiments on seven benchmark datasets, spanning diverse domains such as news topics, sentiments, and ontology classifications. The results show that X-Class can rival, and in some cases outperform, existing seed-driven weakly supervised methods, even though those methods are given a few seed words per class while X-Class sees only the class names.

Implications and Future Directions

The implications of this research are twofold. Practically, it reduces the dependency on detailed domain knowledge and expert-curated seeds, thereby democratizing access to effective text classification tools. Theoretically, it opens avenues for further exploration into representation learning's ability to replace traditional weak supervision across varied NLP tasks.

The promising results suggest several future avenues:

Broader NLP Applications: Extending the extremely weak supervision framework to other NLP tasks such as named entity recognition or entity linking could minimize the reliance on annotated data.
Towards Unsupervised Classification: With further advancements, a fully unsupervised approach could be envisioned in which systems determine the class names themselves and then classify documents automatically.

Conclusion

In summary, the X-Class framework presents a significant stride in text classification, operating under constraints of extremely weak supervision. The methodology and findings underscore the potential of representation learning to rival, and in some cases surpass, traditional weak supervision strategies. This paper contributes a robust approach to achieving meaningful classification outcomes with minimal human intervention, setting the stage for future innovations in automated text analysis.
