TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision (2403.00165v2)
Abstract: Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy. Most earlier works focus on fully or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire. To reduce human effort, in this paper we study hierarchical text classification with minimal supervision: the sole class name of each node serves as the only supervision. Recently, large language models (LLMs) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include a large, structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle these challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which (1) automatically enriches the label taxonomy with class-indicative terms to facilitate classifier training and (2) utilizes LLMs for both data annotation and data creation tailored to the hierarchical label space. Experiments show that TELEClass outperforms previous weakly-supervised methods and LLM-based zero-shot prompting methods on two public datasets.
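The core idea of enriching each class with corpus-mined class-indicative terms and then routing a document top-down through the taxonomy can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the toy taxonomy, the hand-picked term sets, and the simple term-overlap score all stand in for the learned components (in the paper, terms are mined automatically and matching is embedding- and LLM-based).

```python
# Toy label taxonomy: parent -> list of children (leaves map to []).
TAXONOMY = {
    "root": ["sports", "science"],
    "sports": ["soccer", "tennis"],
    "science": ["physics", "biology"],
    "soccer": [], "tennis": [], "physics": [], "biology": [],
}

# Enriched term sets: the class name alone is weak supervision; adding
# class-indicative terms makes matching far more reliable.
ENRICHED_TERMS = {
    "sports": {"sports", "game", "team", "league"},
    "science": {"science", "research", "theory"},
    "soccer": {"soccer", "goal", "striker", "fifa"},
    "tennis": {"tennis", "serve", "racket", "wimbledon"},
    "physics": {"physics", "quantum", "particle"},
    "biology": {"biology", "cell", "gene"},
}

def score(doc_tokens: set[str], cls: str) -> int:
    """Count how many class-indicative terms occur in the document."""
    return len(doc_tokens & ENRICHED_TERMS.get(cls, {cls}))

def classify(text: str) -> list[str]:
    """Greedy top-down traversal: at each node, descend into the
    best-scoring child until a leaf is reached."""
    tokens = set(text.lower().split())
    path, node = [], "root"
    while TAXONOMY[node]:
        node = max(TAXONOMY[node], key=lambda c: score(tokens, c))
        path.append(node)
    return path

print(classify("the striker scored a goal in the fifa league"))
# → ['sports', 'soccer']
```

The top-down traversal is what keeps the large label space out of any single decision: each step only compares a node's children, which is also why a single flat zero-shot prompt over all classes struggles by comparison.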
- Yunyi Zhang
- Ruozhen Yang
- Xueqiang Xu
- Jinfeng Xiao
- Jiaming Shen
- Jiawei Han
- Rui Li