
TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision (2403.00165v2)

Published 29 Feb 2024 in cs.CL and cs.LG

Abstract: Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy. Most earlier works focus on fully or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire. To reduce human effort, in this paper we work on hierarchical text classification with a minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLMs) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which (1) automatically enriches the label taxonomy with class-indicative terms to facilitate classifier training and (2) utilizes LLMs for both data annotation and creation tailored for the hierarchical label space. Experiments show that TELEClass can outperform previous weakly-supervised methods and LLM-based zero-shot prompting methods on two public datasets.
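The abstract's two ideas, enriching each taxonomy node with class-indicative terms and querying the LLM over the hierarchy instead of packing the full structured label space into one prompt, can be illustrated with a small sketch. This is a toy illustration only, not the authors' pipeline: the taxonomy, the CLASS_TERMS dictionary (standing in for the class-indicative terms TELEClass mines from the corpus), and the keyword-overlap call_llm stand-in are all hypothetical; a real system would mine the terms from corpus statistics and call an actual LLM at each level.

```python
# Toy sketch of top-down hierarchical annotation (not the authors' code).
# Rather than putting the whole structured label space into one prompt,
# walk the taxonomy and pick among a node's children at each level.

# Hypothetical two-level label taxonomy: parent -> children.
TAXONOMY = {
    "root": ["sports", "science"],
    "sports": ["soccer", "tennis"],
    "science": ["physics", "biology"],
}

# Hypothetical hand-written stand-in for the class-indicative terms
# that TELEClass mines from the corpus to enrich each taxonomy node.
CLASS_TERMS = {
    "sports": {"match", "court", "goal", "serve"},
    "science": {"boson", "gene", "enzyme", "decay"},
    "soccer": {"goal", "striker", "league"},
    "tennis": {"serve", "backhand", "court"},
    "physics": {"boson", "collider", "decay"},
    "biology": {"gene", "cell", "enzyme"},
}


def call_llm(document: str, candidates: list[str]) -> str:
    """Stand-in for an LLM call. The sketch fakes the model's choice by
    term overlap between the document and each candidate's enriched
    term set; a real system would prompt an LLM with the candidates."""
    words = set(document.lower().split())
    return max(candidates, key=lambda c: len(words & CLASS_TERMS[c]))


def annotate(document: str, node: str = "root") -> list[str]:
    """Descend the taxonomy one level at a time, so each query only
    ever involves a handful of sibling labels."""
    path = []
    while node in TAXONOMY:
        node = call_llm(document, TAXONOMY[node])
        path.append(node)
    return path


print(annotate("the striker scored a late goal in the league match"))
# -> ['sports', 'soccer']
```

Walking the taxonomy level by level keeps each query down to a few sibling labels, which is the abstract's answer to why flat zero-shot prompting over the entire structured label space performs poorly.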

Authors (7)
  1. Yunyi Zhang (39 papers)
  2. Ruozhen Yang (2 papers)
  3. Xueqiang Xu (5 papers)
  4. Jinfeng Xiao (10 papers)
  5. Jiaming Shen (56 papers)
  6. Jiawei Han (263 papers)
  7. Rui Li (384 papers)
Citations (4)