Open-world Multi-label Text Classification with Extremely Weak Supervision (2407.05609v1)

Published 8 Jul 2024 in cs.CL

Abstract: We study open-world multi-label text classification under extremely weak supervision (XWS), where the user only provides a brief description of the classification objective without any labels or a ground-truth label space. Similar single-label XWS settings have been explored recently; however, these methods cannot be easily adapted to the multi-label case. We observe that (1) most documents have a dominant class covering the majority of content and (2) long-tail labels appear in some documents as the dominant class. Therefore, we first use the user description to prompt an LLM for dominant keyphrases over a subset of raw documents, and then construct an initial label space via clustering. We further apply a zero-shot multi-label classifier to locate documents with small top predicted scores, so we can revisit their dominant keyphrases for more long-tail labels. We iterate this process to discover a comprehensive label space and construct a multi-label classifier, yielding a novel method, X-MLClass. X-MLClass exhibits a remarkable increase in ground-truth label space coverage on various datasets, for example, a 40% improvement on the AAPD dataset over topic modeling and keyword extraction methods. Moreover, X-MLClass achieves the best end-to-end multi-label classification accuracy.
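The iterative discovery loop described in the abstract can be sketched as follows. This is a minimal, self-contained illustration, not the authors' implementation: `dominant_keyphrase` is a stand-in for prompting an LLM with the user's objective, `top_score` is a stand-in for a zero-shot multi-label classifier, and the clustering step is collapsed to deduplication of keyphrases. All function names and the threshold value are hypothetical.

```python
from collections import Counter

def dominant_keyphrase(doc):
    # Stand-in for prompting an LLM for the document's dominant
    # keyphrase; here we simply take the most frequent word.
    words = doc.lower().split()
    return Counter(words).most_common(1)[0][0]

def top_score(doc, label_space):
    # Stand-in for a zero-shot multi-label classifier: score each
    # candidate label by whether it occurs in the document, and
    # return the highest label score.
    if not label_space:
        return 0.0
    words = set(doc.lower().split())
    return max(1.0 if label in words else 0.0 for label in label_space)

def discover_label_space(docs, threshold=0.5, max_iters=5):
    # Step 1: initial label space from dominant keyphrases of a
    # subset of raw documents (deduplication stands in for clustering).
    subset = docs[: max(len(docs) // 2, 1)]
    labels = {dominant_keyphrase(d) for d in subset}
    for _ in range(max_iters):
        # Step 2: find documents whose top predicted score is small,
        # i.e. documents the current label space does not cover.
        uncovered = [d for d in docs if top_score(d, labels) < threshold]
        if not uncovered:
            break
        # Step 3: revisit their dominant keyphrases to surface
        # long-tail labels, then iterate.
        new_labels = {dominant_keyphrase(d) for d in uncovered}
        if new_labels <= labels:
            break
        labels |= new_labels
    return labels
```

In this toy form, a document whose dominant topic is absent from the current label space scores low and contributes its keyphrase as a new long-tail label on the next pass, which mirrors the paper's coverage-expansion idea at a much smaller scale.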

Authors (6)
  1. Xintong Li
  2. Jinya Jiang
  3. Ria Dharmani
  4. Jayanth Srinivasa
  5. Gaowen Liu
  6. Jingbo Shang
Citations (1)
