Prompting Large Language Models for Topic Modeling (2312.09693v1)

Published 15 Dec 2023 in cs.AI

Abstract: Topic modeling is a widely used technique for revealing underlying thematic structures within textual data. However, existing models have certain limitations, particularly when dealing with short text datasets that lack co-occurring words. Moreover, these models often neglect sentence-level semantics, focusing primarily on token-level semantics. In this paper, we propose PromptTopic, a novel topic modeling approach that harnesses the advanced language understanding of LLMs to address these challenges. It involves extracting topics at the sentence level from individual documents, then aggregating and condensing these topics into a predefined quantity, ultimately providing coherent topics for texts of varying lengths. This approach eliminates the need for manual parameter tuning and improves the quality of extracted topics. We benchmark PromptTopic against the state-of-the-art baselines on three vastly diverse datasets, establishing its proficiency in discovering meaningful topics. Furthermore, qualitative analysis showcases PromptTopic's ability to uncover relevant topics in multiple datasets.

This paper, "Prompting LLMs for Topic Modeling" (Wang et al., 2023 ), introduces a novel topic modeling approach called \textsf{PromptTopic} that leverages LLMs. The authors aim to address limitations of traditional topic models, such as difficulties with short texts, reliance on word-level semantics, and the need for extensive manual hyperparameter tuning.

The core idea behind PromptTopic is to use the advanced language understanding capabilities of LLMs to extract topics at a more semantic level, particularly focusing on sentence-level context, rather than just token-level statistics. The method is unsupervised and consists of three main stages:

  1. Topic Generation: For each document, an LLM is prompted to extract relevant topics. The prompts include demonstration examples in an in-context learning setup to guide the LLM's output format and quality (a minimal prompt sketch follows this list). The authors found that using 4 demonstration examples yielded good performance, especially for smaller LLMs like LLaMA; instruction-tuned models like ChatGPT were less sensitive to the number of demonstrations.
  2. Topic Collapse: The initial topic generation can result in a large number of overlapping or highly similar topics across the entire dataset. This stage aims to group and condense these into a predefined number (K) of distinct topics. Two approaches are proposed:
    • Prompt-Based Matching (PBM): Uses the LLM to iteratively merge the least frequent topic into an existing topic from a sorted list, prompting the LLM for each merge decision (sketched below, after this list). A sliding-window approach is used for datasets with a very large number of initial unique topics to stay within LLM token limits.
    • Word Similarity Matching (WSM): Computes similarity between topics based on the overlap of their top words, derived from Class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) scores. Topics with high similarity are merged iteratively until K topics remain. For large datasets, PBM is first used to reduce the topic count to an intermediate number (G) before applying WSM.
  3. Topic Representation Generation: To evaluate the quality of the collapsed topics, they need to be represented by a set of salient words. The paper uses c-TF-IDF scores to identify the top words for each topic cluster. An LLM is then used as a final filtering step to select the top 10 most representative words from the c-TF-IDF list, ensuring relevance and coherence.
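To make Stage 1 concrete, here is a minimal sketch of the topic generation step, assuming a hypothetical `complete(prompt)` helper that wraps whichever LLM is used (the ChatGPT API or a local LLaMA-13B); the demonstration examples and prompt wording are illustrative, not the authors' exact prompts.

```python
N_DEMOS = 4  # the paper found N = 4 demonstration examples worked well

# Hypothetical in-context demonstrations (document, topic) -- not from the paper.
DEMONSTRATIONS = [
    ("The new phone's battery dies within hours of light use.", "battery life"),
    ("Parliament passed the climate bill after a heated debate.", "climate policy"),
    ("The pasta was overcooked but the tiramisu was excellent.", "food quality"),
    ("The striker scored twice in the final minutes of the match.", "soccer"),
]

def build_prompt(document: str) -> str:
    """Assemble a few-shot prompt asking the LLM for the document's topic."""
    lines = ["Extract the main topic of the document as a short phrase.", ""]
    for text, topic in DEMONSTRATIONS[:N_DEMOS]:
        lines += [f"Document: {text}", f"Topic: {topic}", ""]
    lines += [f"Document: {document}", "Topic:"]
    return "\n".join(lines)

def generate_topics(documents, complete):
    """Prompt the LLM once per document and collect raw topic strings."""
    return [complete(build_prompt(doc)).strip().lower() for doc in documents]
```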
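Likewise, a sketch of the PBM merge loop, reusing the hypothetical `complete` wrapper; the merge prompt, window size, and answer parsing are illustrative assumptions, since the paper does not publish its exact prompts.

```python
from collections import Counter

def prompt_based_matching(raw_topics, complete, K, window=50):
    """Repeatedly ask the LLM to fold the least frequent topic into an
    existing one until K topics remain. `window` caps how many candidate
    topics appear in each prompt, to respect the LLM's context limit."""
    counts = Counter(raw_topics)
    while len(counts) > K:
        rare = min(counts, key=counts.get)  # least frequent topic
        candidates = [t for t, _ in counts.most_common(window) if t != rare]
        prompt = (
            f"Which of the following topics is closest in meaning to '{rare}'? "
            "Answer with the topic name only.\n"
            + "\n".join(f"- {t}" for t in candidates)
        )
        answer = complete(prompt).strip().lower()
        if answer in counts and answer != rare:
            counts[answer] += counts.pop(rare)
        else:
            counts.pop(rare)  # illustrative fallback when the reply doesn't parse
    return counts
```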

For implementation, the authors experimented with both the ChatGPT API and the LLaMA-13B model. They preprocessed the text data similarly to traditional topic modeling, removing punctuation and stopwords and performing lemmatization (except for Twitter data). Key parameters determined empirically include the number of demonstration examples (N = 4) and the intermediate topic count for WSM (G = 400 for 20 NewsGroup and Twitter Tweet, G = 200 for Yelp Reviews).
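WSM's merge loop admits a similarly compact sketch. The version below assumes a `top_words` helper that returns a topic's highest-scoring words under c-TF-IDF, and uses overlap counts as the similarity; both are stated assumptions where the paper leaves details open.

```python
def word_similarity_matching(topic_to_docs, top_words, K):
    """Merge the most word-similar pair of topics until K topics remain.

    topic_to_docs: dict mapping a topic name to its assigned documents.
    top_words:     callable returning a topic's top words via c-TF-IDF
                   (in BERTopic's formulation, roughly
                   tf(w, class) * log(1 + A / f(w)), with A the average
                   word count per class -- stated as an assumption, not
                   the authors' exact scoring code).
    """
    topics = dict(topic_to_docs)
    while len(topics) > K:
        names = list(topics)
        best_pair, best_overlap = None, -1
        for i, a in enumerate(names):
            words_a = set(top_words(topics[a]))
            for b in names[i + 1:]:
                overlap = len(words_a & set(top_words(topics[b])))
                if overlap > best_overlap:
                    best_pair, best_overlap = (a, b), overlap
        a, b = best_pair
        topics[a] = topics[a] + topics.pop(b)  # fold b's documents into a
    return topics
```

Each pass over the candidate pairs is quadratic in the number of topics, which is one reason the paper first runs PBM down to the intermediate count G on the larger datasets before applying WSM.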

The performance of \textsf{PromptTopic} was evaluated against several state-of-the-art baseline models (LDA, NMF, CTM, TopClus, Cluster-Analysis, BERTopic) on three diverse datasets (20 NewsGroup, Yelp Reviews, Twitter Tweet) using quantitative metrics (NPMI for coherence, Topic Diversity) and qualitative assessments (Word Intrusion Task, manual inspection of topic words).
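Both quantitative metrics are simple to state. The sketch below follows the standard definitions (Topic Diversity as the fraction of unique words among all topics' top words, commonly the top 25; NPMI as pointwise mutual information normalized by the joint self-information), not the paper's exact evaluation code.

```python
import math

def topic_diversity(topics, top_n=25):
    """TD: fraction of unique words among the top-n words of every topic."""
    words = [w for topic in topics for w in topic[:top_n]]
    return len(set(words)) / len(words)

def npmi(p_i, p_j, p_ij, eps=1e-12):
    """NPMI(wi, wj) = log(p(wi, wj) / (p(wi) * p(wj))) / -log p(wi, wj),
    with probabilities estimated from co-occurrence counts in a reference
    corpus; ranges from -1 (never co-occur) to 1 (always co-occur)."""
    p_ij = max(p_ij, eps)
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)
```

A model's coherence score is then the NPMI averaged over the top word pairs of each topic.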

Implementation Insights and Findings:

  • LLM Choice: Both ChatGPT and LLaMA-13B were used. LLaMA-13B, despite being significantly smaller and not instruction-tuned, showed comparable performance to ChatGPT with careful prompt simplification and few-shot examples. This suggests that even moderately sized LLMs can be effective.
  • Topic Collapse Strategy: PromptTopic-WSM generally outperformed the baseline models and PromptTopic-PBM on standard quantitative metrics (NPMI, TD) across different datasets.
  • Short Text Performance: Human evaluation via the Word Intrusion Task revealed that while PromptTopic-WSM and BERTopic struggled with short texts (like Twitter Tweets), PromptTopic-PBM achieved notably higher accuracy, suggesting its strength on data with little word co-occurrence, where the sentence-level semantics captured by LLMs matter most. This highlights a practical advantage of PBM despite its sometimes lower quantitative scores (e.g., on Yelp Reviews, where PBM's high diversity dispersed specific food terms across topics and depressed coherence).
  • Scalability: The paper acknowledges that using LLMs for topic generation across large datasets is resource-intensive, requiring significant GPU memory for models like LLaMA or incurring API costs. The iterative nature of PBM and the need for PBM assistance in WSM for massive topic sets also add computational complexity.

Practical Applications:

PromptTopic can be applied to tasks requiring an understanding of thematic structures in diverse text data, especially where traditional models struggle due to short text length or complex language use. Examples include:

  • Analyzing Social Media Data: Extracting topics from tweets, comments, or forum posts where context is often limited to a few sentences.
  • Processing Customer Reviews: Identifying themes in short product or service reviews.
  • Exploring Domain-Specific Texts: Discovering concepts in specialized documents where jargon or domain knowledge is important, which LLMs can potentially handle better than traditional models with fixed vocabularies.
  • Qualitative Data Analysis: Assisting researchers in quickly identifying recurring themes in interview transcripts or open-ended survey responses.

The method provides a potentially more accessible route to high-quality topic extraction by reducing the need for expert domain knowledge for hyperparameter tuning, relying instead on the inherent capabilities of LLMs.

Limitations and Future Work:

Key limitations include the computational cost and resource requirements of using LLMs, especially for very large document collections. The prompt-based merging in PBM could be improved to incorporate more context beyond just topic names to prevent merging unrelated concepts. Future work proposed by the authors includes enhancing batch-wise merging in PBM and further exploring prompt engineering techniques for optimizing topic modeling with LLMs.

Authors: Han Wang, Nirmalendu Prakash, Nguyen Khoi Hoang, Ming Shan Hee, Usman Naseem, Roy Ka-Wei Lee