This paper, "Prompting LLMs for Topic Modeling" (Wang et al., 2023), introduces a novel topic modeling approach called \textsf{PromptTopic} that leverages large language models (LLMs). The authors aim to address limitations of traditional topic models, such as poor performance on short texts, reliance on word-level co-occurrence statistics, and the need for extensive manual hyperparameter tuning.
The core idea behind \textsf{PromptTopic} is to use the advanced language understanding capabilities of LLMs to extract topics at a more semantic level, particularly focusing on sentence-level context, rather than just token-level statistics. The method is unsupervised and consists of three main stages:
- Topic Generation: For each document, an LLM is prompted to extract relevant topics. The prompts include demonstration examples, following an in-context learning approach, to guide the LLM's output format and quality. The authors found that four demonstration examples yielded good performance, especially for smaller LLMs like LLaMA; instruction-tuned models like ChatGPT are less sensitive to the number of demonstrations.
- Topic Collapse: The initial topic generation can result in a large number of overlapping or highly similar topics across the entire dataset. This stage groups and condenses them into a predefined number of distinct topics. Two approaches are proposed:
- Prompt-Based Matching (PBM): Uses LLMs to iteratively merge the least frequent topic with an existing topic from a sorted list, prompting the LLM for the merge decision. A sliding window approach is used for datasets with a very large number of initial unique topics to stay within LLM token limits.
- Word Similarity Matching (WSM): Computes similarity between topics based on the overlap of their top words, derived from Class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) scores. The most similar topics are merged iteratively until the desired number of topics remains. For large datasets, PBM is first used to reduce the topic count to an intermediate size before applying WSM.
- Topic Representation Generation: To evaluate the quality of the collapsed topics, they need to be represented by a set of salient words. The paper uses c-TF-IDF scores to identify the top words for each topic cluster. An LLM is then used as a final filtering step to select the top 10 most representative words from the c-TF-IDF list, ensuring relevance and coherence.
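The c-TF-IDF scoring and WSM merging used in the last two stages can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation; the function names, the whitespace tokenization, and the greedy highest-overlap merge order are my assumptions.

```python
import math
from collections import Counter

def ctfidf_top_words(topic_docs, top_n=10):
    """Rank words per topic with class-based TF-IDF (c-TF-IDF).

    topic_docs: dict mapping topic name -> list of documents (strings)
    assigned to that topic. Returns dict topic -> list of top_n words.
    """
    # Term frequencies per topic ("class")
    tf = {t: Counter(w for d in docs for w in d.lower().split())
          for t, docs in topic_docs.items()}
    # Average words per class, and word frequency across all classes
    avg_words = sum(sum(c.values()) for c in tf.values()) / len(tf)
    word_freq = Counter()
    for c in tf.values():
        word_freq.update(c)
    top = {}
    for t, c in tf.items():
        total = sum(c.values())
        scores = {w: (n / total) * math.log(1 + avg_words / word_freq[w])
                  for w, n in c.items()}
        top[t] = [w for w, _ in sorted(scores.items(),
                                       key=lambda kv: -kv[1])[:top_n]]
    return top

def collapse_by_word_overlap(topic_docs, k, top_n=10):
    """WSM sketch: repeatedly merge the pair of topics whose top-word
    lists overlap the most, until only k topics remain."""
    topics = dict(topic_docs)
    while len(topics) > k:
        top = ctfidf_top_words(topics, top_n)
        names = list(topics)
        a, b = max(((x, y) for i, x in enumerate(names)
                    for y in names[i + 1:]),
                   key=lambda p: len(set(top[p[0]]) & set(top[p[1]])))
        topics[a] = topics[a] + topics.pop(b)  # fold b's documents into a
    return topics
```

In the paper, an LLM then filters the top c-TF-IDF words down to the ten most representative ones per topic; that final prompting step is omitted here.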
For implementation, the authors experimented with both the ChatGPT API and the LLaMA-13B model. They preprocessed the text data as in traditional topic modeling, removing punctuation and stopwords and performing lemmatization (except for the Twitter data). Key parameters determined empirically include the number of demonstration examples (four) and the intermediate topic count for WSM, which was set separately for 20 NewsGroup and Twitter Tweet versus Yelp Reviews.
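The preprocessing step is standard; a minimal sketch (the stopword list here is illustrative, not the paper's, and lemmatization is noted but omitted to keep the snippet dependency-free):

```python
import re

# Illustrative stopword list; the paper does not specify its exact list.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "in"}

def preprocess(doc):
    """Lowercase, strip punctuation, and drop stopwords.

    Lemmatization (applied in the paper to all datasets except the
    Twitter one) would plug in here, e.g. via NLTK's WordNetLemmatizer.
    """
    doc = re.sub(r"[^\w\s]", " ", doc.lower())
    return " ".join(t for t in doc.split() if t not in STOPWORDS)
```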
The performance of \textsf{PromptTopic} was evaluated against several state-of-the-art baseline models (LDA, NMF, CTM, TopClus, Cluster-Analysis, BERTopic) on three diverse datasets (20 NewsGroup, Yelp Reviews, Twitter Tweet) using quantitative metrics (NPMI for coherence, Topic Diversity) and qualitative assessments (Word Intrusion Task, manual inspection of topic words).
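Both quantitative metrics are simple to compute. A sketch follows; document-level co-occurrence is one common way to estimate the NPMI probabilities, and the boundary conventions for never/always co-occurring pairs are my assumptions:

```python
import math

def topic_diversity(topic_words):
    """Fraction of unique words across all topics' top-word lists;
    1.0 means no word is shared between topics."""
    flat = [w for words in topic_words for w in words]
    return len(set(flat)) / len(flat)

def npmi(wi, wj, doc_sets):
    """Normalized PMI in [-1, 1], estimated from how often the two
    words appear in the same document."""
    n = len(doc_sets)
    pij = sum(1 for s in doc_sets if wi in s and wj in s) / n
    if pij == 0.0:
        return -1.0   # never co-occur: NPMI's lower bound
    if pij == 1.0:
        return 1.0    # always co-occur: NPMI's upper bound
    pi = sum(1 for s in doc_sets if wi in s) / n
    pj = sum(1 for s in doc_sets if wj in s) / n
    return math.log(pij / (pi * pj)) / -math.log(pij)
```

Topic-level coherence is then the average of `npmi` over all pairs of a topic's top words.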
Implementation Insights and Findings:
- LLM Choice: Both ChatGPT and LLaMA-13B were used. LLaMA-13B, despite being significantly smaller and not instruction-tuned, showed comparable performance to ChatGPT with careful prompt simplification and few-shot examples. This suggests that even moderately sized LLMs can be effective.
- Topic Collapse Strategy: \textsf{PromptTopic-WSM} generally outperformed baseline models and \textsf{PromptTopic-PBM} on standard quantitative metrics (NPMI, TD) across different datasets.
- Short Text Performance: Human evaluation via the Word Intrusion Task revealed that while \textsf{PromptTopic-WSM} and BERTopic struggled with short texts (like Twitter Tweets), \textsf{PromptTopic-PBM} achieved notably higher accuracy. This suggests PBM's strength on data with sparse word co-occurrence, where the sentence-level semantics captured by LLMs matter most, and highlights a practical advantage of PBM despite its lower quantitative scores on some datasets (e.g., Yelp Reviews, where PBM's high diversity dispersed specific food terms across topics, hurting coherence).
- Scalability: The paper acknowledges that using LLMs for topic generation across large datasets is resource-intensive, requiring significant GPU memory for models like LLaMA or incurring API costs. The iterative nature of PBM and the need for PBM assistance in WSM for massive topic sets also add computational complexity.
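The iterative PBM loop discussed above, including the sliding window over candidate topics, can be sketched as follows. Here `ask_llm` is a hypothetical stand-in for the actual LLM prompt, and the stopping convention when the model declines to merge is my assumption:

```python
from collections import Counter

def collapse_pbm(topic_assignments, k, ask_llm, window=20):
    """PBM sketch: repeatedly ask an LLM to fold the least frequent
    topic into one of the `window` most frequent candidates; the
    window keeps the prompt within the LLM's token limit.

    topic_assignments: list of topic names, one per document.
    ask_llm(topic, candidates) -> chosen candidate name, or None.
    """
    counts = Counter(topic_assignments)
    while len(counts) > k:
        rarest = min(counts, key=counts.get)
        candidates = [t for t, _ in counts.most_common(window)
                      if t != rarest]
        target = ask_llm(rarest, candidates)
        if target in counts:
            counts[target] += counts.pop(rarest)
        else:
            break  # LLM declined; stop rather than loop forever
    return counts
```

In practice `ask_llm` would build a prompt listing the rare topic and the candidate names and parse the model's reply; a deterministic stub suffices to exercise the loop.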
Practical Applications:
\textsf{PromptTopic} can be applied to tasks requiring understanding thematic structures in diverse text data, especially where traditional models struggle due to short text length or complex language use. Examples include:
- Analyzing Social Media Data: Extracting topics from tweets, comments, or forum posts where context is often limited to a few sentences.
- Processing Customer Reviews: Identifying themes in short product or service reviews.
- Exploring Domain-Specific Texts: Discovering concepts in specialized documents where jargon or domain knowledge is important, which LLMs can potentially handle better than traditional models with fixed vocabularies.
- Qualitative Data Analysis: Assisting researchers in quickly identifying recurring themes in interview transcripts or open-ended survey responses.
The method provides a potentially more accessible route to high-quality topic extraction by reducing the need for expert domain knowledge for hyperparameter tuning, relying instead on the inherent capabilities of LLMs.
Limitations and Future Work:
Key limitations include the computational cost and resource requirements of using LLMs, especially for very large document collections. The prompt-based merging in PBM could be improved to incorporate more context beyond just topic names to prevent merging unrelated concepts. Future work proposed by the authors includes enhancing batch-wise merging in PBM and further exploring prompt engineering techniques for optimizing topic modeling with LLMs.