
BERTopic: Neural topic modeling with a class-based TF-IDF procedure (2203.05794v1)

Published 11 Mar 2022 in cs.CL

Abstract: Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based LLMs, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.

Citations (1,092)

Summary

  • The paper presents BERTopic, which combines transformer-based document embeddings with a modified class-based TF-IDF to yield coherent topic representations.
  • It leverages UMAP for dimensionality reduction and HDBSCAN for soft clustering, enhancing its ability to handle noisy, unstructured data.
  • The study demonstrates BERTopic’s competitive efficiency and flexibility in both static and dynamic topic modeling scenarios, offering practical benefits for NLP research.

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

The paper "BERTopic: Neural topic modeling with a class-based TF-IDF procedure" introduces BERTopic, a novel approach to topic modeling that extends prior clustering-based methods by employing a class-based variation of Term Frequency-Inverse Document Frequency (TF-IDF) to extract coherent topic representations. The model incorporates document embeddings generated by transformer-based LLMs, leading to semantically enriched document representations. The primary innovation in BERTopic lies in its integration of pre-trained LLMs, clustering algorithms, and a modified TF-IDF procedure for robust topic interpretation.

Methodology

BERTopic's methodology consists of three primary steps (a formula and a code sketch follow the list):

  1. Document Embeddings: Using Sentence-BERT (SBERT), BERTopic converts each document into a dense vector representation. These embeddings preserve the semantic context of documents, facilitating more accurate clustering.
  2. Dimensionality Reduction and Clustering: Given the high dimensionality of SBERT embeddings, Uniform Manifold Approximation and Projection (UMAP) is employed to reduce dimensionality while preserving local and global features. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is then used for clustering the reduced embeddings. HDBSCAN models clusters with a soft-clustering approach that allows for noise, improving the fidelity of cluster formation.
  3. Topic Representation: To generate topic representations, BERTopic modifies the classical TF-IDF to focus on clusters rather than individual documents. This class-based TF-IDF procedure treats all documents in a cluster as a single entity, calculating term importance relative to each cluster.
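
Concretely, the paper's class-based TF-IDF weight for a term t in a class (cluster) c is

$$W_{t,c} = \mathrm{tf}_{t,c}\cdot\log\!\left(1 + \frac{A}{f_t}\right)$$

where tf_{t,c} is the frequency of term t in class c, f_t is the frequency of term t across all classes, and A is the average number of words per class.

The following is a minimal sketch of the three steps, assuming the sentence-transformers, umap-learn, hdbscan, and scikit-learn packages; the parameter values are illustrative rather than the paper's exact settings, and the c-TF-IDF step is hand-rolled here rather than calling the bertopic library.

```python
import numpy as np
import umap
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# A real corpus (one of the paper's benchmark datasets).
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# Step 1: dense document embeddings from a pre-trained SBERT model.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Step 2: reduce dimensionality with UMAP, then cluster with HDBSCAN
# (label -1 marks documents HDBSCAN treats as noise).
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)

# Step 3: class-based TF-IDF. Treat each cluster's concatenated documents as
# one "class document", count terms per class, and weight by log(1 + A / f_t).
classes = sorted(set(labels) - {-1})
class_docs = [" ".join(d for d, l in zip(docs, labels) if l == c) for c in classes]
vectorizer = CountVectorizer(stop_words="english").fit(class_docs)
tf = vectorizer.transform(class_docs).toarray()  # tf[c, t]: frequency of term t in class c
f_t = tf.sum(axis=0)                             # frequency of each term across all classes
A = tf.sum() / len(classes)                      # average number of words per class
ctfidf = tf * np.log(1 + A / f_t)

# Top-10 words per topic.
vocab = vectorizer.get_feature_names_out()
topics = {c: [vocab[i] for i in ctfidf[row].argsort()[::-1][:10]]
          for row, c in enumerate(classes)}
```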

Dynamic Topic Modeling

BERTopic also supports dynamic topic modeling, enabling the analysis of topic evolution over time. The process involves generating global topic representations by initially ignoring temporal sequencing, followed by the creation of temporal-specific topic representations using pre-calculated global IDF values. A smoothing procedure is optionally applied to ensure linear evolution between time steps.
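
A minimal sketch of that recipe, continuing from the pipeline above (docs, labels, classes, vectorizer, A, and f_t carry over; timestamps is a hypothetical list of time labels aligned with docs):

```python
import numpy as np

# Global IDF part, computed once on the full corpus while ignoring time.
global_idf = np.log(1 + A / f_t)

# Per-timestep representations reuse the global IDF on timestep-local counts.
topics_over_time = {}
for t in sorted(set(timestamps)):
    slice_docs = [" ".join(d for d, l, ts in zip(docs, labels, timestamps)
                           if l == c and ts == t)
                  for c in classes]
    tf_t = vectorizer.transform(slice_docs).toarray()
    topics_over_time[t] = tf_t * global_idf
```

The optional smoothing step would then, for example, L1-normalize each row and average it with the corresponding row from the previous time step.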

Experimental Setup and Results

The authors validate BERTopic against several datasets including 20 NewsGroups, BBC News, and Trump's tweets, with a comprehensive evaluation setup provided by OCTIS. The evaluation metrics employed are Topic Coherence (TC), measured by normalized pointwise mutual information (NPMI), and Topic Diversity (TD), reflecting the percentage of unique words across topics.
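
For context, NPMI scores a topic by the co-occurrence statistics of its top words. A standard formulation for a word pair, ranging from -1 to 1 with higher values indicating greater coherence, is:

$$\mathrm{NPMI}(w_i, w_j) = \frac{\log\frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}$$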

Performance Metrics

  • Topic Coherence: BERTopic demonstrates competitive performance across datasets, excelling in particular on the Trump dataset, which suggests robustness on minimally preprocessed data.
  • Topic Diversity: While BERTopic achieves commendable scores, it consistently trails behind CTM in topic diversity, although it remains competitive.

Computational Efficiency

Wall times are measured to compare computational efficiency. BERTopic, using SBERT's "all-MiniLM-L6-v2" model, displays favorable performance compared to other neural topic models like CTM, which exhibits significantly higher computational costs. The classical models (NMF and LDA) are generally faster, but BERTopic's flexibility and performance trade-offs make it a compelling choice for many applications where embedding quality outweighs speed concerns.

Discussion

Strengths

  1. Flexibility in Embedding Selection: BERTopic's performance remains stable across different LLMs, particularly SBERT variants, highlighting its adaptability (a one-line backend swap is sketched after this list).
  2. Separation of Clustering and Topic Representation: This methodological separation offers significant flexibility in preprocessing and fine-tuning, accommodating diverse use cases.
  3. Robust Word-Topic Distributions: The innovative class-based TF-IDF enables dynamic and class-specific topic modeling, with potential extensions to other meta-data dimensions.
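
Because clustering and topic representation never touch the embedding model directly, swapping the backend is a one-line change. A hedged example, with model names taken from the sentence-transformers hub rather than the paper:

```python
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model can feed the same UMAP/HDBSCAN/c-TF-IDF stages;
# "all-mpnet-base-v2" is an illustrative alternative to "all-MiniLM-L6-v2".
embeddings = SentenceTransformer("all-mpnet-base-v2").encode(docs)
```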

Weaknesses

  1. Single Topic Assumption: BERTopic assumes a single topic per document, a limitation partially mitigated by leveraging HDBSCAN’s soft-clustering approach.
  2. Non-contextual Topic Representations: Generated topic representations are bag-of-words based, which may reduce interpretive clarity. This could potentially be addressed by incorporating maximal marginal relevance (MMR); one common formulation is sketched after this list.
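
As a reference point, the following is a standard MMR formulation for iteratively selecting topic words (this general form is not taken from the paper): T is the topic representation, W the candidate words, S the words selected so far, sim a similarity measure such as cosine, and λ trades off relevance against redundancy.

$$\mathrm{MMR} = \operatorname*{arg\,max}_{w_i \in W \setminus S}\left[\lambda\,\mathrm{sim}(w_i, T) - (1-\lambda)\max_{w_j \in S}\mathrm{sim}(w_i, w_j)\right]$$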

Conclusion

BERTopic represents a significant methodological advancement in neural topic modeling by integrating sophisticated, pre-trained transformer-based embedding techniques with a novel class-based TF-IDF method for topic representation. Its flexible design allows it to leverage ongoing improvements in LLMs while maintaining consistent and competitive performance across varied datasets. The paper's thorough experimental setup and evaluation provide a compelling case for BERTopic's utility in both static and dynamic topic modeling applications. Further enhancements may address the identified limitations, making BERTopic an even more robust tool for natural language processing researchers.
