BERTopic: Neural Topic Modeling
- The paper details a robust pipeline integrating transformer-based embeddings, UMAP, HDBSCAN, and c‑TF‑IDF to extract coherent and diverse topics.
- BERTopic is a neural topic modeling framework that leverages contextual document embeddings and density-based clustering to reveal interpretable topics across various languages.
- Empirical benchmarks show that BERTopic outperforms classical models like LDA and NMF in topic coherence and diversity, especially in low-resource and short-text settings.
BERTopic is a neural topic modeling framework that combines contextual transformer-based document embeddings, non-linear dimensionality reduction, density-based clustering, and class-based TF-IDF for interpretable topic representation. Unlike probabilistic generative models such as LDA or NMF, BERTopic leverages sentence-level semantic information through pre-trained transformer LLMs and thus enables flexible, high-coherence, and high-diversity discovery of latent topics, especially in short-text, low-resource, and multilingual settings (Grootendorst, 2022, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Koterwa et al., 10 May 2025).
1. Methodological Foundations and Model Pipeline
The BERTopic pipeline consists of four sequential stages:
- Contextual Document Embedding: Each input document (sentence, paragraph, tweet, etc.) is mapped to a dense vector using a pre-trained transformer such as Sentence-BERT (SBERT). Typical backbones include “all-mpnet-base-v2” (768-dim) or “all-MiniLM-L6-v2” (384-dim), with pooling across token representations (usually mean-pooling, but max-pooling and CLS pooling are options) (Koterwa et al., 10 May 2025). The embedding model can be domain-specific, monolingual, or multilingual depending on the language and availability (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Mendonca et al., 27 Oct 2025). For morphologically rich or low-resource languages, monolingual or well-trained cross-lingual SBERT variants are preferred (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024).
- Dimensionality Reduction (UMAP): Since transformer embeddings are high-dimensional, BERTopic applies UMAP (Uniform Manifold Approximation and Projection) to reduce the original $D$-dimensional embedding space to $d$ dimensions with $d \ll D$ (often $d = 5$, the library default), preserving local structure. UMAP constructs high- and low-dimensional fuzzy simplicial sets and minimizes the cross-entropy between them (Grootendorst, 2022, Arfaoui et al., 24 Nov 2025). Hyperparameters include n_neighbors (default 15), min_dist (default 0.1), and n_components $= d$ (Arfaoui et al., 24 Nov 2025, Medvecki et al., 5 Feb 2024, Schäfer et al., 11 Jul 2024).
- Density-Based Clustering (HDBSCAN): Reduced embeddings are clustered using HDBSCAN, a hierarchical, density-based method able to detect clusters of varying density and shape without requiring a fixed number of clusters. The key parameter is min_cluster_size (often 5–15), which controls the minimal cluster size. Documents that do not fit any cluster with sufficient density are labeled as noise/outliers (Grootendorst, 2022, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Arfaoui et al., 24 Nov 2025).
- Topic Representation via Class-Based TF‑IDF (c-TF‑IDF): For each cluster $c$, all assigned documents are concatenated into a single pseudo-document. c‑TF‑IDF then ranks terms as

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{N}{\mathrm{df}_t}\right),$$

where $\mathrm{tf}_{t,c}$ is the class term frequency of term $t$ in cluster $c$, $N$ is the number of clusters, and $\mathrm{df}_t$ is the number of clusters containing term $t$ (Grootendorst, 2022, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024). The top $M$ terms per cluster (usually 5–20) are used as topic keywords.
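The scoring above can be reproduced in a few lines; the following stand-alone sketch assumes one concatenated pseudo-document per cluster and mirrors the smoothed formulation given here (the function and example inputs are hypothetical; the bertopic library computes this internally).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(pseudo_docs, top_m=10):
    """Score terms per cluster; pseudo_docs holds one concatenated document per cluster."""
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(pseudo_docs).toarray()   # (N clusters, vocabulary)
    terms = vectorizer.get_feature_names_out()

    tf = counts / counts.sum(axis=1, keepdims=True)   # class term frequency tf_{t,c}
    n_clusters = counts.shape[0]                       # N
    df = (counts > 0).sum(axis=0)                      # clusters containing term t
    weights = tf * np.log(1 + n_clusters / df)         # W_{t,c}

    # Top-M keywords per cluster, highest weight first.
    return [[terms[i] for i in np.argsort(row)[::-1][:top_m]] for row in weights]

clusters = ["gpu cuda kernel training batch", "umap manifold projection embedding"]
print(c_tf_idf(clusters, top_m=3))
```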
This modular pipeline is extensible: any embedding model is supported, UMAP can be replaced by other nonlinear/projection methods, and clustering is decoupled from embedding generation (Grootendorst, 2022, Koterwa et al., 10 May 2025). The overall workflow is:
Preprocess text → Embed (BERT/SBERT) → UMAP → HDBSCAN → c-TF-IDF topic extraction
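The same workflow can be assembled with the open-source bertopic package together with sentence-transformers, umap-learn, and hdbscan. The sketch below is illustrative: the backbone, UMAP/HDBSCAN settings, and the 20 Newsgroups sample are example choices, not tuned recommendations.

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Example corpus (any list of strings works: sentences, tweets, abstracts, ...).
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

# 1. Contextual document embeddings (SBERT backbone).
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Non-linear dimensionality reduction to a low-dimensional manifold.
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")

# 3. Density-based clustering; low-density documents are labeled -1 (outliers).
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

# 4. c-TF-IDF keyword extraction is performed internally per cluster.
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # topic sizes and top c-TF-IDF keywords
```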
2. Embedding Choices and Layer Strategies
BERTopic’s performance is sensitive to the embedding layer, pooling strategy, and choice of model:
- Transformer Backbone: Monolingual models capture morphological nuance; multilingual SBERTs are vital for lower-resource or cross-lingual corpora (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Schäfer et al., 11 Jul 2024). Larger models (mpnet-base-v2) consistently perform well, but smaller distilled models (MiniLM, distilUSE) are fast and robust after strong normalization (Medvecki et al., 5 Feb 2024).
- Pooling and Hidden State Selection:
Recent work demonstrates that representations from intermediate layers can outperform the default last-layer mean-pooled setup. Max pooling or aggregating (sum/concat) the last four layers can yield substantial gains in topic diversity and coherence (up to 10–20% relative improvement) (Koterwa et al., 10 May 2025). CLS pooling should be avoided for this task. Empirical results indicate that stop-word removal prior to embedding further boosts performance in almost all cases. A pooling sketch follows this list.
- Short Text and Morphological Richness:
On short, morphologically complex texts, embeddings from well-trained multilingual transformers suffice even without lemmatization; topic quality does not degrade significantly if only minimal preprocessing (lowercasing, punctuation removal) is performed (Medvecki et al., 5 Feb 2024).
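As a concrete illustration of layer and pooling choices, the sketch below aggregates the last four hidden layers of an SBERT-style encoder and applies mean or max pooling over tokens (the layer selection, summation, and model name are illustrative assumptions, not the configuration reported in the cited work).

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

def embed(texts, layers=(-4, -3, -2, -1), pooling="mean"):
    """Sum the selected hidden layers, then pool over (non-padding) tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).hidden_states            # tuple of per-layer states
    token_states = torch.stack([hidden[i] for i in layers]).sum(dim=0)
    mask = batch["attention_mask"].unsqueeze(-1)          # mask out padding tokens
    if pooling == "max":
        return token_states.masked_fill(mask == 0, float("-inf")).max(dim=1).values
    return (token_states * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["topic modeling with transformers", "density-based clustering"])
print(vectors.shape)  # (2, hidden_size)
```

Precomputed vectors like these can be handed to BERTopic via fit_transform(docs, embeddings=...), bypassing the library's internal embedding step.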
3. Hyperparameter Selection and Stability
BERTopic introduces critical hyperparameters at multiple stages:
| Stage | Key Hyperparameters | Typical Values/Effects |
|---|---|---|
| Embedding | Model, layer, pooling | MPNet, MiniLM; mean/max pooling; see above |
| Dimensionality Reduction | n_neighbors, min_dist, d | 10–30, 0–0.1, 2–50 |
| Clustering | min_cluster_size, epsilon | min_cluster_size 5–15 (smaller → more, finer-grained topics, but more noise) |
| Keyword Extraction | Top-M words; c-TF-IDF | M = 5–20; see pipeline above |
Fine-tuning is essential. Systematic grid or random search over UMAP and HDBSCAN parameters improves cluster stability and interpretability (Arfaoui et al., 24 Nov 2025, Medvecki et al., 5 Feb 2024, Schäfer et al., 11 Jul 2024). Researchers evaluate clustering robustness through bootstrapped resampling and metrics such as Adjusted Rand Index, Normalized Mutual Information, and Variation of Information (Arfaoui et al., 24 Nov 2025). Stability and coherence are often in tension; hierarchical merging of fine-grained topics into super-topics can balance these objectives (Arfaoui et al., 24 Nov 2025).
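A small illustration of this tuning-plus-stability loop is sketched below: a grid over n_neighbors and min_cluster_size, with stability measured as the average adjusted Rand index between clusterings of overlapping subsamples (the grid values, subsampling scheme, and the doc_embeddings.npy file are illustrative assumptions).

```python
from itertools import product

import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.metrics import adjusted_rand_score

def cluster(embeddings, n_neighbors, min_cluster_size, seed=42):
    reduced = UMAP(n_neighbors=n_neighbors, n_components=5,
                   metric="cosine", random_state=seed).fit_transform(embeddings)
    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)

def stability(embeddings, n_neighbors, min_cluster_size, n_runs=5, frac=0.8, seed=0):
    """Mean pairwise ARI between clusterings of overlapping subsamples."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    runs = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = cluster(embeddings[idx], n_neighbors, min_cluster_size)
        runs.append(dict(zip(idx, labels)))               # doc index -> cluster label (-1 = noise)
    scores = []
    for i in range(n_runs):
        for j in range(i + 1, n_runs):
            shared = sorted(set(runs[i]) & set(runs[j]))  # documents seen in both runs
            scores.append(adjusted_rand_score([runs[i][k] for k in shared],
                                              [runs[j][k] for k in shared]))
    return float(np.mean(scores))

embeddings = np.load("doc_embeddings.npy")  # hypothetical precomputed SBERT vectors
for nn, mcs in product([10, 15, 30], [5, 10, 15]):
    print(nn, mcs, stability(embeddings, nn, mcs))
```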
4. Evaluation Metrics and Empirical Benchmarks
Evaluation focuses on two primary metrics:
- Topic Coherence ($C_v$, NPMI, UMass): Measures the semantic consistency of top keywords for each topic, typically using variants of normalized pointwise mutual information (NPMI) or sliding-window co-occurrence (Grootendorst, 2022, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Arfaoui et al., 24 Nov 2025). Higher values indicate more meaningful, interpretable topics.
- Topic Diversity (TD): Fraction of unique words among the top-$M$ words of all topics, quantifying redundancy (Grootendorst, 2022, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024).
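Both metrics are straightforward to compute from the top-word lists; the sketch below uses gensim's CoherenceModel for NPMI-style coherence and a direct unique-word ratio for diversity (the toy topics and tokenized documents are illustrative).

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def topic_diversity(topics):
    """Fraction of unique words across the top-M word lists of all topics."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)

def topic_coherence(topics, tokenized_docs, measure="c_npmi"):
    """NPMI coherence of topic keywords against a tokenized reference corpus."""
    dictionary = Dictionary(tokenized_docs)
    return CoherenceModel(topics=topics, texts=tokenized_docs,
                          dictionary=dictionary, coherence=measure).get_coherence()

topics = [["climate", "carbon", "emission"], ["match", "goal", "league"]]
docs = [["climate", "carbon", "emission", "policy"], ["goal", "match", "league", "season"]]
print(topic_diversity(topics), topic_coherence(topics, docs))
```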
Empirical results establish several regularities:
- On benchmark datasets (20 Newsgroups, BBC News, Trump Tweets), BERTopic with MPNet or similar SBERT backbones consistently outperforms LDA and NMF in both coherence (LDA scores only $0.058$ on 20 Newsgroups) and diversity (LDA at $0.75$) (Grootendorst, 2022).
- On low-resource, morphologically rich languages (Marathi, Serbian), BERTopic (with monolingual or robust multilingual SBERT) outperforms LDA and NMF in coherence (up to $0.82$ for Marathi vs. $0.34$–$0.55$ for LDA) and delivers higher topic diversity (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024).
- Preprocessing depth (lemmatization vs. raw text) has only marginal impact when strong contextual embeddings are used (Medvecki et al., 5 Feb 2024).
- The model remains robust to the choice of embedding backbone, provided it is well matched to the language/domain (Grootendorst, 2022).
5. Applications: Dynamic Modeling, Multilinguality, Hierarchy
BERTopic is widely adopted for both static and dynamic topic modeling in various settings:
- Dynamic Topic Evolution:
By running the pipeline in time-sliced bins (e.g., monthly), BERTopic enables temporal analysis of topic evolution (Mendonca et al., 27 Oct 2025). Topic continuity across time is tracked via Jaccard overlap or cosine similarity over c-TF‑IDF vectors (a matching sketch follows this list). This framework has been applied to the study of political discourse on Twitter, including downstream alignment with domain lexica such as Moral Foundations Theory. Topic persistence, splits, and merges across time can be formally analyzed (Mendonca et al., 27 Oct 2025).
- Multilingual and Cross-lingual Analysis:
With appropriate SBERT backbones, BERTopic handles German, English, Indic languages, and more. Side-by-side modeling of multiple corpora (e.g., fake news from various countries) enables identification of cross-lingual thematic overlap (Schäfer et al., 11 Jul 2024, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024).
- Hierarchical and Hybrid Model Integration:
In large-scale or multi-scale scenarios, hierarchical pipelines combine coarse, interpretable themes (via NMF) followed by fine-grained subtopic discovery (via BERTopic), improving both scalability and semantic resolution (Cheng et al., 2022). This approach retains NMF's ability to provide multi-label assignment at the broad level, followed by single-topic assignment at the fine level.
- Qualitative Research and Human Validation:
Domain experts routinely validate topic interpretability using Likert-scale ratings, weighted Cohen's kappa, and ICC. Hierarchical merging of initially fine-grained clusters can further improve both statistical coherence and human agreement (Arfaoui et al., 24 Nov 2025).
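A minimal sketch of topic-continuity tracking across adjacent time slices, using Jaccard overlap of top-word sets (the matching rule and the 0.3 threshold are illustrative assumptions; BERTopic also provides a built-in topics_over_time method for temporal binning).

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def match_topics(prev_topics, curr_topics, threshold=0.3):
    """Link each current topic to its best-overlapping predecessor, if any.

    prev_topics / curr_topics: dict of topic id -> list of top keywords.
    Returns: current topic id -> (previous topic id or None, overlap score).
    """
    links = {}
    for cid, c_words in curr_topics.items():
        best_pid, best_score = None, 0.0
        for pid, p_words in prev_topics.items():
            score = jaccard(c_words, p_words)
            if score > best_score:
                best_pid, best_score = pid, score
        links[cid] = (best_pid if best_score >= threshold else None, best_score)
    return links

# Example: topics from two consecutive monthly slices.
january = {0: ["election", "vote", "ballot"], 1: ["vaccine", "dose", "booster"]}
february = {0: ["vote", "ballot", "turnout"], 1: ["inflation", "prices", "energy"]}
print(match_topics(january, february))  # topic 0 persists; February's topic 1 is new
```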
6. Best Practices, Limitations, and Adaptations
Best Practices
- Always tune UMAP and HDBSCAN hyperparameters according to corpus size, document length, and desired granularity (Arfaoui et al., 24 Nov 2025, Medvecki et al., 5 Feb 2024, Schäfer et al., 11 Jul 2024).
- For short documents, use SBERT models trained/fine-tuned on semantic similarity tasks (NLI, STS) (Shinde et al., 4 Feb 2025).
- Leverage monolingual LLMs when available for morphologically complex or low-resource languages to capture fine nuances (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024).
- Remove stop-words prior to embedding for increased coherence and diversity (Koterwa et al., 10 May 2025).
- Systematic grid search and stability evaluation are advised for reproducibility and interpretability (Arfaoui et al., 24 Nov 2025).
- For large or hierarchical corpora, adopt staged (NMF → BERTopic) modeling pipelines (Cheng et al., 2022); a sketch of this staging follows this list.
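The staging can be sketched as follows: NMF on TF-IDF vectors yields coarse themes, then a separate BERTopic model is fit within each theme (the theme count, argmax single-assignment, and the min_docs cutoff are illustrative simplifications of the cited approach).

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from bertopic import BERTopic

def staged_topics(docs, n_coarse=10, min_docs=50):
    # Stage 1: coarse, interpretable themes via NMF on TF-IDF features.
    tfidf = TfidfVectorizer(stop_words="english", max_features=20000)
    doc_topic = NMF(n_components=n_coarse, init="nndsvd",
                    random_state=0).fit_transform(tfidf.fit_transform(docs))
    coarse = doc_topic.argmax(axis=1)  # single coarse label per document

    # Stage 2: fine-grained subtopics via BERTopic within each coarse theme.
    subtopic_models = {}
    for theme in range(n_coarse):
        subset = [d for d, c in zip(docs, coarse) if c == theme]
        if len(subset) < min_docs:     # skip themes too small to cluster reliably
            continue
        subtopic_models[theme] = BERTopic(min_topic_size=10).fit(subset)
    return coarse, subtopic_models
```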
Limitations and Known Constraints
- Each document is assigned to a single topic; HDBSCAN soft probabilities enable some post hoc flexibility but do not natively allow for multi-topic assignments (Grootendorst, 2022, Cheng et al., 2022). A probability-thresholding sketch follows this list.
- c-TF-IDF provides bag-of-words representations, which may include semantically redundant top words for each topic—post hoc reranking is sometimes necessary (Grootendorst, 2022).
- Scalability on extremely large corpora is limited by memory and embedding computation time, though hierarchical and divide-and-conquer strategies mitigate this bottleneck (Cheng et al., 2022).
- Optimal cluster granularity must balance interpretability and statistical stability; hierarchical merging can address this but requires careful calibration (Arfaoui et al., 24 Nov 2025).
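For the single-assignment limitation in particular, soft memberships can be thresholded post hoc; the sketch below uses BERTopic's calculate_probabilities option (the 0.1 threshold and the 20 Newsgroups sample are illustrative assumptions).

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(calculate_probabilities=True)  # soft HDBSCAN memberships per topic
topics, probs = topic_model.fit_transform(docs)       # probs: (n_docs, n_topics) array

def multi_label(probabilities, threshold=0.1):
    """Assign every topic whose soft membership probability clears the threshold."""
    return [list(np.where(p >= threshold)[0]) for p in probabilities]

print(multi_label(probs)[:5])  # some documents now carry several topic indices
```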
7. Comparison to Other Neural and Classical Topic Models
Relative to LDA, NMF, and autoregressive neural topic models (e.g., DocNADE), BERTopic delivers higher coherence, richer topic diversity, and better adaptability to short, informal, or non-English texts (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Grootendorst, 2022, Arfaoui et al., 24 Nov 2025). It is competitive with or complementary to other neural approaches such as Top2Vec and Contextualized Topic Models (CTM), and can be extended via transfer learning in multi-source neural topic architectures (Gupta et al., 2021). BERTopic's modular pipeline, decoupled from explicit generative assumptions, allows direct interoperability with dynamic modeling, domain-specific lexica, and supervised downstream analyses (Mendonca et al., 27 Oct 2025, Cheng et al., 2022).
By grounding topic assignment in dense contextual embeddings, flexible non-linear manifold learning, density-based clustering, and cluster-centric term weighting, BERTopic defines a state-of-the-art paradigm for neural topic modeling with demonstrated effectiveness across domains, scripts, and languages (Grootendorst, 2022, Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024, Arfaoui et al., 24 Nov 2025, Koterwa et al., 10 May 2025, Schäfer et al., 11 Jul 2024).