BERTopic Modeling Overview
- BERTopic is a neural topic modeling framework that integrates sentence embeddings, UMAP, HDBSCAN, and c-TF-IDF to extract coherent topics.
- It employs Transformer-based language models to generate contextual embeddings and uses UMAP to reduce their dimensionality before clustering, which improves both clustering quality and runtime.
- HDBSCAN clustering combined with c-TF-IDF topic representation yields superior coherence and diversity compared to classical approaches like LDA and NMF.
BERTopic is a neural topic modeling framework that combines recent advances in sentence embedding, manifold-based dimensionality reduction, density-based clustering, and interpretable topic representation. The method is distinguished by its use of pre-trained Transformer-based language models for contextual embeddings, an unsupervised clustering step in embedding space, and a class-based TF-IDF mechanism for robust topic labeling. BERTopic has achieved superior topic coherence and diversity across languages, genres, and domain applications compared to classical approaches such as LDA and NMF.
1. Pipeline Architecture and Mathematical Foundations
BERTopic orchestrates topic modeling through four sequential modules: (1) document embedding, (2) dimensionality reduction, (3) clustering, and (4) cluster representation via class-based TF-IDF.
- Embedding Extraction: Each document $d_i$ is mapped to a dense vector $\mathbf{e}_i = f(d_i) \in \mathbb{R}^{m}$, where $f$ is a pre-trained Sentence Transformer (e.g., MiniLM, mpnet, distiluse) and $m$ is the embedding dimension (384–768, depending on the model) (Medvecki et al., 2024).
- Dimensionality Reduction (UMAP): The set of high-dimensional embeddings is projected onto a low-dimensional manifold ($d' \ll m$, typically 5 dimensions) using UMAP. UMAP constructs fuzzy simplicial sets in both the high- and low-dimensional spaces and minimizes the cross-entropy between them:
$$C = \sum_{i,j} \left[ v_{ij} \log \frac{v_{ij}}{w_{ij}} + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \right],$$
where $v_{ij}$ and $w_{ij}$ denote membership strengths in the original and reduced spaces, respectively (Grootendorst, 2022).
- Clustering (HDBSCAN): HDBSCAN discovers clusters in the reduced space by building a hierarchical tree of density-connected points and selecting maximally stable clusters. Outliers are labeled as noise (cluster $-1$). Key parameters include min_cluster_size (controls topic granularity) and min_samples (Medvecki et al., 2024).
- Topic Representation (c-TF-IDF): For each topic $c$, BERTopic computes a class-based TF-IDF weight for term $t$:
$$W_{t,c} = \frac{\mathrm{tf}_{t,c}}{|c|} \cdot \log \frac{N}{\mathrm{df}_t},$$
where $\mathrm{tf}_{t,c}$ is the frequency of $t$ in cluster $c$, $|c|$ is the cluster size, $N$ is the number of documents, and $\mathrm{df}_t$ is the document frequency of $t$. The top-$k$ terms ranked by $W_{t,c}$ constitute the topic descriptor (Medvecki et al., 2024, Grootendorst, 2022).
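A minimal sketch of this four-stage pipeline using the BERTopic library's standard components is shown below; the parameter values are illustrative rather than prescriptive, and docs is assumed to be a list of raw document strings.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# (1) Document embedding with a multilingual Sentence Transformer
embedding_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# (2) Dimensionality reduction: project embeddings down to 5 dimensions
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.1,
                  metric="cosine", random_state=42)

# (3) Density-based clustering; documents labeled -1 are outliers/noise
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=10,
                        metric="euclidean", prediction_data=True)

# (4) Cluster representation via c-TF-IDF is computed internally by BERTopic
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       top_n_words=10)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```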
2. Embedding Model Selection and Intermediate Layer Strategies
BERTopic is agnostic to the Transformer encoder used, allowing the integration of multilingual and domain-specific models. Key findings show that:
- Multilingual Sentence Transformers: When monolingual models are unavailable (e.g., Serbian), multilingual transformers pretrained on 50+ languages—such as paraphrase-multilingual-mpnet-base-v2 (768d), distiluse-base-multilingual-cased-v2 (512d), and paraphrase-multilingual-MiniLM-L12-v2 (384d)—yield high topic diversity and coherence. Among these, larger models (mpnet) tend to maximize topic coherence, especially on morphologically rich, partially processed texts (Medvecki et al., 2024).
- Embedding Layer Selection: Extracting representations from intermediate model layers and varying the pooling method (mean, max, CLS) serves as a powerful tuning strategy. Results show that mean pooling over the embedding/token layer improved coherence on the UN and Newsgroups datasets, while max pooling over the sum of all encoder layers boosted coherence on short texts (Trump tweets). CLS pooling underperformed, indicating that the [CLS] token alone is insufficient for nuanced topic separation. In general, aggregating layers with max pooling tends to increase topic diversity, whereas mean pooling favors coherence. Stop-word removal universally enhances performance (Koterwa et al., 10 May 2025).
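As a rough illustration of this strategy (not the cited authors' exact procedure), the sketch below mean-pools token vectors from a chosen hidden layer and feeds the resulting embeddings to BERTopic via its precomputed-embeddings interface; layer_mean_embeddings is a hypothetical helper, and docs is assumed to be a list of strings.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from bertopic import BERTopic

def layer_mean_embeddings(texts, model_name="sentence-transformers/all-MiniLM-L6-v2", layer=0):
    """Mean-pool token vectors from one hidden layer (0 = embedding/token layer)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    vectors = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True)
            hidden = model(**enc).hidden_states[layer]     # (1, seq_len, dim)
            mask = enc["attention_mask"].unsqueeze(-1)     # mask out padding
            pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean pooling
            vectors.append(pooled.squeeze(0).numpy())
    return np.vstack(vectors)

embeddings = layer_mean_embeddings(docs, layer=0)
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
```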
3. Robustness Across Preprocessing Levels and Languages
BERTopic displays resilience to varying levels of text preprocessing:
- Partial Preprocessing: Even minimal normalization (transliteration, token filtering) without lemmatization yields interpretable topics. Full morphological normalization (lemmatization) delivers a slight gain in topic coherence but is not strictly required if contextual embeddings are used (Medvecki et al., 2024, Kandala et al., 20 Apr 2025).
- Low-resource and Morphologically Rich Languages: Despite the absence of monolingual SBERTs, BERTopic with multilingual models produces informative clusters in Serbian (Medvecki et al., 2024), Hindi (Mutsaddi et al., 7 Jan 2025), and Dutch (Kandala et al., 20 Apr 2025). In qualitative and quantitative benchmarks, BERTopic outperforms LDA, NMF, and Top2Vec on short and open-ended texts in terms of NPMI-based coherence and diversity, even when classical models are constrained to the same number of topics.
4. Hyperparameter Sensitivity and Optimization
BERTopic's performance pivots on several core hyperparameters:
- UMAP: n_neighbors determines structure preservation (default: 15), and min_dist controls cluster tightness (default: 0.1). Increasing n_neighbors yields more global clusters; reducing it favors local specificity.
- HDBSCAN: min_cluster_size adjusts topic granularity; raising its value collapses small, noisy clusters into larger, interpretable ones. Automatic cluster count selection often produces overly fine topics; manual adjustment to domain-optimal values (e.g., 10–15) is common (Medvecki et al., 2024, Opu et al., 13 Jun 2025).
- Cluster Outlier Strategy: HDBSCAN can discard >50% of points as noise. BERTopic's reduce_outliers method can assign noise points to topics via c-TF-IDF-based re-assignment, improving coverage (see the sketch after this list) (Medvecki et al., 2024).
- Vocabulary Size: When benchmarking against LDA/NMF, constraining vocabulary and topic counts to matched values enables fair comparison. BERTopic typically yields higher coherence and more granular subtopics (Medvecki et al., 2024, Krishnan, 2023).
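The sketch below illustrates these knobs, combining coarser UMAP/HDBSCAN settings with c-TF-IDF-based outlier re-assignment; the specific values are illustrative, and docs is assumed to be a list of strings.

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Coarser topics: larger neighborhoods and a larger minimum cluster size
umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.1,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=50, min_samples=10, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# Re-assign documents labeled -1 (noise) to their closest topic via c-TF-IDF
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)
```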
5. Quantitative Benchmarks: Coherence and Diversity Evaluation
Topic quality in BERTopic is routinely assessed using normalized pointwise mutual information (NPMI) for coherence and the fraction of unique top terms for diversity:
- Topic Coherence (TC): the mean normalized pointwise mutual information over all pairs of top-ranked terms per topic,
$$\mathrm{TC} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\binom{N}{2}} \sum_{i<j} \mathrm{NPMI}\!\left(w_i^{(k)}, w_j^{(k)}\right), \qquad \mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)},$$
with probabilities estimated via document co-occurrences. TC ranges from $-1$ to $1$; higher is better (Grootendorst, 2022, Medvecki et al., 2024).
- Topic Diversity (TD): the fraction of unique terms among the top-$N$ terms of all $K$ topics,
$$\mathrm{TD} = \frac{\left|\bigcup_{k=1}^{K} \mathrm{top}_N(k)\right|}{K \cdot N}.$$
TD near 1 signifies highly varied topic term sets; benchmarks show BERTopic achieves TD > 0.85 for most datasets and languages.
BERTopic typically outperforms classical models (LDA, NMF) on coherence and at least matches them in diversity. For example, on Serbian vaccine-hesitancy tweets, BERTopic's best TC of −0.054 (distiluse-base) surpasses LDA's −0.104, while diversity remains comparable (0.892 versus 0.897 for LDA) (Medvecki et al., 2024).
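One way to compute both metrics for a fitted model is sketched below using gensim's CoherenceModel; tokenized_docs is an assumed list of token lists from the reference corpus, and the top terms are read off the fitted BERTopic model.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Top-10 terms per topic from a fitted BERTopic model (skip the -1 outlier topic)
topic_words = [[w for w, _ in topic_model.get_topic(t)]
               for t in topic_model.get_topics() if t != -1]

dictionary = Dictionary(tokenized_docs)

# NPMI-based coherence, averaged over topics
tc = CoherenceModel(topics=topic_words, texts=tokenized_docs,
                    dictionary=dictionary, coherence="c_npmi").get_coherence()

# Topic diversity: fraction of unique terms among all topics' top terms
all_terms = [w for words in topic_words for w in words]
td = len(set(all_terms)) / len(all_terms)
print(f"TC = {tc:.3f}, TD = {td:.3f}")
```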
6. Qualitative Insights, Limitations, and Best Practices
- Interpretability: Topic structures remain meaningful under minimal preprocessing, and embeddings capture fine-grained phenomena (e.g., manufacturer-specific vaccine mistrust, conspiracies) beyond classical models' reach (Medvecki et al., 2024).
- Parameter Tuning: Default hyperparameters suffice as a baseline, but granularity can be adjusted by tuning nr_topics or min_cluster_size to match domain requirements (see the sketch after this list).
- Outlier Handling: Large fractions of points may be classified as noise; outlier reduction strategies should be employed, especially in short-text analysis.
- Model Selection: Larger transformer models (e.g., mpnet) increase coherence but incur higher computational cost. Layer-wise embedding selection offers additional gains if time and resources permit (Koterwa et al., 10 May 2025).
- Recommendations: Always fix the random seed for UMAP (random_state) when analyzing topic assignments qualitatively; UMAP is the main stochastic component of the pipeline, so this keeps runs reproducible.
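A brief sketch of the topic-count adjustment mentioned above (the target of 15 topics is purely illustrative, and docs is assumed to be a list of strings):

```python
from bertopic import BERTopic

# Merge fine-grained clusters into a target number of topics during fitting...
topic_model = BERTopic(nr_topics=15)
topics, _ = topic_model.fit_transform(docs)

# ...or reduce post hoc on an already fitted model
topic_model.reduce_topics(docs, nr_topics=15)
```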
Limitations include HDBSCAN's tendency to discard documents as outliers, the hard assignment of each document to a single topic, and the lack of probabilistic topic membership in the standard pipeline (Groot et al., 2022, Medvecki et al., 2024). Extensions incorporating soft clustering and multi-topic assignment have been identified as future directions.
7. Application Domains and Future Directions
BERTopic has been deployed in multilingual social media analytics, customer feedback mining, educational recommender systems, historical document analysis, aviation safety, and discourse study. Its architecture supports hybridized pipelines (e.g., integrating with NMF for hierarchical modeling (Cheng et al., 2022)) and dynamic topic tracking (e.g., mapping the evolution of regulatory themes over time (Golpayegani et al., 16 Sep 2025)).
Areas identified for future research include:
- Recursive or hierarchical topic modeling for subtopic discovery
- Integration of contextualized phrase ranking for improved topic labels
- Automated hyperparameter tuning based on coherence/diversity feedback
- Incorporation of soft topic assignments
- Domain-specific embedding fine-tuning for maximal performance in highly specialized corpora
In summary, BERTopic offers a robust, flexible neural topic modeling solution that consistently surpasses classical bag-of-words approaches in coherence, interpretability, and adaptability across a range of languages, genres, and application domains (Medvecki et al., 2024, Koterwa et al., 10 May 2025, Grootendorst, 2022).