
BERTopic Topic Modeling Framework

Updated 3 January 2026
  • BERTopic is a modular topic modeling framework that uses transformer-based sentence embeddings to capture latent topics in text corpora.
  • The pipeline employs UMAP for dimensionality reduction and HDBSCAN for density-based clustering to ensure coherent and fine-grained topic extraction.
  • It computes discriminative class-based TF–IDF for keyword extraction, outperforming traditional models in coherence, diversity, and contextual sensitivity.

BERTopic-based topic modeling is a modern, modular framework for unsupervised extraction and representation of latent topics in text corpora. Leveraging transformer-based sentence embeddings, non-linear dimensionality reduction, density-based clustering, and a discriminative class-based TF–IDF, BERTopic enables extraction of fine-grained, semantically coherent topics—demonstrably outperforming classical generative models such as LDA and LSA in coherence, diversity, and contextual sensitivity across diverse domains including finance, political discourse, software engineering, and short multi-lingual texts (Grootendorst, 2022, Sangaraju et al., 2022, Opu et al., 13 Jun 2025, Murugaraj et al., 12 Dec 2025).

1. Architecture and Pipeline Components

The canonical BERTopic pipeline consists of the following stages (a minimal code sketch of the full pipeline follows the list):

  1. Sentence Embedding Generation. Each document $d_i$ in a corpus $D = \{d_1, \ldots, d_N\}$ is encoded via a pre-trained transformer (e.g., SBERT, MiniLM, FinBERT) to obtain a dense, fixed-length embedding $e_i \in \mathbb{R}^M$. The embedding dimension $M$ depends on the model: 384 for “paraphrase-MiniLM-L6-v2”, 768 for BERT/FinBERT, etc. (Sangaraju et al., 2022, Grootendorst, 2022).
  2. Dimensionality Reduction (UMAP). High-dimensional embeddings are projected into a low-dimensional manifold via UMAP, which constructs fuzzy topological representations in both the original and target spaces and optimizes the cross-entropy between them:

$$\min_{C} \sum_{i \neq j} \left( p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right)$$

where $p_{ij}$ ($q_{ij}$) are neighbor probabilities in the high- (low-) dimensional space. Hyperparameters such as $n_{\text{neighbors}}$, $n_{\text{components}}$ (typically 5–100), and min_dist (0.0–0.1) control locality and resolution (Sangaraju et al., 2022, Murugaraj et al., 12 Dec 2025, Opu et al., 13 Jun 2025).

  3. Clustering with HDBSCAN. On the UMAP-reduced embeddings, HDBSCAN identifies dense clusters (topics) and separates noise points via the mutual-reachability distance:

$$d_{\text{mreach}}(i, j) = \max \{ \text{core}_k(i), \text{core}_k(j), d(i, j) \}$$

Key parameters: min_cluster_size (controls minimum topic granularity, typically 10–1250), min_samples (conservativeness of noise assignment), and the Euclidean or cosine metric (Sangaraju et al., 2022, Murugaraj et al., 12 Dec 2025, Opu et al., 13 Jun 2025).

  4. Class-Based TF–IDF Topic Extraction. For each cluster/topic $t$, all member documents are concatenated into a “meta-document.” The importance of word $w$ for topic $t$ is computed as:

$$\mathrm{cTFIDF}_{t,w} = \frac{f_{t,w}}{\sum_{w' \in V} f_{t,w'}} \times \log\left(\frac{N}{n_w}\right)$$

where $f_{t,w}$ is the frequency of $w$ in topic $t$, $n_w$ is the number of documents containing $w$, and $N$ is the corpus size. The top-$n$ words by c-TF–IDF are selected as topic keywords (Sangaraju et al., 2022, Grootendorst, 2022).
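In practice, these four stages map directly onto the components of the bertopic Python package. The following is a minimal sketch, assuming the bertopic, sentence-transformers, umap-learn, hdbscan, and scikit-learn packages are installed; the 20 Newsgroups corpus and all hyperparameter values are illustrative only, not prescriptions from the cited studies.

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Illustrative corpus of raw documents
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Stage 1: transformer-based sentence embeddings (384-dimensional for this MiniLM variant)
embedding_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Stage 2: non-linear dimensionality reduction via UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")

# Stage 3: density-based clustering; unclustered documents are labeled as outliers (topic -1)
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=10,
                        metric="euclidean", prediction_data=True)

# Stage 4 (c-TF-IDF keyword extraction) is applied internally to each resulting cluster
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       top_n_words=10)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # topic sizes and top c-TF-IDF keywords
```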

2. Hyperparameterization and Implementation Variants

Key parameters and configuration decisions affect topic quality, coverage, and interpretability (an illustrative configuration sketch follows the list):

  • Embedding model: Off-the-shelf or domain-specific models (e.g., FinBERT for financial text) can be selected. Domain adaptation increases cluster homogeneity and topical separation (FinBERT $c_v = 0.3327$ vs. generic BERT $c_v = 0.3225$) (Sangaraju et al., 2022, Jehnen et al., 22 Apr 2025).
  • Dimensionality reduction: $n_{\text{neighbors}}$ controls topic granularity (lower values for shorter texts); $n_{\text{components}}$ impacts the number and stability of clusters (Opu et al., 13 Jun 2025, Mendonca et al., 27 Oct 2025).
  • Clustering: Adjusting min_cluster_size trades off fine-grained versus broad topics and the fraction of outliers. High values (e.g., 1000) produce robust, coarse-grained topics suitable for large-scale developer texts (Opu et al., 13 Jun 2025); defaults (10–15) work for social or financial documents (Sangaraju et al., 2022).
  • Topic extraction: Keywords may be further refined with diversity penalties (e.g., hybrid KeyBERT+MMR) or BM25-style adjustments (Mendonca et al., 27 Oct 2025, Opu et al., 13 Jun 2025).
  • Iterative extensions: Iterative BERTopic applies the pipeline repeatedly, removing outliers and adjusting cluster counts based on partition similarity (ARI, VDM, NVI) until the topic set stabilizes (Wong et al., 2024).
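As a rough illustration of how these configuration decisions surface in code, the sketch below assumes a recent bertopic release that ships the KeyBERTInspired and MaximalMarginalRelevance representation models and supports chaining them; the encoder choice and parameter values are examples, not recommendations from the cited studies.

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Coarse-grained configuration for a large corpus: high min_cluster_size,
# few UMAP components, and a modest neighborhood size
umap_model = UMAP(n_neighbors=50, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=1000, min_samples=20, prediction_data=True)

# Hybrid keyword refinement: KeyBERT-style re-ranking chained with an MMR diversity penalty
representation_model = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]

topic_model = BERTopic(
    # Generic encoder shown here; a domain-specific sentence encoder (e.g., a
    # FinBERT-based model) could be substituted for specialized corpora
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
)
```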

3. Evaluation Metrics and Empirical Performance

BERTopic evaluation leverages standard and model-specific quantitative and qualitative measures:

  • Coherence: $c_v$ (sliding window, NPMI, cosine similarity; higher is better, e.g., BERTopic–FinBERT $c_v = 0.3327$ vs. LDA $c_v = 0.3197$ on CFPB data) and UMass (document co-occurrence; less negative is better) (Sangaraju et al., 2022, Opu et al., 13 Jun 2025).
  • Diversity: Fraction of unique keywords in topic descriptors, indicating semantic distinctiveness among topics (Groot et al., 2022, Medvecki et al., 2024).
  • Cluster coverage / outlier analysis: HDBSCAN may label a large fraction as outliers (e.g., 74% in short course evaluations), suggesting consideration of k-Means as an alternative if full coverage is needed (Groot et al., 2022).
  • Qualitative inspection: Manual expert review of top keywords and representative texts, crucial for domains where semantic correctness and interpretability take precedence over numeric coherence (Opu et al., 13 Jun 2025, Sangaraju et al., 2022).
  • Task/Domain relevance: Domain-adapted encoders (e.g., FinBERT or FinTextSim) sharply increase topic coherence, precision, and alignment in specialized corpora (Jehnen et al., 22 Apr 2025).
| Model/Configuration | c_v (Coherence) | UMass | Notes |
|---|---|---|---|
| LSA (CFPB) | 0.2365 | -2.27 | Baseline |
| LDA (CFPB) | 0.3197 | -6.12 | Classical topic model |
| BERTopic–BERT (CFPB) | 0.3225 | -12.3 | Generic embedding |
| BERTopic–FinBERT (CFPB) | 0.3327 | -12.7 | Domain embedding |
| BERTopic–HDBSCAN (course eval / 20NG) | 0.091 / 0.166 | — | 74% / few % outliers |
| BERTopic–k-Means (course eval / 20NG) | 0.033 / 0.113 | — | 0% outliers, lower coherence |
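The coherence and diversity measures above can be computed along the following lines. This is a hedged sketch assuming an already-fitted BERTopic model, gensim for the $c_v$ computation, and topic keywords that appear in the tokenized documents' vocabulary; the helper name evaluate_topics is hypothetical.

```python
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def evaluate_topics(topic_model, tokenized_docs, top_n=10):
    # Collect the top-n keywords per topic, skipping the HDBSCAN outlier topic (-1)
    topics = [
        [word for word, _ in words][:top_n]
        for topic_id, words in topic_model.get_topics().items()
        if topic_id != -1
    ]

    # c_v coherence (sliding-window NPMI with cosine similarity), higher is better
    dictionary = Dictionary(tokenized_docs)
    coherence = CoherenceModel(
        topics=topics, texts=tokenized_docs,
        dictionary=dictionary, coherence="c_v",
    ).get_coherence()

    # Topic diversity: unique keywords divided by total keyword slots
    all_words = [w for topic in topics for w in topic]
    diversity = len(set(all_words)) / len(all_words)
    return coherence, diversity
```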

4. Applications and Empirical Impact Across Domains

BERTopic has been adopted in diverse domains:

  • Consumer financial complaints: Reveals granular, semantically precise topics using FinBERT (Sangaraju et al., 2022).
  • Political discourse: Captures topic evolution and alignment with moral frames (Care, Loyalty, Authority, etc.) over time; assigns morality scores per topic and quantifies topic longevity via intra-cluster tracking (Mendonca et al., 27 Oct 2025).
  • Software engineering: Identifies and hierarchizes 49 robust topics from 0.5M blockchain project issues, with explicit separation into general versus blockchain-specific themes and further subcategorization; resolution time and temporal dynamics are extracted by joining topic assignments with issue metadata (Opu et al., 13 Jun 2025).
  • Historical newspaper archives: Scalable to >100K documents; tracks temporal topic shifts in nuclear-energy discourse; achieves coherence gains of ~0.03–0.1 over LDA/NMF (Murugaraj et al., 12 Dec 2025).
  • Short text and low-resource language modeling: BERTopic outperforms LDA, NMF, and other classical models in Marathi and Hindi (topic coherence up to 0.82 with domain/pre-trained encoders) (Shinde et al., 4 Feb 2025, Mutsaddi et al., 7 Jan 2025).
  • Cross-lingual and noisy data: Minimal preprocessing suffices using strong multilingual encoders; hyperparameter tuning remains critical (Medvecki et al., 2024, Schäfer et al., 2024).

5. Methodological Advancements and Hybrid Models

Extensions and hybridizations further enhance BERTopic's interpretability and scalability:

  • Multi-scale hybridized frameworks: Shallow NMF partitions large text collections into broad topics, followed by BERTopic for detailed, context-rich subtopic extraction—yielding hierarchies of interpretable topics with improved resource efficiency and interpretational clarity (Cheng et al., 2022).
  • Intermediate-layer embeddings and pooling strategies: Max-/mean-pooling across transformer layers can outperform default configurations. Stop-word removal and aggregated-layer embeddings (sum/concat across last layers) can further improve coherence and diversity (Koterwa et al., 10 May 2025).
  • Iterative topic stabilization: Automated stopping rules (based on ARI, VDM, NVI) ensure convergence to a stable, semantically consistent topic set with minimized outliers (Wong et al., 2024).
  • Automated topic labeling: LLM-guided approaches for condensing BERTopic outputs into concise labels; selection of supporting context (e.g., summary sampling on largest subtopic) significantly affects representativeness (Khandelwal, 3 Feb 2025).
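To make the iterative stopping-rule idea concrete, the sketch below re-fits BERTopic, drops HDBSCAN outliers, and stops once the adjusted Rand index (ARI) between consecutive partitions exceeds a threshold. It is a simplified illustration in the spirit of the iterative extension, not the exact procedure of Wong et al. (2024); the function name, threshold, and round cap are assumptions, and document strings are assumed to be unique.

```python
from sklearn.metrics import adjusted_rand_score
from bertopic import BERTopic

def iterate_until_stable(docs, ari_threshold=0.9, max_rounds=5):
    prev_assignments = None  # maps document -> topic id from the previous round
    for _ in range(max_rounds):
        fitted_docs = docs
        topics, _ = BERTopic().fit_transform(fitted_docs)

        if prev_assignments is not None:
            # Compare consecutive partitions on the documents kept in both rounds
            prev = [prev_assignments[d] for d in fitted_docs]
            if adjusted_rand_score(prev, topics) >= ari_threshold:
                break  # partitions agree; topic set considered stable

        prev_assignments = dict(zip(fitted_docs, topics))
        # Drop HDBSCAN outliers (topic -1) before re-fitting in the next round
        docs = [d for d, t in zip(fitted_docs, topics) if t != -1]

    return fitted_docs, topics
```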

6. Comparative Analyses and Practical Considerations

Empirical studies consistently show BERTopic achieving superior or competitive coherence, diversity, and interpretability relative to:

  • Probabilistic models: LDA, PLSA, NMF, ARTM, which generally exhibit lower $c_v$, poorer semantic separation, and no natural mechanism for detecting outliers (Sangaraju et al., 2022, Nanyonga et al., 30 May 2025, Mutsaddi et al., 7 Jan 2025).
  • Embedding-based and clustering alternatives: Top2Vec, k-Means; BERTopic’s HDBSCAN variant generally produces higher-coherence topics, though k-Means may be preferable when full cluster coverage is mandatory (e.g., short-response settings) (Groot et al., 2022).
  • Human-in-the-loop evaluation: Qualitative researchers favored BERTopic for detailed, logically organized clusters, high topic diversity (0.995 vs. 0.733 for LDA), and the capacity to reveal niche or cross-cutting themes—although the method may yield an excessive number of fine-grained topics absent hierarchical grouping (Kaur et al., 2024).

Best practices include: selection or fine-tuning of the embedding model for domain specificity, minimal or corpus-sensitive preprocessing, careful tuning of UMAP/HDBSCAN hyperparameters, integration of expert validation for cluster naming, and complementary use of coherence/diversity metrics to guide parameter choices and topic postprocessing (Sangaraju et al., 2022, Opu et al., 13 Jun 2025, Shinde et al., 4 Feb 2025, Mutsaddi et al., 7 Jan 2025, Grootendorst, 2022, Kaur et al., 2024).

7. Limitations and Ongoing Directions

Despite demonstrable advantages, BERTopic-based pipelines exhibit characteristic limitations and active areas of methodological development:

  • Dense clustering may yield a high outlier rate: Particularly with HDBSCAN on short, heterogeneous texts; k-Means or parameter tuning may be required for adequate document coverage (Groot et al., 2022).
  • Lack of soft/overlapping cluster assignment: Current implementations perform hard clustering, obscuring documents' alignment with multiple topics; probabilistic/soft clustering remedies are under consideration (Murugaraj et al., 12 Dec 2025).
  • Redundant or semantically overlapping topics: Manual or automatic merging (e.g., via embedding similarity) is recommended to ameliorate topic fragmentation (Murugaraj et al., 12 Dec 2025).
  • Parameter-sensitivity and overfitting: Excessively granular topics may arise when optimizing solely for clustering validation scores; interpretability must remain central during tuning (Schäfer et al., 2024).
  • Resource requirements: Transformer inference remains limiting on large corpora, although embedding caching and minimalist preprocessing mitigate computational cost (Cheng et al., 2022, Murugaraj et al., 12 Dec 2025, Grootendorst, 2022).
  • Automated labeling and hierarchical organization: Recent progress in LLM-based topic summarization, combined with hierarchical layouts and representativeness metrics, improves end-user accessibility and the practical deployment of BERTopic outputs (Khandelwal, 3 Feb 2025).
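Two of the remedies above translate into short configuration changes. The sketch below assumes a recent bertopic release and uses illustrative parameter values: (1) swapping HDBSCAN for k-Means to force full document coverage, and (2) approximate per-document topic distributions as a stand-in for soft assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Illustrative corpus of raw documents
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# (1) k-Means clustering: every document receives a topic, typically at some cost in coherence
kmeans_model = BERTopic(hdbscan_model=KMeans(n_clusters=50, random_state=42))
kmeans_topics, _ = kmeans_model.fit_transform(docs)

# (2) Approximate topic distributions over a fitted HDBSCAN-based model, giving each
# document a weight over all topics rather than a single hard label
hdbscan_model = BERTopic().fit(docs)
topic_distr, _ = hdbscan_model.approximate_distribution(docs)
```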

BERTopic-based models thus represent the state of the art in unsupervised topic modeling for a wide range of text corpora, provided practitioners address outlier handling, domain-specific embedding selection, and the balance between granularity and interpretability. Empirical evidence across multiple studies supports its adoption as a preferred pipeline over classical probabilistic and bag-of-words alternatives in both general and domain-specific contexts.

References (17)
