
BERTopic: Neural Approach to Topic Modeling

Updated 30 November 2025
  • BERTopic is a modular neural topic modeling framework that leverages transformer embeddings, UMAP for dimensionality reduction, HDBSCAN clustering, and c-TF-IDF to generate interpretable topics.
  • It outperforms classical methods like LDA, NMF, and PLSA by delivering higher topic coherence, diversity, and expert-rated interpretability across various domains.
  • Applications include aviation safety analysis, financial document clustering, social media discussions, and multilingual texts, showcasing its flexibility, scalability, and robust performance.

BERTopic is a modular, neural topic modeling framework that combines transformer-based document embeddings, non-linear dimensionality reduction, density-based clustering, and a novel class-based TF–IDF weighting scheme. Designed to extract interpretable, high-coherence topics from both long and short textual corpora across diverse domains, BERTopic has demonstrated superior topic quality, flexibility, and scalability when compared to classical approaches such as Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF) (Nanyonga et al., 30 May 2025, Groot et al., 2022, Kaur et al., 19 Dec 2024, Mutsaddi et al., 7 Jan 2025). The following sections synthesize the core methodologies, mathematical foundations, empirical benchmarks, optimization strategies, best-practice guidelines, and use cases established in the literature.

1. Pipeline Architecture and Mathematical Foundations

BERTopic operates via a four-stage pipeline:

  1. Contextual Embedding Generation: Each document $d_i$ is tokenized and mapped to a dense vector $\mathbf{e}_i \in \mathbb{R}^d$ using a pre-trained transformer model (e.g., bert-base-nli-mean-tokens, all-MiniLM-L6-v2, ModernBERT, various multilingual SBERT variants) with mean-pooling over token-level hidden states (Nanyonga et al., 30 May 2025, Jehnen et al., 22 Apr 2025, Grootendorst, 2022).
  2. Dimensionality Reduction: High-dimensional embeddings are projected into a lower-dimensional space (typically 2–10 dimensions) using UMAP. UMAP minimizes the fuzzy set cross-entropy between high- and low-dimensional neighborhood graphs, with key hyperparameters: n_neighbors=15, n_components=5–10, min_dist=0.0–0.1, and cosine metric (Nanyonga et al., 30 May 2025, Grootendorst, 2022, Mutsaddi et al., 7 Jan 2025, Kandala et al., 20 Apr 2025).
  3. Clustering: HDBSCAN clusters the reduced vectors, identifying semantically dense regions while labeling low-density points as "outliers." Essential hyperparameters include min_cluster_size (controlling minimum topic size; typical range 10–30) and cluster_selection_method="eom" (excess of mass) (Nanyonga et al., 30 May 2025, Groot et al., 2022, Kaur et al., 19 Dec 2024).
  4. Topic Representation via Class-based TF–IDF (c-TF-IDF):

$$\mathrm{c\text{-}TFIDF}(t, c) = \frac{f_{t,c}}{\sum_{t'} f_{t',c}} \cdot \log\left(\frac{N}{n_t}\right)$$

where $f_{t,c}$ is the frequency of term $t$ in cluster $c$, $\sum_{t'} f_{t',c}$ is the total number of term occurrences in cluster $c$, $N$ is the number of clusters, and $n_t$ is the number of clusters containing $t$ (Nanyonga et al., 30 May 2025, Grootendorst, 2022, Kaur et al., 19 Dec 2024).
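
As a concrete illustration, the following is a minimal NumPy sketch that transcribes this weighting exactly as written above for a toy cluster-by-term count matrix; the released bertopic library may differ in smoothing details, so treat it as illustrative rather than a drop-in replacement.

```python
import numpy as np

def c_tf_idf(term_counts: np.ndarray) -> np.ndarray:
    """Class-based TF-IDF for a clusters-by-terms count matrix.

    term_counts[c, t] = frequency of term t over all documents in cluster c.
    """
    n_clusters = term_counts.shape[0]
    # Term frequency normalized by each cluster's total term count.
    tf = term_counts / term_counts.sum(axis=1, keepdims=True)
    # n_t: number of clusters in which each term occurs at least once.
    n_t = (term_counts > 0).sum(axis=0)
    idf = np.log(n_clusters / n_t)
    return tf * idf

# Toy example: 2 clusters, 3 terms.
counts = np.array([[5.0, 1.0, 0.0],
                   [0.0, 2.0, 4.0]])
print(c_tf_idf(counts).round(3))
```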

This modular approach enables leveraging the semantic richness of transformer embeddings, the manifold structure captured by UMAP, robust density-based clustering via HDBSCAN, and discriminative topic descriptions from c-TF-IDF.
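
A minimal end-to-end sketch of this four-stage pipeline, assuming the bertopic, sentence-transformers, umap-learn, and hdbscan packages, is shown below; the parameter values mirror the ranges cited above and are illustrative rather than prescriptive.

```python
# Minimal sketch of the four-stage BERTopic pipeline described above.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = [...]  # user-supplied list of raw document strings (hundreds or more in practice)

# 1. Contextual embeddings from a pre-trained SBERT model.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Non-linear dimensionality reduction with UMAP.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")

# 3. Density-based clustering with HDBSCAN ("eom" = excess of mass).
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

# 4. c-TF-IDF topic representations are computed internally by BERTopic.
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)

topics, probabilities = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # one row per discovered topic
print(topic_model.get_topic(0)[:10])        # top keywords for topic 0
```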

2. Quantitative Benchmarks and Comparative Analyses

Extensive benchmarking against PLSA, LDA, NMF, Top2Vec, and domain-specific topic models reveals that BERTopic consistently achieves higher or comparable topic coherence, diversity, and interpretability across several domains and languages (Nanyonga et al., 30 May 2025, Krishnan, 2023, Mutsaddi et al., 7 Jan 2025, Groot et al., 2022, Kaur et al., 19 Dec 2024). Key findings include:

| Model | Task/Corpus | Coherence $C_v$ | Diversity | Interpretability (expert, 1–5) |
|---|---|---|---|---|
| BERTopic | Aviation, NTSB | 0.41 | — | 4.3 |
| PLSA | Aviation, NTSB | 0.37 | — | 3.7 |
| LDA | Short Hindi texts | 0.38 | — | — |
| BERTopic | Short Hindi texts | 0.76 | — | — |
| BERTopic | Reddit, qualitative | 0.647 | 0.995 | Preferred by 8/12 researchers |
| LDA | Reddit, qualitative | 0.500 | 0.733 | — |
| NMF | Reddit, qualitative | 0.684 | 0.866 | — |

BERTopic outperforms PLSA by 0.04 in $C_v$ coherence and demonstrates superior expert-rated interpretability on aviation safety reports (Nanyonga et al., 30 May 2025). On short Hindi texts, BERTopic's highest $C_v$ is nearly double that of LDA (0.76 vs. 0.38) (Mutsaddi et al., 7 Jan 2025). For online discussions, BERTopic yields maximal topic diversity and highly granular clustering (Kaur et al., 19 Dec 2024). These benefits are consistently attributed to the contextual embeddings and c-TF-IDF representation, as density-based clustering aligns clusters to semantic submanifolds that purely probabilistic models fail to capture.
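
As a hedged sketch of how such $C_v$ scores are typically obtained, gensim's CoherenceModel can be applied to the top keywords of a fitted model; topic_model and docs below refer to a fitted BERTopic instance and its corpus as in the pipeline sketch above (in practice, keywords missing from the gensim dictionary need filtering first).

```python
# Sketch: C_v coherence for a fitted BERTopic model using gensim.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

tokenized_docs = [doc.lower().split() for doc in docs]
dictionary = Dictionary(tokenized_docs)

# Top-10 keywords per topic, skipping HDBSCAN's outlier topic (id -1).
topic_words = [
    [word for word, _ in topic_model.get_topic(topic_id)][:10]
    for topic_id in topic_model.get_topics()
    if topic_id != -1
]

cv = CoherenceModel(topics=topic_words, texts=tokenized_docs,
                    dictionary=dictionary, coherence="c_v").get_coherence()
print(f"C_v coherence: {cv:.3f}")
```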

3. Hyperparameter Sensitivity and Optimization Strategies

Topic granularity, coherence, and coverage are highly sensitive to BERTopic’s pipeline parameters. Empirical studies advocate:

  • Embedding Model: Task-domain adaptation markedly improves results; e.g., financial-specific transformers (FinTextSim, FinBERT) yield intratopic similarity gains of 81% and intertopic dissimilarity gains of 100% over MiniLM for financial corpora (Jehnen et al., 22 Apr 2025, Sangaraju et al., 2022).
  • Intermediate-layer Strategies: Aggregating representations from intermediate or multiple transformer layers (sum last 4, max pooling) can yield up to 70% higher coherence than the default mean-pooled last layer (Koterwa et al., 10 May 2025).
  • UMAP Settings: n_neighbors=10–30 tunes local vs. global structure, n_components=5–10 often suffices for clustering (Grootendorst, 2022, Vanin et al., 23 Dec 2024, Shinde et al., 4 Feb 2025).
  • Clustering Algorithm: HDBSCAN provides adaptive topic discovery but may assign up to 74% of short, heterogeneous responses as outliers. Replacing HDBSCAN with k-means trades coherence for 100% coverage (see the configuration sketch after this list) (Groot et al., 2022, Kaur et al., 19 Dec 2024).
  • Iterative Refinement: Iterative workflows—removing outliers and reclustering while monitoring adjusted Rand index, Van Dongen, or normalized variation of information—reduce noise and yield more complete topic partitions (Wong et al., 25 Jul 2024).
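
The configuration sketch below reflects these recommendations, contrasting the default HDBSCAN route with a k-means replacement for full coverage; it assumes the bertopic, umap-learn, hdbscan, and scikit-learn packages, and all parameter values are illustrative.

```python
# Sketch of the tuning recommendations above: UMAP ranges and the
# coherence-vs-coverage trade-off between HDBSCAN and k-means.
from bertopic import BERTopic
from sklearn.cluster import KMeans
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15,   # 10-30: local vs. global structure
                  n_components=5,   # 5-10 usually suffices for clustering
                  min_dist=0.0, metric="cosine")

# Default route: HDBSCAN discovers the number of topics but leaves outliers.
hdbscan_model = HDBSCAN(min_cluster_size=20, cluster_selection_method="eom",
                        prediction_data=True)
model_hdbscan = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)

# Alternative route: k-means assigns every document to a topic (full coverage),
# at the cost of fixing the topic count in advance and, typically, some coherence.
model_kmeans = BERTopic(umap_model=umap_model, hdbscan_model=KMeans(n_clusters=30))
```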

In domain-tailored applications (e.g., multilingual narratives, morphologically complex languages), monolingual transformers and pre-tuned preprocessing pipelines deliver state-of-the-art coherence and robustness (Shinde et al., 4 Feb 2025, Mutsaddi et al., 7 Jan 2025, Kandala et al., 20 Apr 2025).

4. Applications and Empirical Use Cases

BERTopic's capabilities are demonstrated across domains including aviation safety report analysis, financial document clustering, social media and online discussion mining, multilingual and morphologically complex corpora, and time-evolving collections such as social and political discourse.

The effectiveness of BERTopic across these domains stems from its contextually aware embeddings, which are robust to both text length and linguistic complexity, and its ability to yield interpretable, semantically tight topic clusters even in noisy, heterogeneous settings.

5. Limitations and Current Challenges

Identified limitations include:

  • Computational Burden: Transformer-based embedding and UMAP reduction are computationally intensive for large-scale corpora; efficient encoders or model distillation are active areas for future research (Nanyonga et al., 30 May 2025, Jehnen et al., 22 Apr 2025).
  • Coverage vs. Coherence: Density-based clustering may label large proportions of data as noise, which can be unacceptable in domains requiring exhaustive coverage. Alternative clustering methods (k-means, spectral clustering) or aggressive outlier reassignment may provide more balanced solutions (see the post-processing sketch after this list) (Groot et al., 2022, Kandala et al., 20 Apr 2025).
  • Hyperparameter Instability: Topic coherence and stability are highly non-linear with respect to UMAP and HDBSCAN settings. Systematic grid search, bootstrap resampling, and hierarchical consolidation are recommended (Arfaoui et al., 24 Nov 2025, Wong et al., 25 Jul 2024).
  • Topic Interpretation and Redundancy: Excessive topic granularity may overwhelm users in qualitative analysis; hierarchical merging or interactive exploration interfaces are advised (Kaur et al., 19 Dec 2024, Arfaoui et al., 24 Nov 2025).
  • Resource-limited and Multilingual Contexts: Performance in unseen or extremely low-resource languages depends critically on embedding availability and quality; ongoing research addresses monolingual fine-tuning and morphological adaptation (Shinde et al., 4 Feb 2025, Medvecki et al., 5 Feb 2024).
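
Two of these limitations, incomplete coverage and excessive topic granularity, can often be mitigated with BERTopic's built-in post-processing utilities; the sketch below assumes topic_model and docs as in the pipeline sketch above.

```python
# Sketch: mitigating outlier coverage and topic redundancy after fitting.
# Assumes `topic_model` and `docs` as in the pipeline sketch above.
topics, probs = topic_model.fit_transform(docs)

# Reassign HDBSCAN outliers (topic -1) to the most similar topic via c-TF-IDF.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)

# Merge overly granular topics down to a more interpretable number.
topic_model.reduce_topics(docs, nr_topics=25)

# Inspect the topic hierarchy to guide any further manual merging.
hierarchy = topic_model.hierarchical_topics(docs)
print(topic_model.get_topic_info().head(10))
```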

6. Best Practices and Recommendations

The published literature establishes the following guidelines:

  1. Embedding Selection: Use domain-specific or monolingual SBERT models when available; aggregate intermediate layers for difficult or heterogeneous datasets (Koterwa et al., 10 May 2025, Jehnen et al., 22 Apr 2025, Mutsaddi et al., 7 Jan 2025).
  2. Minimal Preprocessing: Avoid aggressive stopword removal or lemmatization prior to embedding, especially for transformer-based encoders; perform light cleaning and stopword filtering only post-clustering for keyword extraction (Grootendorst, 2022, Nanyonga et al., 30 May 2025).
  3. Hyperparameter Tuning: Systematically search or cross-validate UMAP (n_neighbors, min_dist) and HDBSCAN (min_cluster_size) settings; align configuration to corpus size, document length, and topical granularity (Arfaoui et al., 24 Nov 2025, Mutsaddi et al., 7 Jan 2025, Vanin et al., 23 Dec 2024).
  4. Hybrid and Iterative Pipelines: For large, hierarchical, or highly diverse corpora, combine matrix factorization (e.g., NMF) with BERTopic for multi-scale topic discovery or employ iterative refinement with stability monitoring (Cheng et al., 2022, Wong et al., 25 Jul 2024).
  5. Evaluation: Use both quantitative metrics (coherence $C_v$, NPMI, topic diversity) and human-in-the-loop interpretability assessment, including expert ratings and visual diagnostics, for topic validation (Nanyonga et al., 30 May 2025, Kaur et al., 19 Dec 2024, Arfaoui et al., 24 Nov 2025).
  6. Dynamic/Time-evolving Data: When analyzing longitudinal corpora (e.g., social/political discourse), segment by time slices or metadata, recompute c-TF-IDF per bin, and track topic trajectories using cosine similarity over topic representations (Mendonca et al., 27 Oct 2025).
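
For the time-sliced workflow in point 6, BERTopic's topics_over_time method recomputes c-TF-IDF within each time bin; the sketch below assumes a document list docs with a parallel list of timestamps, and the bin count is illustrative.

```python
# Sketch of dynamic topic modeling over time bins, per recommendation 6.
# Assumes `docs` and a parallel list `timestamps` (e.g., datetimes or years).
from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=15)
topics, probs = topic_model.fit_transform(docs)

# Recompute c-TF-IDF within each of 20 time bins and track topics across bins.
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)

# Plot the frequency trajectories of the largest topics.
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
fig.write_html("topics_over_time.html")
```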

In conclusion, BERTopic provides a flexible, high-precision framework for neural topic modeling by integrating state-of-the-art embedding models, nonlinear manifold learning, adaptive clustering, and cluster-level term weighting. Its empirical advantages—notably in topic coherence, diversity, and expert-rated interpretability—make it a reference method for research involving complex, high-dimensional, and multilingual textual data (Nanyonga et al., 30 May 2025, Mutsaddi et al., 7 Jan 2025, Kaur et al., 19 Dec 2024, Jehnen et al., 22 Apr 2025).
