BERTopic: Transformer-Enhanced Topic Discovery
- BERTopic is a neural topic modeling framework that leverages transformer embeddings, UMAP reduction, and HDBSCAN clustering to generate semantically coherent, high-resolution topics.
- It modularly integrates various embedding models and hyperparameter settings, enabling its application across diverse domains such as finance, social media, and qualitative research.
- Empirical evaluations show BERTopic outperforms traditional models like LDA in handling short, multilingual, and domain-specific texts while offering flexible, interpretable topic representations.
BERTopic is a neural topic modeling framework that chains transformer-based sentence embeddings, dimensionality reduction, density-based clustering, and class-based term weighting to extract semantically coherent, high-resolution topics from large, heterogeneous corpora. It is designed to surpass traditional topic models such as Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF), especially on short, multilingual, morphologically rich, or domain-specific textual data. BERTopic’s pipeline is modular, supporting a diverse selection of embedding models, parameterizations, and cluster validation protocols, which has driven its adoption across social media analysis, financial document auditing, qualitative research, and policy analytics.
1. Core Methodological Pipeline
At the foundation of BERTopic is an integrated four-stage pipeline (sketched in code after the list):
- Sentence Embedding: Documents or text segments are mapped to dense, high-dimensional vectors via transformer models (e.g., SBERT, XLM-RoBERTa, all-MiniLM-L6-v2, FinTextSim). Each document vector encodes semantic content beyond simple n-gram occurrence, enabling the model to recognize thematic affinity even across lexical or orthographic variation (Grootendorst, 2022, Jehnen et al., 22 Apr 2025, Kandala et al., 20 Apr 2025, Mutsaddi et al., 7 Jan 2025, Medvecki et al., 2024).
- Dimensionality Reduction: Uniform Manifold Approximation and Projection (UMAP) transforms the embedding matrix from $\mathbb{R}^{n \times d}$ to $\mathbb{R}^{n \times d'}$ with $d' \ll d$, preserving local neighborhood structure for clustering. UMAP’s loss optimizes the cross-entropy between high- and low-dimensional fuzzy simplicial sets:

$$C = \sum_{i \neq j} \left[ v_{ij} \log \frac{v_{ij}}{w_{ij}} + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \right]$$

with $v_{ij}$ and $w_{ij}$ encoding membership strengths in high- and low-dimensional space, respectively (Grootendorst, 2022, Koterwa et al., 10 May 2025).
- Clustering: HDBSCAN segments the reduced embeddings into dense, variably sized clusters via mutual reachability distances and a hierarchical tree cut at points of maximum cluster stability. Key hyperparameters include `min_cluster_size` (smallest allowed topic), `min_samples` (density sensitivity), and the cluster selection method (`eom` for most stable clusters). A fraction of points is typically labeled as “noise” or outliers, a design that increases semantic consistency at the cost of coverage (Groot et al., 2022, Kandala et al., 20 Apr 2025).
- Topic Representation: For each cluster, BERTopic concatenates member documents and applies class-based TF-IDF (c-TF-IDF), scoring term $t$ in topic $c$ as follows:

$$\text{c-TF-IDF}(t, c) = \mathrm{tf}_{t,c} \cdot \log\left(\frac{m}{\mathrm{df}_t}\right)$$

where $\mathrm{tf}_{t,c}$ is the within-topic frequency, $m$ the number of topics, and $\mathrm{df}_t$ the count of topics containing the word. BM25 weighting or domain stopword lists may further refine salient features (Grootendorst, 2022, Kandala et al., 20 Apr 2025).
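The four stages map one-to-one onto BERTopic’s constructor arguments. Below is a minimal sketch of the pipeline in Python; the encoder name (`all-MiniLM-L6-v2`), the hyperparameter values, and the `docs` placeholder are illustrative assumptions, not settings prescribed by the cited studies:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

docs = ["..."]  # placeholder: one string per document

# Stage 1: transformer sentence embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Stage 2: UMAP reduction, preserving local neighborhood structure
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

# Stage 3: HDBSCAN density-based clustering ("eom" = excess of mass)
hdbscan_model = HDBSCAN(min_cluster_size=15, min_samples=10,
                        metric="euclidean", cluster_selection_method="eom")

# Stage 4: c-TF-IDF topic representation (optionally BM25-weighted)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)

topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model)
topics, probs = topic_model.fit_transform(docs)  # topic -1 collects outliers
print(topic_model.get_topic_info().head())
```

Because each stage is passed in as an independent object, any component (a domain-tuned encoder, a different reducer, an alternative clusterer) can be swapped without touching the rest of the pipeline.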
2. Hyperparameterization and Model Selection
The representational capacity and thematic sharpness of BERTopic are contingent on several adjustable parameters:
- Embedding Model: Domain, language, or corpus-specific encoders (e.g., jina-embeddings-v3 for Dutch, FinTextSim for finance, mBERT-uncased for Hindi, paraphrase-multilingual-mpnet for Serbian) greatly improve both intra-topic similarity and topic-precision (Jehnen et al., 22 Apr 2025, Mutsaddi et al., 7 Jan 2025, Medvecki et al., 2024, Nanyonga et al., 30 May 2025).
- UMAP: Output dimensionality `n_components` (2–200), neighborhood size `n_neighbors` (5–200), and `min_dist` (0.0–0.1) affect cluster granularity and the local/global structure trade-off (Schäfer et al., 2024).
- HDBSCAN: Lower `min_cluster_size` yields more, finer-grained topics at the expense of potential fragmentation or noise; higher values aggregate documents into broader clusters (Groot et al., 2022, Medvecki et al., 2024).
- Pooling and Layer Choices: Aggregation strategies over transformer layers (mean/max/CLS pooling and combinations of layers) substantially affect coherence and diversity; the sum or concatenation of higher layers with mean or max pooling often outperforms the BERTopic default (Koterwa et al., 10 May 2025); see the sketch after this list.
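A hedged sketch of this layer aggregation idea follows: it sums the last four hidden layers of a generic BERT encoder and max-pools over tokens before passing the embeddings to BERTopic. The encoder name, layer count, and pooling combination are illustrative assumptions, not the exact configurations benchmarked by Koterwa et al.:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from bertopic import BERTopic

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased",
                                    output_hidden_states=True)

def embed_upper_layers(texts, n_layers=4):
    """Sum the last n_layers hidden states, then max-pool over tokens."""
    vecs = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=256)
            hidden = encoder(**enc).hidden_states  # tuple of [1, seq, dim]
            summed = torch.stack(hidden[-n_layers:]).sum(dim=0)  # [1, seq, dim]
            vecs.append(summed.max(dim=1).values.squeeze(0))  # max over tokens
    return torch.stack(vecs).numpy()

docs = ["..."]  # placeholder corpus
embeddings = embed_upper_layers(docs)
topic_model = BERTopic().fit(docs, embeddings=embeddings)
```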
Empirical studies underscore the importance of grid-searching UMAP/HDBSCAN hyperparameters and fine-tuning the embedding choice to the linguistic or topical profile of the corpus (Kandala et al., 20 Apr 2025, Medvecki et al., 2024).
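The sketch below illustrates such a grid search; the parameter ranges are illustrative, and model quality is summarized here only by topic count and outlier rate (the coherence metrics from Section 3 can be added inside the loop):

```python
from itertools import product

import numpy as np
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = ["..."]  # placeholder corpus
# Encode once; each grid cell then only re-runs UMAP + HDBSCAN.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

results = []
for n_neighbors, n_components, min_cluster_size in product(
        [15, 50, 100], [5, 10, 50], [10, 25, 50]):
    model = BERTopic(
        umap_model=UMAP(n_neighbors=n_neighbors, n_components=n_components,
                        min_dist=0.0, metric="cosine", random_state=42),
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size,
                              cluster_selection_method="eom"))
    topics, _ = model.fit_transform(docs, embeddings)
    labels = np.array(topics)
    results.append({
        "params": (n_neighbors, n_components, min_cluster_size),
        "n_topics": len(set(topics)) - (1 if -1 in topics else 0),
        "outlier_rate": float(np.mean(labels == -1)),  # complement of coverage
    })
```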
3. Evaluation Metrics and Quality Assessment
BERTopic topic models employ both automated and human-centric metrics:
- Topic Coherence: Most studies report NPMI, $C_V$, $C_{UCI}$, and $C_{UMass}$, quantifying the syntagmatic and paradigmatic association among top keywords. Formulaic details, e.g.,

$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}$$

where $P(\cdot)$ is estimated over the corpus, are standard (Groot et al., 2022, Kandala et al., 20 Apr 2025, Mutsaddi et al., 7 Jan 2025, Jehnen et al., 22 Apr 2025).
- Topic Diversity: The number of unique words across all topics’ top-$N$ lists divided by $k \times N$ (for $k$ topics), penalizing redundancy (Groot et al., 2022, Grootendorst, 2022); see the sketch below for a computable version.
- Human Evaluation: Ratings of semantic coherence and domain resonance (e.g., 1–5 scale), and triangulation against manual or NVivo-derived themes, especially for qualitative and policy studies (Golpayegani et al., 16 Sep 2025, Kandala et al., 20 Apr 2025, Kaur et al., 2024).
- Cluster Validation: Outlier rate (fraction of texts assigned “noise” label), coverage, and silhouette/DBCV scores measure granularity and the quality of clustering solutions (Groot et al., 2022, Schäfer et al., 2024).
- Organizing Power: Intra- and inter-topic similarity scores, especially in domain-constrained settings (e.g., finance), measure how well topic clusters correspond to true semantic boundaries (Jehnen et al., 22 Apr 2025).
A systematic evaluation typically balances coherence, diversity, coverage, and interpretability, supplemented by manual review and, where possible, gold-standard comparison (Golpayegani et al., 16 Sep 2025, Kandala et al., 20 Apr 2025).
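Such an evaluation can be scripted directly. The sketch below computes topic diversity, NPMI coherence (via gensim’s estimator), and the outlier rate; it assumes `topic_words` holds each topic’s top-N keyword list (outlier topic excluded) and `tokenized_docs` is the tokenized corpus:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def topic_diversity(topic_words, top_n=10):
    """Unique words across all top-N lists divided by k * N."""
    words = [w for topic in topic_words for w in topic[:top_n]]
    return len(set(words)) / len(words)

def npmi_coherence(topic_words, tokenized_docs):
    """Corpus-level NPMI coherence via gensim's sliding-window estimator."""
    dictionary = Dictionary(tokenized_docs)
    return CoherenceModel(topics=topic_words, texts=tokenized_docs,
                          dictionary=dictionary,
                          coherence="c_npmi").get_coherence()

def outlier_rate(topics):
    """Fraction of documents HDBSCAN labeled as noise (topic -1)."""
    return float(np.mean(np.array(topics) == -1))

# topic_words from a fitted BERTopic model, excluding the -1 outlier topic:
# topic_words = [[w for w, _ in model.get_topic(t)]
#                for t in model.get_topics() if t != -1]
```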
4. Comparative Performance Across Domains and Languages
Numerous studies report that BERTopic outperforms LDA, PLSA, NMF, and even some neural baselines (CTM, Top2Vec, ETM) on short, heterogeneous, and/or morphologically rich text:
- Short Text and Low-Resource Languages: On Hindi, Serbian, and Dutch narratives, BERTopic yields higher coherence (C_V up to 0.76 on Hindi) and more granular, culturally salient themes, even under minimal preprocessing—lemmatization and stopword removal are often unnecessary when leveraging strong contextual embeddings (Mutsaddi et al., 7 Jan 2025, Medvecki et al., 2024, Kandala et al., 20 Apr 2025).
- Multilingual or Dialectal Data: Domain-tuned multilingual embeddings and customized stopword lists markedly enhance both recognizability and the clustering of region-specific phenomena (Kandala et al., 20 Apr 2025, Medvecki et al., 2024, Schäfer et al., 2024).
- Financial and Policy Discourse: Domain-adaptive transformers (e.g., FinTextSim) increase topic-precision (1.0 vs. 0.31 for off-the-shelf encoders), eliminate overlapping topics (100% reduction in intertopic similarity), and recover the full set of semantic classes (all 14 financial domains) (Jehnen et al., 22 Apr 2025).
- Qualitative and Social Analysis: High topic diversity (≥0.99), low KL divergence, and logical organization enable exploratory research and support interpretative work beyond what is possible with traditional BoW models (Kaur et al., 2024, Compton, 26 Aug 2025).
A plausible implication is that BERTopic’s modularity, especially regarding embedding choice and clustering hyperparameters, enables direct adaptation to new text genres, domains, and languages otherwise ill-served by count-based or generative probabilistic models.
5. Practical Extensions, Limitations, and Customizations
Researchers have extended or modified the BERTopic workflow in several important ways:
- Sentence-Level Splitting: For long documents (e.g., AI policy PDFs), splitting into sentences as “documents” enhances topic granularity and supports fine-grained analysis (Golpayegani et al., 16 Sep 2025, Kandala et al., 20 Apr 2025).
- Intermediate Layer Representations: Using max pooling and aggregation across multiple transformer layers improves both topic coherence and diversity, routinely overtaking the default “mean pooling final layer” setting (Koterwa et al., 10 May 2025).
- Domain Stopword Lists and Customized Preprocessing: Domain-specific stopwords (Loughran–McDonald for finance), lemmatization for morphological normalization, or length-based filtering for social media narratives optimize the signal-to-noise ratio while minimizing the risk of discarding meaningful variance (Medvecki et al., 2024, Kandala et al., 20 Apr 2025).
- Clustering Alternatives: Replacing HDBSCAN with k-Means resolves the “outlier problem” at the expense of slightly reduced topic coherence and diversity. k-Means guarantees full clustering coverage, which matters when every text must be assigned an interpretable topic (Groot et al., 2022); see the sketch after this list, which pairs a k-Means backend with sentence-level splitting.
- Hybrid Evaluation Frameworks: Integrated pipelines frequently combine automatic metrics, domain-expert review, consensus coding, and recall of gold-standard themes for comprehensive model validation (Golpayegani et al., 16 Sep 2025, Kaur et al., 2024, Kandala et al., 20 Apr 2025, Compton, 26 Aug 2025).
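Two of these customizations, sentence-level splitting and a k-Means backend, compose naturally, as sketched below. The NLTK tokenizer and `n_clusters=50` are illustrative assumptions; BERTopic’s `hdbscan_model` argument also accepts scikit-learn-style clusterers such as `KMeans`:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

long_docs = ["..."]  # placeholder: long documents, e.g. policy PDFs as text

# Sentence-level splitting: treat each sentence as a "document".
sentences = [s for doc in long_docs for s in sent_tokenize(doc)]

# k-Means backend: every sentence receives a topic (no -1 outlier label),
# trading some coherence/diversity for full coverage.
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=50, random_state=42))
topics, _ = topic_model.fit_transform(sentences)
```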
Nevertheless, coverage-coherence trade-offs persist, HDBSCAN can discard large fractions of data as noise, and hyperparameter sensitivity can introduce substantial run-to-run variation. For multi-topic documents, per-sentence or per-paragraph embedding is recommended, since BERTopic natively assumes single-label assignment (Grootendorst, 2022).
6. Empirical Application and Thematic Insights
Multiple studies illustrate BERTopic’s empirical utility:
- AI Governance: Revealed temporal shifts in EU policy discourse from “ethical AI” to “regulatory enforcement” and “operationalized legal AI,” with persistent attention to risk management and waning attention to environmental impacts (Golpayegani et al., 16 Sep 2025).
- Financial Reporting: Enabled recovery of all key economic classes from 10-K filings with clear cluster separation, driving applications in risk management, valuation, and financial analytics (Jehnen et al., 22 Apr 2025).
- Open-Ended Narratives and Social Media: Uncovered regionally specific, semantically coherent topics in daily narratives and Reddit communities, surfacing both expected and unexpected semantic associations (Kandala et al., 20 Apr 2025, Kaur et al., 2024).
- Discourse Analysis and Historical Frames: Facilitated systematic triangulation of semantic frames with lexical bigram searches, supporting reproducible, multi-layered interpretations in political QDA (Compton, 26 Aug 2025).
7. Best Practices and Recommendations
Key recommendations derived from published studies include:
- Prioritize embedding models adapted to the domain or language and systematically tune UMAP/HDBSCAN settings for the target corpus (Jehnen et al., 22 Apr 2025, Medvecki et al., 2024).
- Preprocess with domain-appropriate stopwords and lemmatization as required, but avoid over-processing that risks losing contextual cues encoded in transformer spaces (Kandala et al., 20 Apr 2025).
- Use outlier reassignment procedures to avoid losing rare but important documents, and apply grid search or iterative tuning for cluster size and tightness (Medvecki et al., 2024, Groot et al., 2022); a reassignment sketch follows this list.
- Regularly validate output via mixed metrics (coherence, diversity, cluster stability) and human review, merging or discarding incoherent topics where appropriate (Golpayegani et al., 16 Sep 2025, Compton, 26 Aug 2025).
- For highest performance, especially on large or morphologically complex datasets, run multiple embedding/pooling strategies and select the configuration that optimally balances coherence and diversity (Koterwa et al., 10 May 2025).
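For the outlier reassignment recommendation, recent BERTopic versions expose a `reduce_outliers` helper. The sketch below reassigns noise documents by c-TF-IDF similarity and refreshes the topic representations; the strategy choice is one illustrative default among several the library offers:

```python
from bertopic import BERTopic

docs = ["..."]  # placeholder corpus
topic_model = BERTopic()  # HDBSCAN-based default: some docs land in topic -1
topics, probs = topic_model.fit_transform(docs)

# Reassign documents labeled -1 to their most similar existing topic.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

# Recompute c-TF-IDF topic representations with the new assignments.
topic_model.update_topics(docs, topics=new_topics)
```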
In summary, BERTopic’s transformer-driven approach, augmented by flexible dimensionality reduction and clustering, and its unsupervised yet interpretable c-TF-IDF topic representation, have positioned it as a leading tool for topic discovery in research environments demanding high semantic fidelity, coverage, and model transparency (Grootendorst, 2022).