BERTopic Pipeline Overview
- BERTopic is a neural topic modeling pipeline that leverages transformer embeddings and density-based clustering to generate human-interpretable topics.
- It integrates text preprocessing, sentence embedding, UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF for succinct topic labeling.
- The modular design adapts to various domains and languages, outperforming classic models like LDA on short, noisy, or multilingual texts.
BERTopic is a neural topic modeling pipeline that combines transformer-based embeddings, non-linear dimensionality reduction, and density-based clustering with a class-based TF–IDF representation to produce human-interpretable topics from unstructured text corpora. The pipeline's core stages (text preprocessing, sentence embedding, dimensionality reduction, clustering, and topic representation) are modular, allowing adaptation to a range of domains, languages, and input lengths. BERTopic is designed to address limitations of classical topic models (e.g., LDA) on short, noisy, or multilingual data by leveraging contextual embeddings and flexible density-based clustering, and it supports integration into advanced information extraction and summarization workflows (Grootendorst, 2022, Torres et al., 2024, Schäfer et al., 2024, Groot et al., 2022).
1. Data Preparation and Preprocessing
The pipeline begins with corpus-specific text cleaning. Commonly, both minimal and extensive preprocessing regimes are used, depending on corpus characteristics:
- Standard operations: Removal of HTML tags, URLs, non-ASCII characters, punctuation, and numeric tokens; lowercasing; collapse of whitespace; stripping non-informative tokens or custom domain-specific boilerplate.
- Stopword removal: Standard or custom stop word lists (e.g., NLTK’s for English or Italian).
- Lemmatization: Performed optionally using tools such as spaCy, NLTK’s WordNet lemmatizer, or classla for morphologically rich languages; empirical results indicate that transformer embeddings are robust to omission of lemmatization on short, inflection-rich text (Medvecki et al., 2024).
- Filtering: For focused topic modeling, relevance filtering may be applied (e.g., top 200 documents by cosine similarity to a query), and documents below length thresholds or containing only redacted/boilerplate content are typically discarded (Torres et al., 2024, Bhandarkar et al., 8 Oct 2025).
In multilingual or morphologically complex contexts, language identification (FastText), transliteration, or sentence-level splitting may be introduced (Schäfer et al., 2024, Medvecki et al., 2024).
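The standard cleaning operations listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the preprocessing used in any cited study: the function names, the tiny stop word set, and the length threshold are hypothetical, and the ASCII-only character filter would be unsuitable for languages with diacritics or non-Latin scripts.

```python
import html
import re

# Hypothetical, deliberately tiny stop word list; a real pipeline would use
# a full list such as NLTK's.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on", "the", "to"}

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text, remove_stopwords=True):
    """Apply basic cleaning to one document and return a normalized string."""
    text = html.unescape(text)           # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)         # strip HTML tags
    text = URL_RE.sub(" ", text)         # strip URLs
    text = text.lower()
    text = NON_ALPHA_RE.sub(" ", text)   # drop punctuation, digits, non-ASCII
    tokens = SPACE_RE.sub(" ", text).strip().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


def filter_short(docs, min_tokens=3):
    """Discard documents with fewer than min_tokens whitespace-separated tokens."""
    return [d for d in docs if len(d.split()) >= min_tokens]


raw = "<p>Visit https://example.com for the 3 BEST topics!</p>"
cleaned = clean_text(raw)
kept = filter_short([cleaned, "ok"])
```

Because transformer embeddings are reported to be robust to light preprocessing, calling `clean_text(raw, remove_stopwords=False)` and skipping lemmatization is often a reasonable starting point before embedding.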
2. Embedding Generation with Transformer Models
Core to BERTopic is the transformation of preprocessed documents into dense semantic vectors:
- Embedding model: SBERT-family models such as “all-MiniLM-L6-v2” (384-dim), “all-mpnet-base-v2” (768-dim), or domain-specific transformers (e.g., distil-ita-legal-bert for Italian legal text) (Grootendorst, 2022, Marulli et al., 13 May 2025).
- Sentence-level or document-level granularity: Text units may be full abstracts, concatenations (e.g., title+abstract), paragraphs, or sentences, depending on the analysis.
- Batch encoding: Models are generally used as-is; fine-tuning is rare in BERTopic pipelines unless extensive domain adaptation is required.
- Normalization: L2 normalization of embeddings is standard, particularly when UMAP and HDBSCAN use cosine or Euclidean distances in reduced space.
Recommendations point to richer, larger embedding models giving marginally improved coherence/diversity, but “all-MiniLM-L6-v2” provides a favorable tradeoff between speed and accuracy (Medvecki et al., 2024, Compton, 26 Aug 2025).
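One reason L2 normalization is useful is that, for unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity: ||u − v||² = 2 − 2·cos(u, v). The following minimal sketch checks this identity on two toy vectors. The vectors and helper names are illustrative assumptions; a real pipeline would obtain its embeddings from an SBERT model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy stand-ins for two sentence embeddings.
u = [3.0, 4.0, 0.0]
v = [1.0, 0.0, 2.0]

u_hat, v_hat = l2_normalize(u), l2_normalize(v)
lhs = squared_euclidean(u_hat, v_hat)        # distance between unit vectors
rhs = 2.0 - 2.0 * cosine_similarity(u, v)    # same quantity via cosine
```

Since `lhs` and `rhs` agree up to floating-point error, nearest-neighbor rankings under Euclidean distance on normalized embeddings match those under cosine similarity.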
3. Dimensionality Reduction via UMAP
The high dimensionality of transformer embeddings (typically 384–768) impairs clusterability, so UMAP is used to project them into a low-dimensional latent space (typically 2–5 dimensions):
- Algorithm: UMAP (Uniform Manifold Approximation and Projection) minimizes a cross-entropy between fuzzy simplicial sets constructed in high and low dimensions.
- Principal hyperparameters:
  - n_neighbors: controls the local/global structure tradeoff; common settings are 15–20.
  - n_components: dimensionality of the projection; 2 for visualization, 5 or more for clustering.
  - min_dist: minimum distance allowed between points in the projection; lower values yield more tightly packed clusters.
  - metric: typically ‘cosine’, sometimes ‘euclidean’ if required by downstream clustering.
- Empirical rationale: UMAP preserves local semantic neighborhoods and global manifold structure, resulting in more coherent, dense clusters compared to PCA or t-SNE, especially for short/noisy texts or multilingual corpora (Schäfer et al., 2024, Mendonca et al., 27 Oct 2025, Torres et al., 2024, Cheng et al., 2022).
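For reference, the cross-entropy objective mentioned above can be written out explicitly. The notation here is an assumption that follows the general UMAP formulation rather than any of the works cited in this section: $E$ is the set of candidate edges between points, and $w_h(e)$ and $w_l(e)$ are the membership strengths of edge $e$ in the high- and low-dimensional fuzzy simplicial sets.

```latex
C = \sum_{e \in E} \left[
      w_h(e) \log \frac{w_h(e)}{w_l(e)}
      + \bigl(1 - w_h(e)\bigr) \log \frac{1 - w_h(e)}{1 - w_l(e)}
    \right]
```

The first term acts as an attractive force that pulls together points that are neighbors in the original space, while the second acts as a repulsive force that pushes apart points that are not.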
4. Clustering: Density-Based Methods and Alternatives
Clustering is performed on UMAP-reduced embeddings using density-based algorithms:
- Default: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)
  - Key parameters: min_cluster_size (minimum points per cluster; typical values 10–50) and min_samples (how conservative the core-point definition is; larger values label more points as noise).
  - Distance metric: ‘euclidean’ on UMAP coordinates.
  - Outliers: Documents not assigned to any dense cluster are labeled −1 and excluded from further topic representation.
- Alternatives: k-means can substitute HDBSCAN (e.g., for full coverage and when every input must be assigned to a topic), but sacrifices density-adaptive granularity and outlier handling (Groot et al., 2022).
- Hyperparameter tuning: Grid search or empirical adjustment based on metrics such as DBCV, silhouette score, and topic coherence. Coverage and topic granularity are directly affected by min_cluster_size and UMAP parameters (Schäfer et al., 2024, Compton, 26 Aug 2025).
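The reduction and clustering stages described above can be assembled explicitly with the `bertopic`, `umap-learn`, `hdbscan`, and `scikit-learn` packages. The sketch below is one possible configuration under stated assumptions: the function name `build_topic_model` is hypothetical, the parameter values are illustrative choices within the ranges mentioned in the text rather than settings from any cited study, and `random_state=42` is an arbitrary seed for reproducibility.

```python
def build_topic_model(min_cluster_size=15, n_neighbors=15,
                      use_kmeans=False, n_clusters=20):
    """Assemble a BERTopic model from explicit UMAP and clustering sub-models.

    Third-party imports are kept inside the function so that this module can
    be imported even where the libraries are not installed.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.cluster import KMeans
    from umap import UMAP

    # Reduce embeddings to 5 dimensions using cosine distance.
    umap_model = UMAP(n_neighbors=n_neighbors, n_components=5,
                      min_dist=0.0, metric="cosine", random_state=42)

    if use_kmeans:
        # k-means assigns every document to a topic (no -1 outlier label).
        cluster_model = KMeans(n_clusters=n_clusters, random_state=42)
    else:
        # Density-based clustering on the Euclidean UMAP coordinates.
        cluster_model = HDBSCAN(min_cluster_size=min_cluster_size,
                                metric="euclidean",
                                cluster_selection_method="eom",
                                prediction_data=True)

    # BERTopic accepts the clustering model through its hdbscan_model argument.
    return BERTopic(embedding_model="all-MiniLM-L6-v2",
                    umap_model=umap_model,
                    hdbscan_model=cluster_model)
```

With the libraries installed, `topics, probs = build_topic_model().fit_transform(docs)` would fit the pipeline on a list of strings `docs`; under HDBSCAN, entries of `topics` equal to −1 mark outliers.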
5. Topic Representation with Class-Based TF–IDF
Topic representation distinguishes BERTopic from standard clustering-based approaches:
- Class-based TF–IDF (c-TF-IDF):
  - For each cluster $c$, construct a pseudo-document by concatenating all texts in the cluster.
  - Compute the normalized term frequency $\mathrm{tf}_{t,c}$ and the inverse topic frequency across all clusters, and combine them as:
    $$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(\frac{N}{n_t}\right)$$
  - Definitions:
    - $\mathrm{tf}_{t,c}$: frequency of term $t$ in cluster $c$, normalized by the total number of tokens in $c$
    - $n_t$: number of clusters in which $t$ appears
    - $N$: total number of clusters
  - Optionally, advanced variants such as BM25 scoring or frequency reduction (e.g., square-root frequency scaling) are implemented (Mendonca et al., 27 Oct 2025).
- Keyword extraction and topic labeling:
  - Words are ranked by c-TF-IDF score per cluster; the top-n (typically 5–15) form the interpretable topic descriptor.
  - For increased interpretability, these keywords may be post-processed by LLMs (e.g., GPT-4, Claude 3.7) to generate concise, human-readable titles or summaries (Torres et al., 2024, Marulli et al., 13 May 2025).
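The scoring and ranking steps of this section can be illustrated in pure Python. This minimal sketch assumes the plain variant described by the definitions above: the term frequency in a cluster, normalized by that cluster's token count, multiplied by the logarithm of the total number of clusters divided by the number of clusters containing the term. Library implementations may smooth or define the inverse-frequency term differently. The function names and the toy clusters are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster.

    clusters maps a cluster id to a list of tokenized documents (lists of str).
    Returns a dict: cluster id -> {term: score}.
    """
    # One pseudo-document per cluster: pool the tokens of all its documents.
    counts = {c: Counter(tok for doc in docs for tok in doc)
              for c, docs in clusters.items()}
    n_clusters = len(counts)

    # Number of clusters in which each term appears.
    cluster_freq = Counter()
    for term_counts in counts.values():
        cluster_freq.update(term_counts.keys())

    scores = {}
    for c, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[c] = {
            t: (n / total) * math.log(n_clusters / cluster_freq[t])
            for t, n in term_counts.items()
        }
    return scores


def top_terms(scores, cluster_id, n=3):
    """Return the n highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores[cluster_id].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


toy_clusters = {
    0: [["court", "ruling", "appeal"], ["court", "judge", "data"]],
    1: [["model", "training", "data"], ["model", "embedding"]],
}
scores = c_tf_idf(toy_clusters)
legal_keywords = top_terms(scores, 0, n=2)
ml_keywords = top_terms(scores, 1, n=2)
```

In this toy example the term "data" occurs in both clusters, so its score is zero in each and it never surfaces as a topic keyword, which is the intended effect of the inverse-frequency factor.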
6. Evaluation, Tuning, and Integration
BERTopic pipelines are characterized by extensive metrics-driven parameter selection, evaluation, and downstream integration:
- Topic quality metrics:
  - Coherence: Most commonly measured via NPMI, C_v (Gensim), or UMass; topic sets in PROMPTHEUS yielded coherence scores of 0.41–0.48 for scientific systematic literature reviews (SLRs), with higher scores reported for legal and German news corpora when tuning is adequate (Torres et al., 2024, Marulli et al., 13 May 2025, Schäfer et al., 2024).
  - Diversity: Fraction of unique terms among the top-k keywords of all topics; optimal values depend on domain (e.g., TD = 0.88–0.90 for short multi-domain text; TD = 0.68 for segment-optimized legal corpora).
  - Coverage: Proportion of documents assigned to non-outlier topics; critical when the outlier rate is non-trivial (e.g., 74% outliers in some HDBSCAN runs on short text).
  - DBCV/Silhouette: Cluster quality indices used to guide hyperparameter sweeps (Schäfer et al., 2024).
- Comparison to classical models: In the cited studies, BERTopic outperforms LDA and NMF in both coherence and interpretability, especially on short, noisy, or multilingual text (Groot et al., 2022, Marulli et al., 13 May 2025, Medvecki et al., 2024).
- Downstream integration: Topic clusters can guide extractive summarization, literature synthesis, legal document indexing, API documentation summarization, or empirical workflow evaluation (Naghshzan et al., 2023, Torres et al., 2024, Cheng et al., 2022, Marulli et al., 13 May 2025).
- Pipeline examples: PROMPTHEUS employs SBERT→UMAP→HDBSCAN→c-TF-IDF→LLM for SLRs, with dynamic parameter adaptation to dataset size and tolerance for outliers (Torres et al., 2024). MSHTM uses a hierarchical approach with NMF for coarse topic discovery, followed by BERTopic for fine-grained subtopic extraction (Cheng et al., 2022).
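Two of the metrics above, topic diversity and coverage, are simple enough to compute directly from a fitted model's outputs. The sketch below follows the definitions given in this section; the function names and the toy inputs are hypothetical and do not reproduce results from any cited study.

```python
def topic_diversity(topics_top_words):
    """Fraction of unique terms among the top-k keywords of all topics."""
    all_words = [w for words in topics_top_words for w in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


def coverage(assignments, outlier_label=-1):
    """Proportion of documents assigned to a non-outlier topic."""
    if not assignments:
        return 0.0
    kept = sum(1 for topic in assignments if topic != outlier_label)
    return kept / len(assignments)


# Toy inputs: top-3 keywords for three topics and per-document topic ids,
# where -1 marks an outlier.
top_words = [
    ["court", "judge", "appeal"],
    ["model", "data", "training"],
    ["data", "court", "privacy"],
]
assignments = [0, 1, -1, 2, -1, 0, 1, 1]

td = topic_diversity(top_words)   # 7 unique terms out of 9
cov = coverage(assignments)       # 6 of 8 documents kept
```

Tracking both values during a hyperparameter sweep makes the tradeoff explicit: a larger min_cluster_size may raise coherence while lowering coverage.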
7. Advanced Practices, Pitfalls, and Domain Adaptation
Practical experience across domains has established several best practices and limitations:
- Minimal preprocessing is favored unless explicit needs exist (e.g., non-standard script, OCR-generated noise). Modern transformer embeddings are highly robust to stopword presence, lemmatization omission, or minor domain-specific artifacts (Medvecki et al., 2024, Compton, 26 Aug 2025).
- Parameter tuning requires balancing topic granularity, semantic coherence, and coverage. Aggressive outlier removal may cause undesirable document loss in survey or evaluation settings (Groot et al., 2022).
- Multilingual and domain-adapted embeddings: BERTopic supports deployment with language- or domain-specific SBERT variants for improved cross-language or in-domain topic recovery (Schäfer et al., 2024, Marulli et al., 13 May 2025).
- Legal, scientific, and conversational domains: Case studies in Italian legal judgments, systematic literature reviews, conversational preference data, API documentation, and political discourse have validated BERTopic’s flexibility (Mendonca et al., 27 Oct 2025, Marulli et al., 13 May 2025, Bhandarkar et al., 8 Oct 2025, Naghshzan et al., 2023).
- Integration with LLMs for labeling/summarization: Topic descriptions generated by LLMs (using c-TF-IDF keywords as prompts) achieve high BERTScore F1 compared to human-generated labels, supporting semi-automated interpretation (Marulli et al., 13 May 2025, Torres et al., 2024).
- Hierarchical and hybrid approaches: BERTopic may be nested within larger multi-scale models, including initial broad-topic discovery (e.g., NMF) followed by BERTopic-driven subtopic analysis (Cheng et al., 2022).
Potential limitations include sensitivity to corpus size and clusterability, the need for careful hyperparameter tuning in low-resource or highly imbalanced contexts, and the exclusion of outliers, which may be unsuitable when exhaustive coverage is required.
References:
(Grootendorst, 2022, Groot et al., 2022, Medvecki et al., 2024, Schäfer et al., 2024, Torres et al., 2024, Cheng et al., 2022, Mendonca et al., 27 Oct 2025, Marulli et al., 13 May 2025, Bhandarkar et al., 8 Oct 2025, Compton, 26 Aug 2025, Naghshzan et al., 2023)