
Top2Vec: Embedding-Based Topic Modeling

Updated 12 February 2026
  • Top2Vec is an unsupervised topic modeling algorithm that jointly embeds words and documents, enabling discovery of semantically coherent topics.
  • It leverages UMAP for dimensionality reduction and HDBSCAN for clustering, which removes the need for predefining the number of topics.
  • Top2Vec has been applied in legal, financial, and social media domains, offering scalable topic extraction with minimal preprocessing.

Top2Vec is an unsupervised topic modeling algorithm that discovers latent semantic topics by constructing a joint embedding space for words and documents and identifying dense regions within this space as candidate topics. Unlike generative probabilistic approaches such as Latent Dirichlet Allocation (LDA), Top2Vec is fundamentally an embedding- and clustering-based framework. It bypasses the need for bag-of-words representations, does not require specification of the number of topics a priori, and enables extraction of semantically meaningful topic vectors and keywords using geometric proximity measures in a continuous vector space. Top2Vec has been employed across diverse domains, including legal document analysis, Arabic social media studies, financial document modeling, and topological data analysis of research corpora (Angelov, 2020, Bastola et al., 31 Aug 2025, Mohdeb et al., 18 Apr 2025, Liu, 7 Dec 2025, Yadav et al., 16 Oct 2025, Krishnan, 2023, Kardos, 29 Jan 2026).

1. Algorithmic Foundations

Top2Vec operates by leveraging dense vector representations of words and documents, typically using the doc2vec Distributed Bag-of-Words (DBOW) or skip-gram neural embedding architectures. The workflow is as follows:

  • Embedding learning: Top2Vec trains, or adopts, a neural embedding model (DBOW, skip-gram, or a pre-trained Universal Sentence Encoder or Transformer) to jointly learn vectors for all words and documents. For a vocabulary V and embedding dimension d, every word w and every document D obtains a vector v_w, d_D ∈ ℝ^d.
  • Document vector construction: If word vectors are learned, each document vector is computed as the normalized mean of its constituent word embeddings: d_D = (1/|D|) Σ_{w∈D} v_w (Bastola et al., 31 Aug 2025, Angelov, 2020).
  • Semantic space: Both words and documents are embedded in the same space, allowing proximity to reflect topical similarity.
  • Dimensionality reduction: High-dimensional document vectors are projected to a lower-dimensional space (typically 5–50 dimensions) using UMAP, which preserves neighborhood structure via a fuzzy simplicial set representation and a cross-entropy cost (Angelov, 2020, Yadav et al., 16 Oct 2025).
  • Clustering: HDBSCAN, a hierarchical density-based clustering method, is applied to the reduced vectors to locate clusters that correspond to topics, automatically estimating the number of significant topics and identifying outlier/noise documents (Angelov, 2020, Bastola et al., 31 Aug 2025).
  • Topic vector definition: For each dense cluster C_k discovered by HDBSCAN, the topic vector τ_k is defined as the centroid in the original embedding space: τ_k = (1/|C_k|) Σ_{D∈C_k} d_D.
  • Keyword extraction: For each topic, words are ranked by cosine similarity to the topic vector, score(w, k) = cos(v_w, τ_k), and the top-N words become the topic descriptors.

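The geometric steps above (document vectors as word-vector means, topic centroids, cosine keyword ranking) can be sketched in plain Python. All vectors and the cluster assignments below are made-up toy values standing in for the learned embeddings and the UMAP + HDBSCAN output:

```python
import math

# Toy stand-ins for learned embeddings and for the cluster assignments
# that UMAP + HDBSCAN would produce (both entirely hypothetical).
word_vectors = {
    "court": [0.9, 0.1], "judge": [0.8, 0.2],
    "stock": [0.1, 0.9], "bond":  [0.2, 0.8],
}
documents = [["court", "judge"], ["judge", "court"], ["stock", "bond"]]
clusters = {0: [0, 1], 1: [2]}  # cluster id -> indices into `documents`

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Document vectors: mean of constituent word embeddings.
doc_vectors = [mean_vector([word_vectors[w] for w in d]) for d in documents]

# Topic vectors: centroid of each cluster's documents in the original space.
topic_vectors = {k: mean_vector([doc_vectors[i] for i in idx])
                 for k, idx in clusters.items()}

# Keyword extraction: rank all words by cosine similarity to a topic vector.
def top_words(k, n=2):
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(word_vectors[w], topic_vectors[k]),
                    reverse=True)
    return ranked[:n]

print(top_words(0), top_words(1))
```

With these toy inputs the legal-flavored cluster surfaces "court"/"judge" and the financial one "stock"/"bond", mirroring how real topic descriptors emerge purely from geometric proximity.
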
This pipeline is scalable and does not require aggressive text preprocessing such as lemmatization or domain-dependent stopword removal, as the embedding learning's subsampling and tokenization algorithms tend to suppress high-frequency noise (Angelov, 2020, Bastola et al., 31 Aug 2025, Mohdeb et al., 18 Apr 2025).

2. Mathematical Formulation and Hyperparameters

The core mathematical objectives of Top2Vec derive from established neural embedding practices:

  • Skip-gram training: For each center–context word pair (w_t, w_{t+j}), maximize

∑_{t=1}^{T} ∑_{−c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)

using negative sampling or hierarchical softmax (Bastola et al., 31 Aug 2025, Angelov, 2020, Yadav et al., 16 Oct 2025).
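
In the negative-sampling variant, the per-pair objective replaces the full softmax with log σ(u_ctx · v) + Σ_neg log σ(−u_neg · v). A minimal sketch with made-up 2-d vectors shows the quantity being maximized:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2-d vectors: one center word, its true context word, and
# two negative samples drawn from the noise distribution.
v_center    = [1.0, 0.0]
u_context   = [0.9, 0.1]
u_negatives = [[-0.8, 0.3], [-0.5, -0.7]]

# Per-pair negative-sampling objective:
#   log σ(u_ctx · v) + Σ_neg log σ(−u_neg · v)
# Training adjusts the vectors to push this toward 0 (its maximum).
objective = math.log(sigmoid(dot(u_context, v_center)))
objective += sum(math.log(sigmoid(-dot(u_neg, v_center)))
                 for u_neg in u_negatives)
print(round(objective, 3))
```
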

  • Document embedding (DBOW): For each (document, word) pair,

P(w | d) = exp(u_w^⊤ d) / ∑_{w′∈V} exp(u_{w′}^⊤ d)

and the total loss is the sum over all such pairs.

  • Clustering: HDBSCAN operates on UMAP-projected vectors, using local density estimation and mutual reachability distance to recover persistent clusters (Yadav et al., 16 Oct 2025, Bastola et al., 31 Aug 2025).
  • Topic-word assignment: Rank words by cosine similarity to centroids in the original space.
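
The DBOW output layer can be checked numerically on a toy vocabulary; the sketch below uses the common form that normalizes over the vocabulary, with made-up 2-d vectors:

```python
import math

# Toy PV-DBOW softmax: probability of each vocabulary word given one
# document vector (3-word vocabulary, hypothetical 2-d vectors).
d = [1.0, 0.2]
u = {"law": [0.8, 0.1], "tax": [0.3, 0.5], "art": [-0.6, 0.4]}

scores = {w: math.exp(sum(x * y for x, y in zip(u_w, d)))
          for w, u_w in u.items()}
z = sum(scores.values())         # normalizer over the vocabulary
p = {w: s / z for w, s in scores.items()}  # P(w | d) for each word

print(max(p, key=p.get))  # word this document vector predicts most strongly
```
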

Key tunable hyperparameters include:

| Parameter | Typical value | Role |
|---|---|---|
| Embedding dimension d | 300–512 | Controls representational capacity and training cost |
| Skip-gram window size c | 5–15 | Semantic context size; affects word co-occurrence scope |
| Negative samples (skip-gram) | 5–15 | Noise contrast in learning word proximity |
| UMAP n_neighbors | 15–50 | Local vs. global structure in manifold projection |
| UMAP output dimension | 5 | Lower-dimensional space for clustering |
| HDBSCAN min_cluster_size | 5–20 | Smallest topic considered non-outlier |
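
As a rough illustration of where these knobs live, recent versions of the reference top2vec package accept pass-through dictionaries for the UMAP and HDBSCAN stages. The values below are mid-range picks from the table; the constructor call is shown only as a comment, since it requires the installed library and a corpus:

```python
# Mid-range hyperparameter values from the table above, shaped as the
# pass-through dictionaries top2vec forwards to UMAP and HDBSCAN.
umap_args = {
    "n_neighbors": 15,    # local vs. global manifold structure
    "n_components": 5,    # output dimension used for clustering
    "metric": "cosine",
}
hdbscan_args = {
    "min_cluster_size": 15,             # smallest non-outlier topic
    "metric": "euclidean",
    "cluster_selection_method": "eom",  # excess-of-mass cluster selection
}

# Hypothetical call (requires the top2vec package and a document list):
# model = Top2Vec(documents, umap_args=umap_args, hdbscan_args=hdbscan_args)
```
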

Top2Vec is robust to defaults on short or moderately long documents, but can suffer on verbose texts or highly specialized domains unless embeddings are adapted (Bastola et al., 31 Aug 2025, Liu, 7 Dec 2025, Krishnan, 2023).

3. Empirical Performance and Application Domains

Top2Vec has been benchmarked across multiple domains:

  • Legal documents: Combined with Node2Vec for hybrid text-structural clustering, Top2Vec's semantic-only embeddings yield high internal clustering metrics (Silhouette ≈ 0.685, Calinski–Harabasz ≈ 15,340, Davies–Bouldin ≈ 0.45), outperforming LDA and NMF on legal text corpora (Bastola et al., 31 Aug 2025).
  • Arabic social media: Robust cross-dialectal topic discovery in Arabic, outperforming classical and neural baselines in human-assessed interpretability; six core narrative categories emerged from qualitative analysis (Mohdeb et al., 18 Apr 2025).
  • Customer reviews: On short, focused feedback data, Top2Vec achieves mid-range to high C_v coherence (0.56 for K = 5 topics), outperforming LDA, NMF, and PAM in topic coherence but trailing BERTopic and LSA on longer reviews (Krishnan, 2023).
  • Financial documents: Applied to hedge fund paragraphs (512-dimensional USE embeddings), Top2Vec delivers F1 = 0.8761 at K = 20 topics for classification tasks, outperforming LDA in informativeness, though with lower topic coherence (C_v = 0.3315 vs. LDA's 0.5442) (Liu, 7 Dec 2025).
  • Topological data analysis: Used as the embedding backbone for persistent homology studies, enabling discovery and interpretation of “holes” or missing context in scientific publication corpora (Yadav et al., 16 Oct 2025).

4. Comparative Evaluation and Limitations

Top2Vec’s unsupervised, embedding-based paradigm yields several distinctive trade-offs versus LDA, NMF, BERTopic, and Topeax:

  • Strengths:
    • No manual selection of K: the number of topics arises from the density structure.
    • Joint embedding enables semantic keyword extraction and document assignment.
    • Minimal preprocessing—stopwords, stemming, lemmatization frequently unnecessary.
    • Adaptable to multilingual and cross-domain corpora (Mohdeb et al., 18 Apr 2025, Krishnan, 2023).
  • Weaknesses:
    • Brittle to hyperparameters (especially UMAP n_neighbors and HDBSCAN min_cluster_size): topic count and quality can swing erratically, especially under subsampling or parameter changes (Kardos, 29 Jan 2026).
    • Ignores explicit word frequency statistics in keyword assignment—rare/junk tokens may contaminate topic labels, leading to less coherent topic keywords (Kardos, 29 Jan 2026).
    • Assumes spherical/centroidal clusters; does not incorporate cluster shape or variance, potentially degrading interpretability for complex topic distributions (Kardos, 29 Jan 2026).
    • Coherence inferior to BERTopic/LSA on long-form and verbose texts; topic assignments are less stable than LDA under topic-number variation (Krishnan, 2023, Liu, 7 Dec 2025).
    • Highly sensitive to corpus size—topic count grows nearly linearly with sample size; mean absolute percentage error in recovering gold cluster counts can reach 1797% (SD 2623) (Kardos, 29 Jan 2026).
  • Remedies and alternatives: The Topeax model addresses these deficiencies by combining kernel density peak detection for robust cluster inference and a lexical–semantic ranking of keywords, systematically outperforming Top2Vec in cluster recovery, topic coherence, and resilience to hyperparameter/corpus size variations (Kardos, 29 Jan 2026).

5. Practical Engineering Considerations and Text Adaptations

Top2Vec's design emphasizes turnkey deployment, but several corpus-level adaptations can improve performance:

  • Preprocessing: Default pipelines often forgo explicit stopword removal and lemmatization. However, for highly noisy datasets, aggressive normalization and Unicode cleaning may increase quality, as observed in Arabic social content modeling (Mohdeb et al., 18 Apr 2025).
  • Multilingual support: Top2Vec operates natively on non-Latin scripts given appropriate tokenizers; dialectal variety is assimilated into the same semantic space (Mohdeb et al., 18 Apr 2025).
  • Domain-specific embeddings: While Top2Vec typically trains document and word vectors on the input corpus, performance may be further boosted by integrating domain-adapted embedding models, though this remains underexplored (Bastola et al., 31 Aug 2025).
  • Chunked or paragraph-level application: For lengthy documents (legal contracts, financial reports), applying Top2Vec at the paragraph or chunk level leads to improved granularity of topic discovery and more interpretable results (Liu, 7 Dec 2025).
  • Topic reduction: To enforce a fixed topic count, hierarchical topic merging is supported, iteratively joining smallest/nearest clusters using centroid recomputation (Angelov, 2020, Liu, 7 Dec 2025).
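
A minimal sketch of one such merging step, under the assumption that reduction folds the smallest cluster into its nearest centroid and recomputes a size-weighted mean (toy centroids and counts, not the library's exact implementation):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_smallest(topics):
    """One reduction step: fold the smallest topic into its nearest
    neighbor (by centroid cosine similarity) and recompute the surviving
    centroid as a size-weighted mean. `topics` maps id -> (centroid, size)."""
    smallest = min(topics, key=lambda k: topics[k][1])
    c_s, n_s = topics.pop(smallest)
    nearest = max(topics, key=lambda k: cosine(topics[k][0], c_s))
    c_n, n_n = topics[nearest]
    total = n_n + n_s
    topics[nearest] = ([(x * n_n + y * n_s) / total
                        for x, y in zip(c_n, c_s)], total)
    return topics

# Hypothetical topics: id -> (centroid, document count).
topics = {0: ([1.0, 0.0], 40), 1: ([0.9, 0.1], 5), 2: ([0.0, 1.0], 30)}
merge_smallest(topics)  # topic 1 is smallest and closest to topic 0
print(sorted(topics))
```

Iterating this step until the desired topic count remains yields the fixed-K behavior described above.
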

6. Impact, Adoption, and Research Directions

Top2Vec’s unsupervised, clustering-based framework has established a new paradigm in open-vocabulary, adaptive topic discovery:

  • Adoption: Widely used in legal, financial, customer feedback, and social media settings due to its minimal need for prior knowledge or expensive annotation (Bastola et al., 31 Aug 2025, Mohdeb et al., 18 Apr 2025, Liu, 7 Dec 2025, Krishnan, 2023).
  • Analytical innovations: Backbone for advanced meta-analytical pipelines, e.g., using persistent homology to probe embedding space structure and interpret conceptual “negative space” within research corpora (Yadav et al., 16 Oct 2025).
  • Limitations stimulating further research: Top2Vec’s reliance on centroid-based geometric proximity and lack of lexical weighting motivates development of models that fuse semantic and frequency evidence, such as Topeax (Kardos, 29 Jan 2026).
  • Future Horizons: Exploration of domain-specific embeddings, integration with graph-based models, more robust topic-number estimation, and principled human-in-the-loop validation processes are highlighted as strategic directions for enhancing Top2Vec’s reliability and relevance in practical applications (Bastola et al., 31 Aug 2025, Kardos, 29 Jan 2026).

7. Quantitative Summary of Performance and Evaluation

Top2Vec's empirical effectiveness is supported by standard topic modeling metrics:

| Dataset/Domain | Silhouette | Calinski–Harabasz | DBI | F1 | C_v coherence | Notes |
|---|---|---|---|---|---|---|
| Legal (ACORD) | 0.685 | 15,340 | 0.45 | – | – | Outperforms LDA/NMF (Bastola et al., 31 Aug 2025) |
| Arabic social media | – | – | – | – | – (qualitative) | More coherent by human evaluation (Mohdeb et al., 18 Apr 2025) |
| Customer reviews (K = 5) | – | – | – | – | 0.56 | Mid-range, below BERTopic (Krishnan, 2023) |
| Hedge fund text | – | – | – | 0.8761 | 0.3315 | F1 ≥ LDA, lower coherence (Liu, 7 Dec 2025) |
| Cluster-count error (MAPE) | – | – | – | – | – | 1797% ± 2623 (vs. Topeax 60.5%) (Kardos, 29 Jan 2026) |

These results demonstrate that Top2Vec is highly competitive on short, homogeneous documents, but exhibits instability on larger, more diverse corpora and lacks explicit mechanisms for lexical coherence optimization. The Topeax methodology systematically improves upon Top2Vec’s deficiencies in both cluster discovery and topic description (Kardos, 29 Jan 2026).
