Topeax: Automated Topic Modeling

Updated 5 February 2026
  • Topeax is an automated topic modeling framework that detects dense regions in low-dimensional sentence embedding spaces using Gaussian kernel density estimation.
  • It employs a Gaussian Mixture Model with soft cluster assignments to infer the natural number of topics without manual parameter tuning.
  • By fusing normalized lexical and semantic scores for keyword selection, Topeax outperforms UMAP+HDBSCAN approaches in stability and interpretability.

Topeax is an automated topic modeling and clustering framework that jointly addresses cluster discovery and keyword selection for natural language corpora, with reported empirical gains over UMAP + HDBSCAN-based models such as Top2Vec and BERTopic. Topeax detects dense regions in low-dimensional sentence embedding spaces via Gaussian kernel density estimation, infers the natural number of topic clusters from these density peaks, and fuses normalized lexical and semantic scores to produce interpretable and robust topic keywords. The framework yields high-quality topic clustering and descriptions that remain stable under changes in sample size and hyperparameter settings, offering a systematized approach to clustering-based topic modeling for natural language processing applications (Kardos, 29 Jan 2026).

1. Density-Peak-Based Clustering in Embedding Space

Topeax begins by embedding all documents into a 2D vector space via a Sentence-Transformer model, with nonparametric t-SNE for projection (using cosine distance and user-set perplexity, default 50). A Gaussian kernel density estimate (KDE) is computed on a $100 \times 100$ grid covering the embedding space:

$$\hat f(x) = \frac{1}{n\,h^2} \sum_{d=1}^n \exp\left(-\frac{1}{2}\left\|\frac{x - x_d}{h}\right\|^2\right)$$

where $h$ is determined by Scott's rule. Local maxima in this density field, each identified within a five-unit-radius neighborhood (connectivity $= 25$), serve as candidate cluster centers ("peaks").
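The peak-detection step can be sketched as follows, assuming the 2D t-SNE coordinates `xy` are already computed. SciPy's `gaussian_kde` (which defaults to Scott's rule) and a sliding maximum filter stand in for whatever Topeax uses internally, and the mean-density floor that discards flat low-density ties is an assumption of this sketch, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.stats import gaussian_kde

def find_density_peaks(xy, grid_size=100, radius=5):
    """Evaluate a Gaussian KDE on a grid and return its local-maximum cells."""
    kde = gaussian_kde(xy.T, bw_method="scott")  # Scott's rule bandwidth
    gx = np.linspace(xy[:, 0].min(), xy[:, 0].max(), grid_size)
    gy = np.linspace(xy[:, 1].min(), xy[:, 1].max(), grid_size)
    xx, yy = np.meshgrid(gx, gy)
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(grid_size, grid_size)
    # A cell is a peak if it holds the maximum density within `radius` cells
    # in every direction; the mean-density floor screens out flat regions
    # (the floor is this sketch's assumption).
    local_max = maximum_filter(density, size=2 * radius + 1)
    peak_mask = (density == local_max) & (density > density.mean())
    iy, ix = np.nonzero(peak_mask)
    return np.column_stack([gx[ix], gy[iy]])  # (x, y) coordinates of peaks
```

On a corpus whose projection contains two well-separated dense regions, this returns one coordinate pair per region, which then seed the mixture model below.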

Topeax eschews hard Voronoi partitioning; instead, it initializes a Gaussian Mixture Model (GMM) with fixed means at these peaks and learns component weights and covariances via the EM algorithm. Each document $d$ with embedding $x_d$ receives soft cluster memberships $T_{kd} = P(z_k = 1 \mid x_d)$, and final assignments are made by

$$z_d = \arg\max_k T_{kd}$$

This approach infers the number of clusters (topics) directly from data, removing the need for manual selection of $K$.
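A minimal sketch of the peak-seeded GMM step, assuming `xy` (2D embeddings) and `peaks` (coordinates from the density step) are given. Note one simplification: scikit-learn's `GaussianMixture` only *initializes* the means at the peaks and then re-estimates them during EM, whereas the text describes means held fixed; treating initialization as an approximation is this sketch's assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_cluster(xy, peaks):
    """Fit a GMM whose EM starts from the density peaks; return (T, z)."""
    gmm = GaussianMixture(
        n_components=len(peaks),
        means_init=peaks,        # seed each component mean at a density peak
        covariance_type="full",  # learn a full covariance per component
        random_state=0,
    ).fit(xy)
    T = gmm.predict_proba(xy)    # soft memberships T[d, k] = P(z_k = 1 | x_d)
    z = T.argmax(axis=1)         # hard assignment z_d = argmax_k T_kd
    return T, z
```

Because the number of components equals the number of detected peaks, no topic count $K$ is chosen by hand.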

2. Lexical and Semantic Term Importance Fusion

Topeax assigns topic keywords by unifying two orthogonal signals: semantic similarity and lexical association.

  • Semantic score $s_{kj}$: For each cluster $k$, the soft centroid is

$$t_k = \frac{\sum_d T_{kd}\, x_d}{\sum_d T_{kd}}$$

Each word embedding $w_j$ is compared to $t_k$ by cosine similarity, $s_{kj} = \cos(t_k,\, w_j) \in [-1, 1]$.

  • Lexical score $\mathrm{npmi}_{kj}$: Based on normalized pointwise mutual information (NPMI), with Dirichlet prior ($\alpha = 2$), token counts $n_j$ (global) and $n_{jk}$ (within cluster $k$), and smoothed probabilities

$$p(v_j) = \frac{n_j + \alpha}{N + \alpha V}, \qquad p(v_j \mid z_k) = \frac{n_{jk} + \alpha}{n_k + \alpha V}$$

NPMI is then

$$\mathrm{pmi}_{kj} = \log_2 \frac{p(v_j \mid z_k)}{p(v_j)}, \qquad \mathrm{npmi}_{kj} = -\frac{\mathrm{pmi}_{kj}}{\log_2 p(v_j, z_k)}, \qquad p(v_j, z_k) = p(v_j \mid z_k)\, p(z_k)$$

  • Combined score: Both $s_{kj}$ and $\mathrm{npmi}_{kj}$ lie in $[-1, 1]$ and are rescaled to $[0, 1]$ via $(1 + \cdot)/2$; the geometric mean of the rescaled scores yields the final ranking

$$B_{kj} = \sqrt{\Bigl(\tfrac{1 + \mathrm{npmi}_{kj}}{2}\Bigr)\,\Bigl(\tfrac{1 + s_{kj}}{2}\Bigr)}$$

Topic keywords are selected as the highest-ranked terms by $B_{kj}$.
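The three scores above can be combined in one function. The sketch below assumes soft memberships `T` (documents × clusters), document embeddings `X`, word embeddings `W` (vocabulary × dimension), and a document-term count matrix `counts`; all names are illustrative, and attributing per-cluster token counts via the hard assignments $z_d$ is an assumption of this sketch.

```python
import numpy as np

def topic_keywords_scores(T, X, W, counts, alpha=2.0):
    """Fuse semantic and NPMI scores; returns B with shape [K, V]."""
    K = T.shape[1]
    V = counts.shape[1]
    # Soft centroids t_k = sum_d T_kd x_d / sum_d T_kd
    centroids = (T.T @ X) / T.sum(axis=0)[:, None]
    # Semantic score s_kj: cosine similarity of centroid and word vectors.
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    s = cn @ wn.T                                    # [K, V], in [-1, 1]
    # Per-cluster token counts via hard assignments (sketch assumption).
    z = T.argmax(axis=1)
    n_jk = np.stack([counts[z == k].sum(axis=0) for k in range(K)])  # [K, V]
    n_j = counts.sum(axis=0)                         # global counts, [V]
    n_k = n_jk.sum(axis=1)                           # cluster totals, [K]
    N = n_j.sum()
    # Dirichlet-smoothed probabilities with prior alpha.
    p_j = (n_j + alpha) / (N + alpha * V)
    p_j_k = (n_jk + alpha) / (n_k[:, None] + alpha * V)
    p_k = n_k / N
    # NPMI: pmi normalized by -log2 of the joint probability.
    pmi = np.log2(p_j_k / p_j[None, :])
    npmi = -pmi / np.log2(p_j_k * p_k[:, None])
    # Rescale both scores from [-1, 1] to [0, 1]; geometric-mean fusion.
    return np.sqrt(((1 + npmi) / 2) * ((1 + s) / 2))
```

A term that is both frequent within a cluster and close to the cluster centroid scores high on both signals, so the geometric mean suppresses words strong on only one of them.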

3. Complete Pipeline and Hyperparameterization

The Topeax workflow proceeds through the following specific stages:

  1. Sentence encoding: Documents are embedded with a Sentence-Transformer.
  2. Dimensionality reduction: Embeddings are projected to 2D using t-SNE (default perplexity 50; stable for perplexity ≥ 30).
  3. Density estimation: Gaussian KDE is calculated on a $100 \times 100$ grid (Scott's rule for bandwidth).
  4. Peak detection: Grid cells with the highest local density within a 5-unit radius are taken as peaks.
  5. Clustering: GMM is initialized at peaks and fit via EM. Documents are assigned by highest posterior.
  6. Centroid/term scoring: Soft cluster centroids $t_k$ are used for semantic scoring, corpus statistics for lexical scoring, followed by geometric mean fusion.
  7. Topic interpretation: Top keywords for each topic are reported by $B_{kj}$.

Principal hyperparameters are the t-SNE perplexity (recommended: 30–100, default 50), the KDE grid size, the peak-neighborhood radius (5 units), and the Dirichlet smoothing $\alpha$ (default 2). Empirical studies show stable inference of cluster count and topic quality for corpora with sample size $\geq 5000$; for smaller datasets, manual inspection of peaks and keywords is suggested.
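Stages 2 through 5 of the workflow can be wired together in one function. The sketch below assumes precomputed high-dimensional sentence embeddings `E` (stage 1, from any encoder such as a Sentence-Transformer) and uses scikit-learn's t-SNE and SciPy's KDE as stand-ins for Topeax's actual implementation; the function name and the low-density floor `dens > dens.mean()` are this sketch's assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.stats import gaussian_kde
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

def topeax_cluster(E, perplexity=50, grid_size=100, radius=5, seed=0):
    """Stages 2-5: t-SNE projection -> KDE grid -> peak detection -> GMM."""
    # Stage 2: project to 2D with cosine distance (default perplexity 50).
    xy = TSNE(n_components=2, metric="cosine", perplexity=perplexity,
              init="random", random_state=seed).fit_transform(E)
    # Stage 3: Gaussian KDE on a grid_size x grid_size grid (Scott's rule).
    kde = gaussian_kde(xy.T, bw_method="scott")
    gx = np.linspace(xy[:, 0].min(), xy[:, 0].max(), grid_size)
    gy = np.linspace(xy[:, 1].min(), xy[:, 1].max(), grid_size)
    xx, yy = np.meshgrid(gx, gy)
    dens = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(grid_size, grid_size)
    # Stage 4: cells that are maximal within a (2*radius+1)-cell window.
    is_peak = dens == maximum_filter(dens, size=2 * radius + 1)
    is_peak &= dens > dens.mean()  # assumed floor to drop flat regions
    iy, ix = np.nonzero(is_peak)
    peaks = np.column_stack([gx[ix], gy[iy]])
    # Stage 5: GMM seeded at the peaks; hard labels by highest posterior.
    gmm = GaussianMixture(n_components=len(peaks), means_init=peaks,
                          random_state=seed).fit(xy)
    return gmm.predict(xy), len(peaks)
```

The returned label array and inferred topic count then feed the term-scoring stages (6 and 7); note that no target number of clusters is supplied anywhere in the call.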

4. Comparative Performance and Empirical Results

Topeax resolves several known weaknesses of Top2Vec and BERTopic, which use UMAP + HDBSCAN for clustering and either cosine similarity to hard centroids (Top2Vec) or class–TF–IDF for keywords (BERTopic). Major limitations addressed by Topeax include:

  • Hyperparameter instability: UMAP perplexity, HDBSCAN min-cluster-size, and related settings in prior models cause drastic variation in result granularity and topic coherence.
  • Topic count misestimation: Top2Vec/BERTopic frequently over-fit (hundreds of tiny clusters) or under-fit (high outlier rates). Topeax, in contrast, demonstrates a mean absolute percentage error (MAPE) for topic count estimation of $60.5\%$ (SD $= 26.2$), vastly lower than Top2Vec ($1797\%$, SD $= 2623$) and BERTopic ($2439\%$, SD $= 3012$) (see Figure 1 of (Kardos, 29 Jan 2026)).
  • Keyword quality: The lexical-semantic fusion in Topeax prevents the over-inclusion of stop words or rare, semantically irrelevant terms. Internal coherence $C_{\mathrm{in}}$ and overall interpretability $I$ for Topeax are $0.35 \pm 0.15$ and $0.55$, outperforming the $0.21$/$0.24$ ($C_{\mathrm{in}}$) and $0.38$/$0.35$ ($I$) of the baseline methods (Table 1).

Under subsampling, Topeax's number of topics and quality stabilize at $n \geq 5000$ (see Figure 2), while prior models show increasing fragmentation and topic inflation with larger corpora.

5. Stability, Limitations, and Practical Considerations

Topeax is engineered for minimal sensitivity to hyperparameters, with all major steps governed by well-defined statistical criteria (Scott's rule for bandwidth; local maxima for peaks). The main sources of non-determinism are t-SNE initialization and, to a lesser extent, the randomization in GMM expectation-maximization; however, Figure 3 demonstrates rapid stabilization of topic count and quality as perplexity and data size increase.

Practical recommendations include:

  • Use t-SNE perplexity between 30 and 100 (default: 50) for best trade-off between local and global structure.
  • Retain $\alpha = 2$ except in highly sparse corpora, where heavier smoothing may be required.
  • For datasets under 1,000 documents, directly inspect detected peaks and assigned keywords, as low counts may increase the variance of NPMI estimates.

This suggests Topeax is directly applicable to large modern corpora with minimal user intervention while retaining interpretability for smaller samples via manual review.

6. Significance and Outlook

Topeax’s architecture—combining parameter-free cluster count inference, density-peak-driven mixture modeling, and balanced lexical-semantic keyword selection—overcomes several longstanding issues in clustering-based topic models. By tying topic formation to data-intrinsic structure in embedding space and blending both statistical and semantic signals for description, Topeax supports robust, reproducible, and transparent topic modeling. Future directions encompass adaptation to alternative embedding spaces, fine-grained control over the fusion of lexical and semantic scores, and integration with interactive topic exploration pipelines (Kardos, 29 Jan 2026).
