Topeax: Automated Topic Modeling
- Topeax is an automated topic modeling framework that detects dense regions in low-dimensional sentence embedding spaces using Gaussian kernel density estimation.
- It employs a Gaussian Mixture Model with soft cluster assignments to infer the natural number of topics without manual parameter tuning.
- By fusing normalized lexical and semantic scores for keyword selection, Topeax outperforms UMAP+HDBSCAN approaches in stability and interpretability.
Topeax is an automated topic modeling and clustering framework that jointly addresses cluster discovery and keyword selection for natural language corpora, with empirical superiority over UMAP + HDBSCAN models such as Top2Vec and BERTopic. Topeax detects dense regions in low-dimensional sentence embedding spaces via Gaussian kernel density estimation, infers the natural number of topic clusters from these density peaks, and fuses normalized lexical and semantic scores to produce interpretable and robust topic keywords. This framework provides high-quality topic clustering and description that remains stable under changes in sample size and hyperparameter settings, offering a systematized approach to clustering topic modeling for natural language processing applications (Kardos, 29 Jan 2026).
1. Density-Peak-Based Clustering in Embedding Space
Topeax begins by embedding all documents with a Sentence-Transformer model and projecting them to a 2D vector space via nonparametric t-SNE (cosine distance, user-set perplexity, default 50). A Gaussian kernel density estimate (KDE) is computed on a grid covering the embedding space, $\hat{f}(x) = \frac{1}{N h^{2}} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)$, where the bandwidth $h$ is determined by Scott's rule. Local maxima in this density field, each identified within a five-unit-radius neighborhood of grid cells, serve as candidate cluster centers ("peaks").
Topeax eschews hard Voronoi partitioning; instead, it initializes a Gaussian Mixture Model (GMM) with fixed means at these peaks and learns component weights and covariances via the EM algorithm. Each document $d_i$ with embedding $x_i$ receives soft cluster memberships $\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'}\,\mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}$, and final assignments are made by $z_i = \arg\max_k \gamma_{ik}$.
This approach infers the number of clusters (topics) $K$ directly from the data, removing the need for manual selection of $K$.
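The density-peak and GMM stages described above can be sketched with standard scientific-Python tools. This is a minimal illustration, not the reference implementation: the function name `density_peak_gmm`, the grid resolution, and the neighborhood handling via `maximum_filter` are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.ndimage import maximum_filter
from sklearn.mixture import GaussianMixture

def density_peak_gmm(emb_2d, grid_size=100, radius=5, seed=0):
    """Cluster 2D embeddings by KDE density peaks + a peak-initialized GMM."""
    # KDE with Scott's rule bandwidth (the scipy default)
    kde = gaussian_kde(emb_2d.T)
    xs = np.linspace(emb_2d[:, 0].min(), emb_2d[:, 0].max(), grid_size)
    ys = np.linspace(emb_2d[:, 1].min(), emb_2d[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    dens = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(grid_size, grid_size)
    # cells that dominate their radius-cell neighborhood are candidate peaks
    local_max = dens == maximum_filter(dens, size=2 * radius + 1)
    py, px = np.nonzero(local_max)
    peaks = np.column_stack([xs[px], ys[py]])
    # number of components K is inferred from the peaks, not user-supplied
    gmm = GaussianMixture(n_components=len(peaks), means_init=peaks,
                          random_state=seed)
    labels = gmm.fit_predict(emb_2d)   # hard assignment: argmax posterior
    gamma = gmm.predict_proba(emb_2d)  # soft memberships gamma_ik
    return labels, gamma, peaks
```

Initializing the GMM means at the density peaks is what lets EM refine weights and covariances without the peak count itself being a tuned hyperparameter.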
2. Lexical and Semantic Term Importance Fusion
Topeax assigns topic keywords by unifying two orthogonal signals: semantic similarity and lexical association.
- Semantic score ($s_{kj}$): For each cluster $z_k$, the soft centroid is $c_k = \frac{\sum_i \gamma_{ik}\, x_i}{\sum_i \gamma_{ik}}$. Each word embedding $w_j$ (for vocabulary term $v_j$) is compared to $c_k$ by cosine similarity, $s_{kj} = \cos(w_j, c_k)$.
- Lexical score ($\mathrm{npmi}_{kj}$): Based on normalized pointwise mutual information (NPMI), with Dirichlet prior $\alpha$ (default 2), token counts $n_j$ (global) and $n_{kj}$ (within cluster $z_k$), and smoothed probabilities $p(v_j) = \frac{n_j + \alpha}{\sum_{j'} (n_{j'} + \alpha)}$ and $p(v_j \mid z_k) = \frac{n_{kj} + \alpha}{\sum_{j'} (n_{kj'} + \alpha)}$. NPMI is then: $\mathrm{pmi}_{kj} = \log_2 \frac{p(v_j \mid z_k)}{p(v_j)},\; \mathrm{npmi}_{kj} = -\frac{\mathrm{pmi}_{kj}}{\log_2 p(v_j, z_k)},\; p(v_j, z_k) = p(v_j \mid z_k)\, p(z_k)$
- Combined score: Both $s_{kj}$ and $\mathrm{npmi}_{kj}$ lie in $[-1, 1]$ and are normalized to $[0, 1]$ via $x \mapsto (1 + x)/2$; the geometric mean yields the final ranking: $B_{kj} = \sqrt{\Bigl(\tfrac{1 + \mathrm{npmi}_{kj}}{2}\Bigr)\,\Bigl(\tfrac{1 + s_{kj}}{2}\Bigr)}$ Topic keywords are selected as the highest-ranked terms by $B_{kj}$.
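The lexical–semantic fusion for a single cluster can be illustrated as follows. The function name `fused_keyword_scores` and the toy inputs are hypothetical; the sketch assumes only the smoothed NPMI and geometric-mean formulas stated in this section.

```python
import numpy as np

def fused_keyword_scores(n_global, n_cluster, p_cluster, cos_sim, alpha=2.0):
    """Fuse NPMI (lexical) and cosine (semantic) signals into scores B_kj.

    n_global:  global token counts per vocabulary term, shape (V,)
    n_cluster: token counts within one cluster z_k, shape (V,)
    p_cluster: p(z_k), the cluster's marginal probability
    cos_sim:   cosine similarity of each term embedding to the soft centroid
    """
    # Dirichlet-smoothed probabilities p(v_j) and p(v_j | z_k)
    p_vj = (n_global + alpha) / (n_global + alpha).sum()
    p_vj_given_zk = (n_cluster + alpha) / (n_cluster + alpha).sum()
    pmi = np.log2(p_vj_given_zk / p_vj)
    p_joint = p_vj_given_zk * p_cluster          # p(v_j, z_k)
    npmi = -pmi / np.log2(p_joint)
    # map both signals from [-1, 1] to [0, 1], then take the geometric mean
    return np.sqrt(((1 + npmi) / 2) * ((1 + cos_sim) / 2))
```

A term that is both over-represented in the cluster and close to the soft centroid scores high on both factors, which is exactly what the geometric mean rewards.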
3. Complete Pipeline and Hyperparameterization
The Topeax workflow proceeds through the following specific stages:
- Sentence encoding: Documents are embedded with a Sentence-Transformer.
- Dimensionality reduction: Embeddings are projected to 2D using t-SNE (default perplexity 50; stable for perplexity ≥ 30).
- Density estimation: Gaussian KDE is calculated on a grid (Scott’s rule for bandwidth).
- Peak detection: Grid cells with the highest local density within a 5-unit radius are taken as peaks.
- Clustering: GMM is initialized at peaks and fit via EM. Documents are assigned by highest posterior.
- Centroid/term scoring: Soft cluster centroids are used for semantic scoring, corpus statistics for lexical scoring, followed by geometric mean fusion.
- Topic interpretation: Top keywords for each topic are reported in descending order of $B_{kj}$.
Principal hyperparameters are the t-SNE perplexity (recommended: 30–100, default 50), the KDE grid size, the peak-neighborhood radius (5 units), and the Dirichlet smoothing prior $\alpha$ (default 2). Empirical studies show stable inference of cluster count and topic quality for corpora of roughly 1,000 documents or more; for smaller datasets, manual inspection of peaks and keywords is suggested.
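For illustration, the dimensionality-reduction stage of the pipeline might look like this with scikit-learn's t-SNE. The helper name `project_embeddings` and the perplexity clipping for small corpora are assumptions of this sketch, not part of Topeax itself.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_embeddings(embeddings, perplexity=50, seed=0):
    """Project sentence embeddings to 2D with t-SNE under cosine distance,
    as in the Topeax pipeline (perplexity default 50)."""
    # t-SNE requires perplexity well below the sample count; clip defensively
    perplexity = min(perplexity, (len(embeddings) - 1) / 3)
    tsne = TSNE(n_components=2, metric="cosine", perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(embeddings)
```

In practice the input would come from a Sentence-Transformer encoder; any array of shape (n_documents, embedding_dim) works here.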
4. Comparative Performance and Empirical Results
Topeax resolves several known weaknesses of Top2Vec and BERTopic, which use UMAP + HDBSCAN for clustering and either cosine similarity to hard centroids (Top2Vec) or class–TF–IDF for keywords (BERTopic). Major limitations addressed by Topeax include:
- Hyperparameter instability: UMAP neighborhood size, HDBSCAN min-cluster-size, and related settings in prior models cause drastic variation in result granularity and topic coherence.
- Topic count misestimation: Top2Vec/BERTopic frequently over-fit (hundreds of tiny clusters) or under-fit (high outlier rates). Topeax, in contrast, achieves a markedly lower mean absolute percentage error (MAPE) in topic count estimation than Top2Vec and BERTopic (see Figure 1 of (Kardos, 29 Jan 2026)).
- Keyword quality: The lexical–semantic fusion in Topeax prevents the over-inclusion of stop words or rare, semantically irrelevant terms. Overall interpretability for Topeax reaches $0.55$, against $0.38$/$0.35$ for the baselines, and its internal coherence ($C_{\mathrm{in}}$) likewise exceeds their $0.21$/$0.24$ (Table 1).
Under subsampling, Topeax’s number of topics and topic quality stabilize as the sample size grows (see Figure 2), while prior models show increasing fragmentation and topic inflation with larger corpora.
5. Stability, Limitations, and Practical Considerations
Topeax is engineered for minimal sensitivity to hyperparameters, with all major steps governed by well-defined statistical criteria (Scott's rule for bandwidth; local maxima for peaks). The main sources of non-determinism are t-SNE initialization and, to a lesser extent, the randomization in GMM expectation-maximization; however, Figure 3 demonstrates rapid stabilization of topic count and quality as perplexity and data size increase.
Practical recommendations include:
- Use t-SNE perplexity between 30 and 100 (default: 50) for best trade-off between local and global structure.
- Retain the default Dirichlet prior $\alpha = 2$ except in highly sparse corpora, where heavier smoothing may be required.
- For datasets under 1,000 documents, directly inspect detected peaks and assigned keywords, as low token counts may inflate variance in NPMI estimates.
This suggests Topeax is directly applicable to large modern corpora with minimal user intervention while retaining interpretability for smaller samples via manual review.
6. Significance and Outlook
Topeax’s architecture—combining parameter-free cluster count inference, density-peak-driven mixture modeling, and balanced lexical-semantic keyword selection—overcomes several longstanding issues in clustering-based topic models. By tying topic formation to data-intrinsic structure in embedding space and blending both statistical and semantic signals for description, Topeax supports robust, reproducible, and transparent topic modeling. Future directions encompass adaptation to alternative embedding spaces, fine-grained control over the fusion of lexical and semantic scores, and integration with interactive topic exploration pipelines (Kardos, 29 Jan 2026).