FT-Topic/SenClu: Interpretable Topic & Sentence Clustering
- FT-Topic/SenClu is a framework that integrates deep embedding fine-tuning with EM-style hard clustering for interpretable topic modeling and sentence-level grouping.
- It leverages unsupervised triplet loss training and bag-of-sentences approaches to improve topic coherence (PMI) and clustering quality (NMI) over classical models.
- The method offers extensibility to evolving topics and streaming data with explicit control over document-topic diversity through prior annealing.
FT-Topic/SenClu is a suite of methods for interpretable, high-coherence topic modeling and sentence-level clustering, at the intersection of modern deep embedding algorithms and classical EM-style clustering approaches. The framework encompasses methods for self-supervised fine-tuning of sentence encoders (FT-Topic), high-speed expectation-maximization topic inference over “bags of sentences” (SenClu), and provides extensibility to sentence-level and evolving-topic applications, including integration with LLMs and contextualized clustering (Schneider, 2024). FT-Topic/SenClu advances over classical models by combining representation learning, probabilistic inference, and controllable document-topic priors, yielding state-of-the-art topic coherence and fine-grained control over per-document topic diversity.
1. Automatic Construction of Unsupervised Fine-Tuning Sets (FT-Topic)
FT-Topic is an approach for unsupervised representation learning tailored for topic modeling pipelines. Given a document corpus partitioned into non-overlapping groups of sentences (“bags of sentences”), FT-Topic generates a triplet training dataset by leveraging local sequential context to heuristically assign pseudo-labels:
- The anchor is a sentence group $g_i$.
- The positive is an immediately neighboring group within the same document ($g_{i-1}$ or $g_{i+1}$).
- The negative is a randomly sampled group from a different document.
To improve label quality, FT-Topic computes group embeddings using a baseline (off-the-shelf) encoder and prunes the triplets least likely to be correctly pseudo-labeled: those with the largest anchor-positive distance and those with the worst margin-difference scores. The pruned fractions are set empirically (Schneider, 2024).
The remaining triplets train the encoder under the standard triplet loss $\mathcal{L}(a, p, n) = \max\bigl(0,\; d(e(a), e(p)) - d(e(a), e(n)) + m\bigr)$, where $e(\cdot)$ is the encoder, $d$ a distance, and $m$ a margin. Fine-tuning is typically performed for four epochs; the number of negatives per anchor and the group size (in sentences) are fixed hyperparameters.
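The mining-and-pruning procedure above can be sketched in a few lines. This is a minimal illustration: `build_triplets`, the toy `embed` callable, and the `prune_frac` default are illustrative names and settings, not the paper's exact implementation.

```python
import numpy as np

def build_triplets(doc_groups, embed, prune_frac=0.2, rng=None):
    """Heuristic triplet mining over "bags of sentences", FT-Topic style.

    doc_groups: list of documents, each a list of sentence-group strings.
    embed:      callable mapping a group to a 1-D vector (stand-in for an
                off-the-shelf sentence encoder).
    prune_frac: fraction of triplets with the largest anchor-positive
                distance to discard (illustrative value).
    """
    rng = rng or np.random.default_rng(0)
    triplets = []
    for d, groups in enumerate(doc_groups):
        for i in range(len(groups) - 1):
            # Anchor and its sequential neighbor form the positive pair.
            a, p = groups[i], groups[i + 1]
            # Negative: a random group from a *different* document.
            other = rng.choice([j for j in range(len(doc_groups)) if j != d])
            n = doc_groups[other][rng.integers(len(doc_groups[other]))]
            triplets.append((a, p, n))
    # Prune triplets whose anchor-positive distance is largest, i.e. where
    # the "neighboring groups share a topic" heuristic is weakest.
    dists = [np.linalg.norm(embed(a) - embed(p)) for a, p, _ in triplets]
    keep = np.argsort(dists)[: int(len(triplets) * (1 - prune_frac))]
    return [triplets[i] for i in keep]

def triplet_loss(ea, ep, en, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, np.linalg.norm(ea - ep) - np.linalg.norm(ea - en) + margin)
```

In practice the surviving triplets would be fed to a sentence-encoder fine-tuning loop (e.g. SGD on the loss above); here the loss is shown standalone for clarity.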
Empirically, this process yields embeddings that outperform un-tuned BERT/SBERT on topic coherence and normalized mutual information (Schneider, 2024).
2. The SenClu EM-Style Bag-of-Sentences Topic Model
SenClu is a hard-EM, centroid-based topic model designed for sentence (or sentence-group) granularity. Each sentence group $g$ in document $d$ is embedded as a vector $e(g)$, and the $k$ topics are cluster centroids $t_1, \dots, t_k$ in the same embedding space. The model assigns each group to exactly one topic per epoch using an E/M-step approximation of the aspect model.
- Probabilistic formulation: $p(t \mid d, g) \propto p(t \mid d)\, p(g \mid t)$, with likelihood proxy $p(g \mid t) \propto \mathrm{sim}(e(g), t)$, the cosine similarity between the group embedding and the topic centroid.
- Prior encoding: $p(t \mid d) = \dfrac{n_{t,d} + \alpha}{n_d + k\,\alpha}$, where $n_{t,d}$ is the count of sentence groups assigned to topic $t$ in $d$, $n_d$ the total number of groups in $d$, $k$ the number of topics, and $\alpha$ a smoothing constant annealed from $8$ down to a user-specified value.
- E-step: For each group, assign $\hat{t}(g) = \arg\max_{t}\; p(t \mid d)\,\mathrm{sim}(e(g), t)$.
- M-step: Update each centroid as the (normalized) mean of the embeddings of the groups assigned to its topic, and recompute the counts $n_{t,d}$; $\alpha$ is annealed over the epochs from its initial value toward the user-specified target.
Annealing $\alpha$ controls per-document topic sparsity: a higher value enforces a broader topic mixture, while a lower value yields sparser, more focused assignments.
Convergence is typically achieved within 10 epochs; in early epochs a second-best topic assignment is occasionally used to escape local optima. Computational complexity per epoch is $O(G \cdot k \cdot h)$ for $G$ groups, $k$ topics, and embedding dimension $h$.
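The E/M loop above can be sketched with numpy. This is a minimal sketch in the spirit of SenClu, assuming unit-normalized embeddings; the linear annealing schedule and the names (`senclu`, `alpha_start`, `init_idx`) are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def senclu(E, doc_of, k, epochs=10, alpha_start=8.0, alpha_end=0.5,
           init_idx=None, seed=0):
    """Hard-EM bag-of-sentences clustering sketch.

    E:      (G, h) array of unit-normalized sentence-group embeddings.
    doc_of: length-G array mapping each group to its document id.
    k:      number of topics; alpha_* define a simple linear annealing
            schedule for the document-topic smoothing prior.
    """
    rng = np.random.default_rng(seed)
    n_docs = doc_of.max() + 1
    if init_idx is not None:
        T = E[np.asarray(init_idx)].copy()        # user-chosen seeds
    else:
        T = E[rng.choice(len(E), size=k, replace=False)].copy()
    z = rng.integers(k, size=len(E))              # hard assignments
    for epoch in range(epochs):
        alpha = alpha_start + (alpha_end - alpha_start) * epoch / max(epochs - 1, 1)
        # Document-topic prior p(t|d) from current hard counts + smoothing.
        counts = np.zeros((n_docs, k))
        np.add.at(counts, (doc_of, z), 1.0)
        prior = (counts + alpha) / (counts.sum(1, keepdims=True) + k * alpha)
        # E-step: assign each group to argmax_t p(t|d) * sim(e(g), t).
        sim = E @ T.T                              # cosine (unit vectors)
        z = np.argmax(prior[doc_of] * np.clip(sim, 1e-9, None), axis=1)
        # M-step: recompute centroids as normalized means of assigned groups.
        for t in range(k):
            if (z == t).any():
                c = E[z == t].mean(0)
                T[t] = c / (np.linalg.norm(c) + 1e-12)
    return z, T
```

The hard assignment keeps each epoch at $O(G \cdot k \cdot h)$, matching the complexity noted above; similarities are clipped to a small positive value so the prior never flips the ranking of negative similarities.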
3. Integration of Fine-Tuned Embeddings and Downstream Benefits
Fine-tuned embeddings from FT-Topic are used in SenClu or any embedding-based clustering pipeline, replacing static LM features. Empirical studies show that FT-Topic+SenClu outperforms LDA, BERTopic, and TopClus in both PMI topic coherence and NMI clustering quality, with particularly large gains observed on the 20News, NYT, Gutenberg, and Yelp datasets (Schneider, 2024). On 20News, for example, FT-Topic+SenClu clearly exceeds LDA's PMI of $0.35$ and its NMI of $0.24$ for clustering alignment.
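As a concrete reference for the PMI metric used in these comparisons, one common document-co-occurrence estimate can be computed as follows; normalization conventions vary between papers, and `pmi_coherence` is an illustrative helper, not the paper's exact implementation.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, estimated from
    document-level co-occurrence: mean over pairs of
    log( p(w_i, w_j) / (p(w_i) * p(w_j)) )."""
    n = len(docs)
    docsets = [set(d) for d in docs]          # word presence per document
    def p(*ws):
        return sum(all(w in s for w in ws) for s in docsets) / n
    pairs = list(combinations(top_words, 2))
    return sum(
        math.log((p(a, b) + eps) / (p(a) * p(b) + eps)) for a, b in pairs
    ) / len(pairs)
```

Words that systematically co-occur score positively; words that never share a document score strongly negatively, which is why incoherent topics drag the average down.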
A summary table for key pipeline stages:
| Stage | Algorithmic Choice | Key Hyperparameters |
|---|---|---|
| Group construction | Non-overlapping $n$-sentence windows | group size $n$ |
| FT-Topic triplet mining | Sequential positives, random negatives, pruning | pruning fractions (set empirically) |
| Encoder fine-tuning | Triplet loss SGD | epochs $= 4$, margin $m$ |
| SenClu clustering | Hard-EM, cosine similarity, annealed prior | $k$ (topics), prior $\alpha$ |
Empirical ablation shows that embedding fine-tuning improves both PMI and NMI by 1–5 and 0.01–0.05 points, respectively. Filtering of triplets consistently provides further benefit over no filtering.
4. Control of Document-Topic Distributions via Prior Annealing
SenClu enables researchers to encode prior knowledge about document-topic diversity directly through the $\alpha$ parameter, analogous to the Dirichlet prior in LDA. A larger $\alpha$ yields more uniform document-topic distributions (more topics per document on average), while a smaller $\alpha$ results in sparser topic coverage—allowing explicit control over whether documents are guaranteed to be multi-topical or more focused.
Annealing $\alpha$ from its high initial value ($8$) down to the user-specified target during EM iterations improves optimization stability and avoids poor local optima (Schneider, 2024).
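A tiny numeric illustration of how $\alpha$ trades uniformity against sparsity in the smoothed prior $p(t \mid d) = (n_{t,d} + \alpha)/(n_d + k\alpha)$; the counts below are hypothetical.

```python
import math

def doc_topic_prior(counts, alpha):
    """Smoothed document-topic prior for one document:
    p(t|d) = (n_td + alpha) / (n_d + k * alpha)."""
    k, n = len(counts), sum(counts)
    return [(c + alpha) / (n + k * alpha) for c in counts]

def entropy(p):
    """Shannon entropy; higher means a more uniform topic mixture."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical document: 8 of its 10 sentence groups sit in topic 0.
counts = [8, 1, 1, 0]
broad   = doc_topic_prior(counts, alpha=8.0)   # large alpha: near-uniform
focused = doc_topic_prior(counts, alpha=0.1)   # small alpha: sparse, peaked
```

With $\alpha = 8$ the prior stays close to uniform despite the skewed counts, so the document retains a broad topic mixture; with $\alpha = 0.1$ the dominant topic's mass grows sharply, matching the sparsity behavior described above.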
5. Extensions: Sentence-Level and Streaming/Time-Evolving Topics
Adapting FT-Topic/SenClu to sentence-level clustering is straightforward: treat each sentence (or arbitrary short group) as a single “document,” build the sentence-by-vocabulary (tf-idf or embedding) matrix, and proceed identically through the pipeline (Murfi et al., 2021). For streaming or time-evolving corpora, incremental clustering or online fine-tuning of the encoder offers a promising direction, although a fully streaming variant of FT-Topic/SenClu is not yet established.
The framework is also compatible with alternative clustering objectives—for example, fuzzy c-means (DFCM) for soft topic memberships at the sentence level—with analogous improvements in topic coherence over NMF/LDA baselines (Murfi et al., 2021).
6. Computational Efficiency and Runtime Considerations
FT-Topic fine-tuning (on contemporary hardware) adds $20$–$50$ minutes to total pipeline runtime, while SenClu inference remains efficient ($2$–$6$ minutes for moderate-sized corpora). This is a significant speedup over embedding-based hierarchical clustering methods (e.g., TopClus: 150 minutes) (Schneider, 2024). Inference speed after fine-tuning is unaffected. For massive data, batched clustering over sentence groups and hard EM-style assignments maintain practical scalability.
7. Empirical Performance and Qualitative Topic Interpretability
On canonical topic modeling benchmarks, FT-Topic/SenClu offers state-of-the-art or near-best performance among embedding-based and classical models. For 20News, SenClu clearly surpasses LDA on both PMI (LDA: $0.35$) and NMI (LDA: $0.24$). Similar relative improvements are observed across NYT, Gutenberg, and Yelp. Data filtering in FT-Topic demonstrably increases topic coherence and coverage (Table 6) (Schneider, 2024).
Qualitatively, topic clusters generated by FT-Topic/SenClu exhibit compact, interpretable word sets and assign relevant sentences/groups sharply, while giving users explicit control over multi-topic document representation. The approach also extends naturally to pipelines for short or informal texts, such as tweets, where classical topic models underperform.
References
- "Topic Modeling with Fine-tuning LLMs and Bag of Sentences" (Schneider, 2024)
- "Deep Autoencoder-based Fuzzy C-Means for Topic Detection" (Murfi et al., 2021)
- "TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction" (Fujita et al., 2025)