FT-Topic/SenClu: Interpretable Topic & Sentence Clustering
- FT-Topic/SenClu is a framework that integrates deep embedding fine-tuning with EM-style hard clustering for interpretable topic modeling and sentence-level grouping.
- It leverages unsupervised triplet loss training and bag-of-sentences approaches to improve topic coherence (PMI) and clustering quality (NMI) over classical models.
- The method offers extensibility to evolving topics and streaming data with explicit control over document-topic diversity through prior annealing.
FT-Topic/SenClu is a suite of methods for interpretable, high-coherence topic modeling and sentence-level clustering, at the intersection of modern deep embedding algorithms and classical EM-style clustering approaches. The framework encompasses methods for self-supervised fine-tuning of sentence encoders (FT-Topic), high-speed expectation-maximization topic inference over “bags of sentences” (SenClu), and provides extensibility to sentence-level and evolving-topic applications, including integration with LLMs and contextualized clustering (Schneider, 2024). FT-Topic/SenClu advances over classical models by combining representation learning, probabilistic inference, and controllable document-topic priors, yielding state-of-the-art topic coherence and fine-grained control over per-document topic diversity.
1. Automatic Construction of Unsupervised Fine-Tuning Sets (FT-Topic)
FT-Topic is an approach for unsupervised representation learning tailored for topic modeling pipelines. Given a document corpus partitioned into non-overlapping groups of sentences (“bags of sentences”), FT-Topic generates a triplet training dataset by leveraging local sequential context to heuristically assign pseudo-labels:
- The anchor is a sentence group $g_i$.
- The positive is an immediately neighboring group within the same document ($g_{i-1}$ or $g_{i+1}$).
- The negative is a randomly sampled group from a different document.
To improve label quality, FT-Topic computes group embeddings using a baseline (off-the-shelf) encoder and prunes the triplets least likely to be correctly pseudo-labeled: those with the largest anchor-positive distance and those with the worst margin-difference scores. The pruned fractions are set empirically (Schneider, 2024).
The remaining triplets train the encoder under the standard triplet loss $\mathcal{L}(a, p, n) = \max\bigl(0,\; d(e(a), e(p)) - d(e(a), e(n)) + m\bigr)$, where $e(\cdot)$ is the encoder, $d$ a distance, and $m$ a margin. Fine-tuning is typically performed for four epochs; the number of negatives per anchor and the group size (in sentences) are fixed hyperparameters.
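The mining-and-pruning procedure above can be sketched in a few lines. This is a minimal illustration: `build_triplets`, the toy `embed` callable, and the `prune_frac` default are illustrative names and settings, not the paper's exact implementation.

```python
import numpy as np

def build_triplets(doc_groups, embed, prune_frac=0.2, rng=None):
    """Heuristic triplet mining over "bags of sentences", FT-Topic style.

    doc_groups: list of documents, each a list of sentence-group strings.
    embed:      callable mapping a group to a 1-D vector (stand-in for an
                off-the-shelf sentence encoder).
    prune_frac: fraction of triplets with the largest anchor-positive
                distance to discard (illustrative value).
    """
    rng = rng or np.random.default_rng(0)
    triplets = []
    for d, groups in enumerate(doc_groups):
        for i in range(len(groups) - 1):
            # Anchor and its sequential neighbor form the positive pair.
            a, p = groups[i], groups[i + 1]
            # Negative: a random group from a *different* document.
            other = rng.choice([j for j in range(len(doc_groups)) if j != d])
            n = doc_groups[other][rng.integers(len(doc_groups[other]))]
            triplets.append((a, p, n))
    # Prune triplets whose anchor-positive distance is largest, i.e. where
    # the "neighboring groups share a topic" heuristic is weakest.
    dists = [np.linalg.norm(embed(a) - embed(p)) for a, p, _ in triplets]
    keep = np.argsort(dists)[: int(len(triplets) * (1 - prune_frac))]
    return [triplets[i] for i in keep]

def triplet_loss(ea, ep, en, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, np.linalg.norm(ea - ep) - np.linalg.norm(ea - en) + margin)
```

In practice the surviving triplets would be fed to a sentence-encoder fine-tuning loop (e.g. SGD on the loss above); here the loss is shown standalone for clarity.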
Empirically, this process yields embeddings that outperform un-tuned BERT/SBERT on topic coherence and normalized mutual information (Schneider, 2024).
2. The SenClu EM-Style Bag-of-Sentences Topic Model
SenClu is a hard-EM, centroid-based topic model designed for sentence (or sentence-group) granularity. Each sentence group $g$ in document $d$ is embedded as a vector $e(g)$, and the $k$ topics are cluster centroids $t_1, \dots, t_k$ in the same embedding space. The model assigns each group to exactly one topic per epoch using an E/M-step approximation of the aspect model.
- Probabilistic formulation: $p(t \mid d, g) \propto p(t \mid d)\, p(g \mid t)$, with likelihood proxy $p(g \mid t) \propto \mathrm{sim}(e(g), t)$, the cosine similarity between the group embedding and the topic centroid.
- Prior encoding: $p(t \mid d) = \dfrac{n_{t,d} + \alpha}{n_d + k\,\alpha}$, where $n_{t,d}$ is the count of sentence groups assigned to topic $t$ in $d$, $n_d$ the total number of groups in $d$, $k$ the number of topics, and $\alpha$ a smoothing constant annealed from $8$ down to a user-specified value.
- E-step: For each group, assign $\hat{t}(g) = \arg\max_{t}\; p(t \mid d)\,\mathrm{sim}(e(g), t)$.
- M-step: Update each centroid as the (normalized) mean of the embeddings of the groups assigned to its topic, and recompute the counts $n_{t,d}$; $\alpha$ is annealed over the epochs from its initial value toward the user-specified target.
Annealing $\alpha$ controls per-document topic sparsity: a higher value enforces a broader topic mixture, while a lower value yields sparser, more focused assignments.
Convergence is typically achieved within 10 epochs; in early epochs a second-best topic assignment is occasionally used to escape local optima. Computational complexity per epoch is $O(G \cdot k \cdot h)$ for $G$ groups, $k$ topics, and embedding dimension $h$.
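The E/M loop above can be sketched with numpy. This is a minimal sketch in the spirit of SenClu, assuming unit-normalized embeddings; the linear annealing schedule and the names (`senclu`, `alpha_start`, `init_idx`) are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def senclu(E, doc_of, k, epochs=10, alpha_start=8.0, alpha_end=0.5,
           init_idx=None, seed=0):
    """Hard-EM bag-of-sentences clustering sketch.

    E:      (G, h) array of unit-normalized sentence-group embeddings.
    doc_of: length-G array mapping each group to its document id.
    k:      number of topics; alpha_* define a simple linear annealing
            schedule for the document-topic smoothing prior.
    """
    rng = np.random.default_rng(seed)
    n_docs = doc_of.max() + 1
    if init_idx is not None:
        T = E[np.asarray(init_idx)].copy()        # user-chosen seeds
    else:
        T = E[rng.choice(len(E), size=k, replace=False)].copy()
    z = rng.integers(k, size=len(E))              # hard assignments
    for epoch in range(epochs):
        alpha = alpha_start + (alpha_end - alpha_start) * epoch / max(epochs - 1, 1)
        # Document-topic prior p(t|d) from current hard counts + smoothing.
        counts = np.zeros((n_docs, k))
        np.add.at(counts, (doc_of, z), 1.0)
        prior = (counts + alpha) / (counts.sum(1, keepdims=True) + k * alpha)
        # E-step: assign each group to argmax_t p(t|d) * sim(e(g), t).
        sim = E @ T.T                              # cosine (unit vectors)
        z = np.argmax(prior[doc_of] * np.clip(sim, 1e-9, None), axis=1)
        # M-step: recompute centroids as normalized means of assigned groups.
        for t in range(k):
            if (z == t).any():
                c = E[z == t].mean(0)
                T[t] = c / (np.linalg.norm(c) + 1e-12)
    return z, T
```

The hard assignment keeps each epoch at $O(G \cdot k \cdot h)$, matching the complexity noted above; similarities are clipped to a small positive value so the prior never flips the ranking of negative similarities.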
3. Integration of Fine-Tuned Embeddings and Downstream Benefits
Fine-tuned embeddings from FT-Topic are used in SenClu or any embedding-based clustering pipeline, replacing static LM features. Empirical studies show that FT-Topic+SenClu outperforms LDA, BERTopic, and TopClus in both PMI topic coherence and NMI clustering quality, with particularly large gains observed on the 20News, NYT, Gutenberg, and Yelp datasets (Schneider, 2024). On 20News, for example, FT-Topic+SenClu clearly exceeds LDA's PMI of $0.35$ and its NMI of $0.24$ for clustering alignment.
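As a concrete reference for the PMI metric used in these comparisons, one common document-co-occurrence estimate can be computed as follows; normalization conventions vary between papers, and `pmi_coherence` is an illustrative helper, not the paper's exact implementation.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, estimated from
    document-level co-occurrence: mean over pairs of
    log( p(w_i, w_j) / (p(w_i) * p(w_j)) )."""
    n = len(docs)
    docsets = [set(d) for d in docs]          # word presence per document
    def p(*ws):
        return sum(all(w in s for w in ws) for s in docsets) / n
    pairs = list(combinations(top_words, 2))
    return sum(
        math.log((p(a, b) + eps) / (p(a) * p(b) + eps)) for a, b in pairs
    ) / len(pairs)
```

Words that systematically co-occur score positively; words that never share a document score strongly negatively, which is why incoherent topics drag the average down.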
A summary table for key pipeline stages:
| Stage | Algorithmic Choice | Key Hyperparameters |
|---|---|---|
| Group construction | Non-overlapping $n$-sentence windows | group size $n$ |
| FT-Topic triplet mining | Sequential positives, random negatives, pruning | pruning fractions (set empirically) |
| Encoder fine-tuning | Triplet loss SGD | epochs $= 4$, margin $m$ |
| SenClu clustering | Hard-EM, cosine similarity, annealed prior | $k$ (topics), prior $\alpha$ |
Empirical ablation shows that embedding fine-tuning improves both PMI and NMI by 1–5 and 0.01–0.05 points, respectively. Filtering of triplets consistently provides further benefit over no filtering.
4. Control of Document-Topic Distributions via Prior Annealing
SenClu enables researchers to encode prior knowledge about document-topic diversity directly through the $\alpha$ parameter, analogous to the Dirichlet prior in LDA. A larger $\alpha$ yields more uniform document-topic distributions (more topics per document on average), while a smaller $\alpha$ results in sparser topic coverage—allowing explicit control over whether documents are guaranteed to be multi-topical or more focused.
Annealing $\alpha$ from its high initial value ($8$) down to the user-specified target during EM iterations improves optimization stability and avoids poor local optima (Schneider, 2024).
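A tiny numeric illustration of how $\alpha$ trades uniformity against sparsity in the smoothed prior $p(t \mid d) = (n_{t,d} + \alpha)/(n_d + k\alpha)$; the counts below are hypothetical.

```python
import math

def doc_topic_prior(counts, alpha):
    """Smoothed document-topic prior for one document:
    p(t|d) = (n_td + alpha) / (n_d + k * alpha)."""
    k, n = len(counts), sum(counts)
    return [(c + alpha) / (n + k * alpha) for c in counts]

def entropy(p):
    """Shannon entropy; higher means a more uniform topic mixture."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical document: 8 of its 10 sentence groups sit in topic 0.
counts = [8, 1, 1, 0]
broad   = doc_topic_prior(counts, alpha=8.0)   # large alpha: near-uniform
focused = doc_topic_prior(counts, alpha=0.1)   # small alpha: sparse, peaked
```

With $\alpha = 8$ the prior stays close to uniform despite the skewed counts, so the document retains a broad topic mixture; with $\alpha = 0.1$ the dominant topic's mass grows sharply, matching the sparsity behavior described above.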
5. Extensions: Sentence-Level and Streaming/Time-Evolving Topics
Adapting FT-Topic/SenClu to sentence-level clustering is straightforward: treat each sentence (or arbitrary short group) as a single “document,” build the sentence-by-vocabulary (tf-idf or embedding) matrix, and proceed identically through the pipeline (Murfi et al., 2021). For streaming or time-evolving corpora, incremental clustering or online fine-tuning of the encoder offers a promising direction, although a fully streaming variant of FT-Topic/SenClu is not yet established.
The framework is also compatible with alternative clustering objectives—for example, fuzzy c-means (DFCM) for soft topic memberships at the sentence level—with analogous improvements in topic coherence over NMF/LDA baselines (Murfi et al., 2021).
6. Computational Efficiency and Runtime Considerations
FT-Topic fine-tuning (on contemporary hardware) adds $20$–$50$ minutes to total pipeline runtime, while SenClu inference remains efficient ($2$–$6$ minutes for moderate-sized corpora). This is a significant speedup over embedding-based hierarchical clustering methods (e.g., TopClus: 150 minutes) (Schneider, 2024). Inference speed after fine-tuning is unaffected. For massive data, batched clustering over sentence groups and hard EM-style assignments maintain practical scalability.
7. Empirical Performance and Qualitative Topic Interpretability
On canonical topic modeling benchmarks, FT-Topic/SenClu offers state-of-the-art or near-best performance among embedding-based and classical models. For 20News, SenClu clearly surpasses LDA on both PMI (LDA: $0.35$) and NMI (LDA: $0.24$). Similar relative improvements are observed across NYT, Gutenberg, and Yelp. Data filtering in FT-Topic demonstrably increases topic coherence and coverage (Table 6) (Schneider, 2024).
Qualitatively, topic clusters generated by FT-Topic/SenClu exhibit compact, interpretable word sets and assign relevant sentences/groups sharply, while giving users explicit control over multi-topic document representation. The approach also extends naturally to pipelines for short or informal texts, such as tweets, where classical topic models underperform.
References
- "Topic Modeling with Fine-tuning LLMs and Bag of Sentences" (Schneider, 2024)
- "Deep Autoencoder-based Fuzzy C-Means for Topic Detection" (Murfi et al., 2021)
- "TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction" (Fujita et al., 2025)