
FT-Topic/SenClu: Interpretable Topic & Sentence Clustering

Updated 15 December 2025
  • FT-Topic/SenClu is a framework that integrates deep embedding fine-tuning with EM-style hard clustering for interpretable topic modeling and sentence-level grouping.
  • It leverages unsupervised triplet loss training and bag-of-sentences approaches to improve topic coherence (PMI) and clustering quality (NMI) over classical models.
  • The method offers extensibility to evolving topics and streaming data with explicit control over document-topic diversity through prior annealing.

FT-Topic/SenClu is a suite of methods for interpretable, high-coherence topic modeling and sentence-level clustering, situated at the intersection of modern deep embedding algorithms and classical EM-style clustering. The framework encompasses self-supervised fine-tuning of sentence encoders (FT-Topic), high-speed expectation-maximization topic inference over “bags of sentences” (SenClu), and extensions to sentence-level and evolving-topic applications, including integration with LLMs and contextualized clustering (Schneider, 2024). FT-Topic/SenClu advances over classical models by combining representation learning, probabilistic inference, and controllable document-topic priors, yielding state-of-the-art topic coherence and fine-grained control over per-document topic diversity.

1. Automatic Construction of Unsupervised Fine-Tuning Sets (FT-Topic)

FT-Topic is an approach for unsupervised representation learning tailored for topic modeling pipelines. Given a document corpus $\mathcal{D}$ partitioned into non-overlapping groups of $n_s$ sentences (“bags of sentences”), FT-Topic generates a triplet training dataset $T = (A, P, N)$ by leveraging local sequential context to heuristically assign pseudo-labels:

  • Anchor $A$ is a sentence group $g_i$.
  • Positive $P$ is an immediate neighbor group within the same document ($g_{i+1}$ or $g_{i-1}$).
  • Negative $N$ is a randomly sampled group from a different document.
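
The grouping and mining heuristic above can be sketched as follows. This is a minimal illustration, not the reference implementation; the function name `mine_triplets` and its argument layout are our own:

```python
import random

def mine_triplets(docs, n_s=3, n_neg=2, seed=0):
    """Heuristic triplet mining: anchors are sentence groups, positives are
    adjacent groups in the same document, negatives come from other documents."""
    rng = random.Random(seed)
    # Split each document into non-overlapping groups of n_s sentences.
    grouped = [[doc[i:i + n_s] for i in range(0, len(doc), n_s)] for doc in docs]
    triplets = []
    for d, groups in enumerate(grouped):
        for i, anchor in enumerate(groups):
            # Positive: an immediate neighbor group (g_{i+1} or g_{i-1}).
            neighbors = [g for j, g in enumerate(groups) if abs(j - i) == 1]
            if not neighbors:
                continue
            positive = rng.choice(neighbors)
            # Negatives: randomly sampled groups from different documents.
            other = [g for d2, gs in enumerate(grouped) if d2 != d for g in gs]
            for _ in range(n_neg):
                triplets.append((anchor, positive, rng.choice(other)))
    return triplets
```

With $n_{neg}=2$, each anchor group contributes two triplets, so the training set scales linearly with corpus size.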

To improve label quality, FT-Topic computes group embeddings using a baseline (off-the-shelf) encoder $E$ and removes the triplets with the largest intra-positive distance (an $f_{pos}$ fraction pruned) and those with the worst margin-difference scores (an $f_{tri}$ fraction pruned). The pruning fractions are set empirically: $f_{pos} = 0.08$, $f_{tri} = 0.24$ (Schneider, 2024).
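
The two-stage filtering can be sketched in numpy, under the interpretation that the largest anchor-positive distances and then the smallest negative-minus-positive margins are pruned; `prune_triplets` and `encode` are our names, and `encode` stands in for the baseline encoder $E$:

```python
import numpy as np

def prune_triplets(triplets, encode, f_pos=0.08, f_tri=0.24):
    """Drop the f_pos fraction of triplets with the largest anchor-positive
    distance, then the f_tri fraction with the smallest margin between
    anchor-negative and anchor-positive distances."""
    A = np.array([encode(a) for a, _, _ in triplets])
    P = np.array([encode(p) for _, p, _ in triplets])
    N = np.array([encode(n) for _, _, n in triplets])
    d_pos = np.linalg.norm(A - P, axis=1)
    # Keep the (1 - f_pos) fraction with the smallest anchor-positive distance.
    keep = np.argsort(d_pos)[: int(len(triplets) * (1 - f_pos))]
    margin = np.linalg.norm(A[keep] - N[keep], axis=1) - d_pos[keep]
    # Keep the (1 - f_tri) fraction with the largest margin.
    keep = keep[np.argsort(-margin)[: int(len(keep) * (1 - f_tri))]]
    return [triplets[i] for i in keep]
```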

The remaining triplets train the encoder under the standard triplet loss $$\mathcal{L}(A,P,N) = \max\left(\|v_A - v_P\|_2 - \|v_A - v_N\|_2 + m,\ 0\right)$$ with margin $m = 0.16$. Fine-tuning is typically performed for four epochs, with $n_{neg} = 2$ negatives per anchor and group size $n_s = 3$ sentences.
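
The loss itself is a one-liner on embedding vectors. In practice the encoder would be fine-tuned end-to-end in a deep learning framework; this numpy version only illustrates the formula:

```python
import numpy as np

def triplet_loss(v_a, v_p, v_n, m=0.16):
    """max(||v_A - v_P||_2 - ||v_A - v_N||_2 + m, 0), as in the formula above."""
    return float(max(np.linalg.norm(v_a - v_p) - np.linalg.norm(v_a - v_n) + m, 0.0))
```

The loss is zero whenever the negative is already at least $m$ farther from the anchor than the positive, so well-separated triplets contribute no gradient.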

Empirically, this process yields a fine-tuned encoder $A'$ whose embeddings outperform un-tuned BERT/SBERT on topic coherence and normalized mutual information (Schneider, 2024).

2. The SenClu EM-Style Bag-of-Sentences Topic Model

SenClu is a hard-EM, centroid-based topic model designed for sentence (or sentence-group) granularity. Each group $g$ in document $d$ is embedded ($v_g = A'(g)$), and topics $t$ are represented by cluster centroids $v_t$. The model assigns each group to one topic per epoch using an E/M-step approximation of the aspect model.

  • Probabilistic formulation:

$$p(g|d) = \sum_t p(g|t)\,p(t|d)$$

with likelihood proxy $h(g,t) := \cos(v_g, v_t)$.

  • Prior encoding:

$$p(t|d) = \frac{|A_{t,d}| + c}{|d| + k\,c}$$

where $|A_{t,d}|$ is the count of sentence groups assigned to topic $t$ in $d$, $|d|$ is the number of groups in $d$, $k$ is the number of topics, and $c$ is a smoothing constant annealed from $8$ down to a user-specified $\alpha \geq 0$.
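
The smoothed prior is straightforward to compute from per-document assignment counts; `topic_prior` is our name for this sketch:

```python
def topic_prior(counts, k, c):
    """Smoothed document-topic prior p(t|d) = (|A_{t,d}| + c) / (|d| + k*c),
    where counts[t] = |A_{t,d}| and |d| = total groups in the document."""
    n_d = sum(counts)
    return [(counts[t] + c) / (n_d + k * c) for t in range(k)]
```

With a large $c$ the prior stays near uniform regardless of the counts; as $c$ shrinks toward $0$, the prior approaches the empirical assignment frequencies.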

  • E-step: For each group, assign $t_{g,d} = \arg\max_t \{ h(g,t) \cdot p(t|d) \}$.
  • M-step: Update centroids and priors:

$$v_t = \frac{1}{\sum_d |A_{t,d}|} \sum_d \sum_{g : t_{g,d}=t} v_g \qquad p(t|d) = \frac{|A_{t,d}| + c_i}{|d| + k\,c_i}$$

where $c_i$ is annealed over epochs: $c_i = \max(c_{i-1}/2,\ \alpha)$.

Annealing toward $\alpha$ controls topic sparsity per document: a higher value enforces a broader topic mixture, while a lower value yields sparser, more focused assignments.

Convergence is typically achieved in 10 epochs; in early epochs, a second-best topic assignment is occasionally used to escape local optima. Computational complexity per epoch is $O(N\,k\,d_s)$ for $N$ groups, $k$ topics, and embedding dimension $d_s$.
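
Putting the E- and M-steps together, one epoch can be sketched in numpy. This is a simplified illustration under the definitions above (no second-best escape move), and all variable names are ours:

```python
import numpy as np

def senclu_epoch(V, doc_of, centroids, prior, c):
    """One hard-EM epoch. V holds unit-normalized group embeddings (one row per
    sentence group), doc_of[g] is the document index of group g, prior is the
    current p(t|d) matrix (docs x topics), c is the annealed smoothing constant."""
    k = centroids.shape[0]
    sims = V @ centroids.T                          # h(g,t) = cos(v_g, v_t)
    # E-step: hard-assign each group to argmax_t h(g,t) * p(t|d).
    assign = np.argmax(sims * prior[doc_of], axis=1)
    # M-step: recompute centroids as means of their assigned groups...
    for t in range(k):
        if (assign == t).any():
            centroids[t] = V[assign == t].mean(axis=0)
    # ...and update the smoothed document-topic priors.
    counts = np.zeros_like(prior)
    np.add.at(counts, (doc_of, assign), 1.0)
    doc_len = counts.sum(axis=1, keepdims=True)
    prior = (counts + c) / (doc_len + k * c)
    return assign, centroids, prior
```

Because the E-step is a single matrix product plus an argmax, the per-epoch cost matches the stated $O(N\,k\,d_s)$ bound.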

3. Integration of Fine-Tuned Embeddings and Downstream Benefits

Fine-tuned embeddings from FT-Topic are used in SenClu or any embedding-based clustering pipeline, replacing static LM features. Empirical studies show that FT-Topic+SenClu outperforms LDA, BERTopic, and TopClus in both PMI topic coherence and NMI clustering quality, with particularly large gains observed on the 20News, NYT, Gutenberg, and Yelp datasets (Schneider, 2024). Specifically, FT-Topic+SenClu yields PMI $\approx 0.79$ vs. $0.35$ for LDA on 20News, and NMI $\approx 0.47$ vs. $0.24$ for clustering alignment.

A summary table for key pipeline stages:

| Stage | Algorithmic Choice | Key Hyperparameters |
|---|---|---|
| Group construction | Non-overlapping $n$-sentence windows | $n_s = 3$ |
| FT-Topic triplet mining | Sequential positives, negative sampling, pruning | $f_{pos} = 0.08$, $f_{tri} = 0.24$, $m = 0.16$ |
| Encoder fine-tuning | Triplet loss, SGD | epochs $= 4$, $n_{neg} = 2$ |
| SenClu clustering | Hard-EM, cosine similarity, annealed prior | $k$ (topics), $\alpha$ |

Empirical ablation shows that embedding fine-tuning improves both PMI and NMI by 1–5 and 0.01–0.05 points, respectively. Filtering of triplets consistently provides further benefit over no filtering.

4. Control of Document-Topic Distributions via Prior Annealing

SenClu enables researchers to encode prior knowledge about document-topic diversity directly through the $\alpha$ parameter, analogous to the Dirichlet prior in LDA. A larger $\alpha$ yields more uniform document-topic distributions (more topics per document on average), while a smaller $\alpha$ results in sparser topic coverage, allowing explicit control over whether documents are guaranteed to be multi-topical or more focused.

Annealing $c_i$ from $c_0 = 8$ down to $\alpha$ during EM iterations improves optimization stability and avoids poor local optima (Schneider, 2024).
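
The halving schedule $c_i = \max(c_{i-1}/2,\ \alpha)$ can be written out directly; `anneal_schedule` is our name for this sketch:

```python
def anneal_schedule(c0=8.0, alpha=0.5, epochs=10):
    """Return the smoothing constants c_0, c_1, ..., halved each epoch and
    floored at alpha: c_i = max(c_{i-1} / 2, alpha)."""
    cs, c = [], c0
    for _ in range(epochs):
        cs.append(c)
        c = max(c / 2.0, alpha)
    return cs
```

Starting from $c_0 = 8$, the prior is nearly uniform in early epochs (stabilizing the centroids) and reaches the user-chosen $\alpha$ within a handful of halvings.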

5. Extensions: Sentence-Level and Streaming/Time-Evolving Topics

Adapting FT-Topic/SenClu to sentence-level clustering is straightforward: treat each sentence (or arbitrary short group) as a single “document,” build the sentence-by-vocabulary (tf-idf or embedding) matrix, and proceed identically through the pipeline (Murfi et al., 2021). For streaming or time-evolving corpora, incremental clustering or online fine-tuning of the encoder offers a promising direction, although a fully streaming variant of FT-Topic/SenClu is not yet established.

The framework is also compatible with alternative clustering objectives—for example, fuzzy c-means (DFCM) for soft topic memberships at the sentence level—with analogous improvements in topic coherence over NMF/LDA baselines (Murfi et al., 2021).

6. Computational Efficiency and Runtime Considerations

FT-Topic fine-tuning (on contemporary hardware) adds $20$–$50$ minutes to total pipeline runtime, while SenClu inference remains efficient ($2$–$6$ minutes for moderate-sized corpora). This is a significant speedup over embedding-based hierarchical clustering methods (e.g., TopClus: $>150$ minutes) (Schneider, 2024). Inference speed after fine-tuning is unaffected. For massive data, efficient batch clustering over sentence groups and hard assignments with EM-style iteration maintain practical scalability.

7. Empirical Performance and Qualitative Topic Interpretability

On canonical topic modeling benchmarks, FT-Topic/SenClu offers state-of-the-art or near-best performance among embedding-based and classical models. For 20News, SenClu achieves PMI $\approx 0.79$ (vs. $0.35$ for LDA) and NMI $\approx 0.47$ (vs. $0.24$). Similar relative improvements are observed across NYT, Gutenberg, and Yelp. Data filtering in FT-Topic demonstrably increases topic coherence and coverage (Table 6) (Schneider, 2024).

Qualitatively, topic clusters generated by FT-Topic/SenClu exhibit compact, interpretable word sets and sharply assign relevant sentences and groups, while giving users explicit control over multi-topic document representation. The approach extends naturally to pipelines for short or informal texts, such as tweets, where classical topic models underperform.

