Class-Based TF-IDF (c-TF-IDF)
- Class-based TF-IDF (c-TF-IDF) is a term weighting scheme that aggregates term frequencies over clusters, generating coherent and distinctive topic representations.
- It leverages aggregated counts and a log-scaled adjustment with average cluster length to emphasize cluster-specific, high-salience terms.
- Empirical evaluations show improved topic coherence across various datasets, underscoring its robustness in dynamic and static topic modeling applications.
Class-based TF-IDF (c-TF-IDF) is a term weighting scheme for topic modeling that generalizes standard TF-IDF from single documents to clusters, or "classes," of semantically similar documents. This approach was introduced in the context of the BERTopic model, where documents are first embedded using transformer-based models, clustered based on semantic similarity, and then described by high-scoring terms extracted using c-TF-IDF. The central aim of c-TF-IDF is to create coherent and distinctive topic representations by measuring term salience within a cluster relative to its prevalence across all clusters (Grootendorst, 2022).
1. Formal Definition and Mathematical Formulation
c-TF-IDF operates over a collection of $N$ documents partitioned into $K$ clusters (topics), with a vocabulary $V$ of unique terms. Key definitions:
- $\mathrm{tf}_{t,d}$: raw count of term $t$ in document $d$
- $\mathrm{tf}_{t,c} = \sum_{d \in c} \mathrm{tf}_{t,d}$: aggregated frequency of $t$ in cluster $c$
- $|c| = \sum_{t \in V} \mathrm{tf}_{t,c}$: total tokens in cluster $c$
- $A = \frac{1}{K} \sum_{c} |c|$: mean tokens per cluster
- $f_t = \sum_{c} \mathrm{tf}_{t,c}$: total occurrences of $t$ over the corpus
The c-TF-IDF score for term $t$ in cluster $c$ is:

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)$$

No additional smoothing is applied, and the $1 +$ inside the log ensures non-negativity. Optionally, each class vector $W_{\cdot,c}$ may be $L_1$-normalized.
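The formula above can be computed directly from raw token counts. A minimal pure-Python sketch (toy tokenized data, no preprocessing; the function name and input layout are illustrative):

```python
import math
from collections import Counter

def c_tf_idf(cluster_tokens):
    """Score every term in every cluster with W = tf * log(1 + A / f).

    cluster_tokens: dict mapping cluster id -> list of all tokens in
    that cluster (its documents concatenated).
    """
    # tf_{t,c}: aggregated term counts per cluster
    tf = {c: Counter(toks) for c, toks in cluster_tokens.items()}
    # A: mean number of tokens per cluster
    A = sum(len(toks) for toks in cluster_tokens.values()) / len(cluster_tokens)
    # f_t: total occurrences of each term over the whole corpus
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    return {c: {t: n * math.log(1 + A / f[t]) for t, n in counts.items()}
            for c, counts in tf.items()}

clusters = {
    0: "ball goal match goal referee".split(),
    1: "stock market price market bond".split(),
}
weights = c_tf_idf(clusters)
```

With this toy data, "goal" (twice in cluster 0, absent elsewhere) outscores "ball" (once in cluster 0), illustrating how the scheme rewards terms that are both frequent in a cluster and rare globally.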
2. Conceptual Motivation and Distinction from Standard TF–IDF
Standard TF-IDF quantifies how important a term is to a single document versus the entire corpus, using inverse document frequency (idf): $\mathrm{idf}_t = \log\frac{N}{\mathrm{df}_t}$, where $\mathrm{df}_t$ is the number of documents containing $t$ and $N$ is the total number of documents. In c-TF-IDF, the focus shifts to measuring a term's prominence within a cluster of documents (treated as a single pseudo-document), using the global term frequency $f_t$ in lieu of document frequency and scaling by the average class length $A$ to normalize across classes.
The table below summarizes the main contrasts:
| Quantity | Standard TF-IDF | c-TF-IDF |
|---|---|---|
| Document unit | single document $d$ | cluster/pseudo-document $c$ |
| Term frequency | $\mathrm{tf}_{t,d}$ | $\mathrm{tf}_{t,c}$ |
| Inverse frequency | $\log\frac{N}{\mathrm{df}_t}$ | $\log\left(1 + \frac{A}{f_t}\right)$ |
| Normalization | $L_2$ or cosine (post-weight) | optional $L_1$ (pre-smoothing/dynamic models) |
c-TF-IDF thus emphasizes terms that are frequent in one cluster yet infrequent globally, supporting highly specific topic descriptions.
3. Algorithmic Outline
The extraction of topic representations using c-TF-IDF in the BERTopic pipeline follows these steps:
- Embedding: Each document is embedded with a pre-trained transformer model (e.g., SBERT).
- Dimensionality Reduction: Embeddings are reduced with UMAP.
- Clustering: HDBSCAN assigns a cluster label $c(d)$ to each document $d$ (label $-1$ marks outliers).
- Term Aggregation: For each cluster $c$:
  - Compute $\mathrm{tf}_{t,c}$ by summing term counts over all documents $d$ with $c(d) = c$.
  - Compute cluster size $|c|$.
- Global Stats: Calculate $A$ (mean tokens per cluster) and $f_t$ (corpus-wide term counts).
- Weight Calculation: Compute $W_{t,c} = \mathrm{tf}_{t,c} \cdot \log(1 + A/f_t)$ for all $t$, $c$.
- (Optional) Normalization: Apply $L_1$ normalization to each class vector $W_{\cdot,c}$.
- Topic Representation: For each $c$, select the top $k$ terms by $W_{t,c}$ as topic descriptors.
Key hyperparameters include $k$ (topic words per topic), the minimum cluster size, and minimum term frequency thresholds (Grootendorst, 2022). Preprocessing recommendations include stopword removal, non-alphabetic token filtering, and lemmatization.
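The aggregation-through-representation steps (the stages after clustering) can be sketched without the embedding pipeline, assuming documents arrive already tokenized and cluster-labeled (a hypothetical toy setup; the function name is illustrative):

```python
import math
from collections import Counter, defaultdict

def topic_words(docs, labels, k=3):
    """Aggregate term counts per cluster, compute the global statistics
    A and f_t, score W_{t,c}, and return the top-k terms per cluster."""
    tf = defaultdict(Counter)              # tf[c][t]: term counts per cluster
    size = Counter()                       # |c|: total tokens per cluster
    for tokens, c in zip(docs, labels):
        if c == -1:                        # skip HDBSCAN's outlier bucket
            continue
        tf[c].update(tokens)
        size[c] += len(tokens)
    A = sum(size.values()) / len(size)     # mean tokens per cluster
    f = Counter()                          # f_t: corpus-wide term counts
    for counts in tf.values():
        f.update(counts)
    top = {}
    for c, counts in tf.items():
        scored = {t: n * math.log(1 + A / f[t]) for t, n in counts.items()}
        top[c] = [t for t, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]
    return top

docs = ["goal match goal".split(), "referee match ball".split(),
        "stock price stock".split(), "bond price market".split()]
top = topic_words(docs, labels=[0, 0, 1, 1], k=2)
```

On this toy corpus, the cluster-exclusive, high-count terms ("goal"/"match" and "stock"/"price") surface as the topic descriptors.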
4. Integration in BERTopic and Practical Recommendations
Within BERTopic, c-TF-IDF serves as the final stage to derive representative keywords for each topic discovered via clustering of semantic embeddings. The procedure leverages:
- Pre-trained SBERT variants (e.g., all-mpnet-base-v2) for embeddings.
- UMAP for dimensionality reduction.
- HDBSCAN for density-based clustering with adjustable granularity (typical min_cluster_size: 5–15).
- Preprocessing steps to reduce noise and sparsity, including lemmatization and rare-term filtering.
- Post-processing methods such as merging small topics by c-TF-IDF vector proximity for more robust topic sets.
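The last bullet can be sketched as follows: an undersized topic is reassigned to whichever larger topic has the most similar c-TF-IDF vector, here using cosine similarity over sparse term-weight dicts (function names and data are illustrative, not BERTopic's internal API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_target(small_topic, big_topics):
    """Id of the larger topic whose c-TF-IDF vector is closest."""
    return max(big_topics, key=lambda c: cosine(small_topic, big_topics[c]))

small = {"goal": 2.0, "match": 1.0}            # undersized topic's vector
big = {0: {"goal": 1.5, "referee": 0.5},       # sports-like topic
       1: {"stock": 2.0, "price": 1.0}}        # finance-like topic
```

Here the undersized sports-flavored topic would be merged into topic 0, whose vector shares the dominant term "goal".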
In dynamic topic modeling, the global idf term $\log(1 + A/f_t)$ can be reused across time slices, while $\mathrm{tf}_{t,c}$ is recalculated per slice to reflect evolving clusters. Smoothing of topic vectors is accomplished by temporally averaging the normalized c-TF-IDF scores of adjacent slices.
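The temporal averaging can be sketched as follows, assuming each time slice yields one sparse c-TF-IDF vector per topic and each slice is averaged with its predecessor (a simplified version of the smoothing described above; names and data are illustrative):

```python
def l1_normalize(vec):
    """Scale a sparse vector (term -> weight) so its entries sum to 1."""
    s = sum(vec.values())
    return {t: w / s for t, w in vec.items()} if s else dict(vec)

def smooth(slices):
    """Average each slice's normalized vector with the previous slice's."""
    normed = [l1_normalize(v) for v in slices]
    out = [normed[0]]                       # first slice has no predecessor
    for prev, cur in zip(normed, normed[1:]):
        terms = set(prev) | set(cur)
        out.append({t: (prev.get(t, 0.0) + cur.get(t, 0.0)) / 2 for t in terms})
    return out

slices = [{"economy": 2.0, "market": 2.0},  # topic vector at time t-1
          {"economy": 4.0}]                 # topic vector at time t
smoothed = smooth(slices)
```

Terms that vanish between slices (here "market") decay gradually rather than disappearing, which stabilizes topic descriptions over time.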
5. Empirical Evaluation of c-TF-IDF
Empirical results reported for c-TF-IDF in BERTopic demonstrate strong topic coherence across multiple benchmarks:
- On 20 Newsgroups: NPMI = 0.166 (vs. LDA 0.058, NMF 0.089, Top2Vec–SBERT 0.068, CTM 0.096)
- On BBC News: 0.167 (vs. LDA 0.014, NMF 0.012, Top2Vec–SBERT –0.027, CTM 0.094)
- On Trump tweets: 0.066 (vs. LDA –0.011, NMF 0.009, Top2Vec–Doc2Vec –0.169, CTM 0.009)
Across a range of embedding models (USE, Doc2Vec, MiniLM, MPNET), c-TF-IDF maintains stability in both coherence and topic diversity. In dynamic topic modeling scenarios, BERTopic with c-TF-IDF outperforms LDA Sequence, achieving higher NPMI coherence (for example, 0.079 vs. 0.009 on Trump data).
A plausible implication is that clustering contextual embeddings followed by c-TF-IDF scoring produces interpretable topics that are more coherent than those produced by classical topic models, and that this framework is robust to changes in document embedding models (Grootendorst, 2022).
6. Applicability, Limitations, and Usage Recommendations
Practical usage of c-TF-IDF centers on topic modeling pipelines seeking fast, coherent, and semantically consistent topic representations from clustered embeddings. Key recommendations include:
- Selecting embedding models fine-tuned for semantic similarity.
- Keeping clustering hyperparameters fixed for comparability.
- Employing aggressive stopword, noise, and rare-term filtering for stability.
- Optionally merging undersized topics iteratively based on c-TF-IDF similarity.
- Leveraging c-TF-IDF in dynamic topic modeling by persisting the global idf term across time slices while recomputing term frequencies per slice.
No significant controversies or conceptual limitations specific to c-TF-IDF have been reported; however, its dependence on high-quality document embeddings and clustering underscores the importance of upstream components. Effectiveness may therefore be constrained by how well the chosen embedding model and clustering algorithm suit a given corpus or application context (Grootendorst, 2022).