Class-Based TF-IDF (c-TF-IDF)
- Class-based TF-IDF (c-TF-IDF) is a term weighting scheme that aggregates term frequencies over clusters, generating coherent and distinctive topic representations.
- It leverages aggregated counts and a log-scaled adjustment with average cluster length to emphasize cluster-specific, high-salience terms.
- Empirical evaluations show improved topic coherence across various datasets, underscoring its robustness in dynamic and static topic modeling applications.
Class-based TF-IDF (c-TF-IDF) is a term weighting scheme for topic modeling that generalizes standard TF-IDF from single documents to clusters, or "classes," of semantically similar documents. This approach was introduced in the context of the BERTopic model, where documents are first embedded using transformer-based models, clustered based on semantic similarity, and then described by high-scoring terms extracted using c-TF-IDF. The central aim of c-TF-IDF is to create coherent and distinctive topic representations by measuring term salience within a cluster relative to its prevalence across all clusters (Grootendorst, 2022).
1. Formal Definition and Mathematical Formulation
c-TF-IDF operates over a collection of $N$ documents partitioned into $K$ clusters (topics), with a vocabulary $V$ of unique terms. Key definitions:
- $\mathrm{tf}_{t,d}$: raw count of term $t$ in document $d$
- $\mathrm{tf}_{t,c} = \sum_{d \in c} \mathrm{tf}_{t,d}$: aggregated frequency of $t$ in cluster $c$
- $|c| = \sum_{t \in V} \mathrm{tf}_{t,c}$: total tokens in cluster $c$
- $A = \frac{1}{K} \sum_{c} |c|$: mean tokens per cluster
- $f_t = \sum_{c} \mathrm{tf}_{t,c}$: total occurrences of $t$ over the corpus
The c-TF-IDF score for term $t$ in cluster $c$ is:

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)$$

No additional smoothing is applied, and the $1 +$ inside the log ensures non-negativity. Optionally, each class vector $W_{\cdot,c}$ may be $L_1$-normalized.
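The formula above can be computed directly from raw token counts. A minimal pure-Python sketch (toy tokenized data, no preprocessing; the function name and input layout are illustrative):

```python
import math
from collections import Counter

def c_tf_idf(cluster_tokens):
    """Score every term in every cluster with W = tf * log(1 + A / f).

    cluster_tokens: dict mapping cluster id -> list of all tokens in
    that cluster (its documents concatenated).
    """
    # tf_{t,c}: aggregated term counts per cluster
    tf = {c: Counter(toks) for c, toks in cluster_tokens.items()}
    # A: mean number of tokens per cluster
    A = sum(len(toks) for toks in cluster_tokens.values()) / len(cluster_tokens)
    # f_t: total occurrences of each term over the whole corpus
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    return {c: {t: n * math.log(1 + A / f[t]) for t, n in counts.items()}
            for c, counts in tf.items()}

clusters = {
    0: "ball goal match goal referee".split(),
    1: "stock market price market bond".split(),
}
weights = c_tf_idf(clusters)
```

With this toy data, "goal" (twice in cluster 0, absent elsewhere) outscores "ball" (once in cluster 0), illustrating how the scheme rewards terms that are both frequent in a cluster and rare globally.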
2. Conceptual Motivation and Distinction from Standard TF–IDF
Standard TF-IDF quantifies how important a term is to a single document versus the entire corpus, using inverse document frequency (idf): $\mathrm{idf}_t = \log\frac{N}{\mathrm{df}_t}$, where $\mathrm{df}_t$ is the number of documents containing $t$ and $N$ is the total number of documents. In c-TF-IDF, the focus shifts to measuring a term's prominence within a cluster of documents (treated as a single pseudo-document), using the global term frequency $f_t$ in lieu of document frequency and scaling by the average class length $A$ to normalize across classes.
The table below summarizes the main contrasts:
| Quantity | Standard TF-IDF | c-TF-IDF |
|---|---|---|
| Document unit | single document $d$ | cluster/pseudo-document $c$ |
| Term frequency | $\mathrm{tf}_{t,d}$ | $\mathrm{tf}_{t,c}$ |
| Inverse frequency | $\log\frac{N}{\mathrm{df}_t}$ | $\log\left(1 + \frac{A}{f_t}\right)$ |
| Normalization | $L_2$ or cosine (post-weight) | optional $L_1$ (pre-smoothing/dynamic models) |
c-TF-IDF thus emphasizes terms that are frequent in one cluster yet infrequent globally, supporting highly specific topic descriptions.
3. Algorithmic Outline
The extraction of topic representations using c-TF-IDF in the BERTopic pipeline follows these steps:
- Embedding: Each document is embedded with a pre-trained transformer model (e.g., SBERT).
- Dimensionality Reduction: Embeddings are reduced with UMAP.
- Clustering: HDBSCAN assigns a cluster label $c(d)$ to each document $d$ (label $-1$ marks outliers).
- Term Aggregation: For each cluster $c$:
  - Compute $\mathrm{tf}_{t,c}$ by summing term counts over all documents $d$ with $c(d) = c$.
  - Compute cluster size $|c|$.
- Global Stats: Calculate $A$ (mean tokens per cluster) and $f_t$ (corpus-wide term counts).
- Weight Calculation: Compute $W_{t,c} = \mathrm{tf}_{t,c} \cdot \log(1 + A/f_t)$ for all $t$, $c$.
- (Optional) Normalization: Apply $L_1$ normalization to each class vector $W_{\cdot,c}$.
- Topic Representation: For each $c$, select the top $k$ terms by $W_{t,c}$ as topic descriptors.
Key hyperparameters include $k$ (topic words per topic), the minimum cluster size, and minimum term frequency thresholds (Grootendorst, 2022). Preprocessing recommendations include stopword removal, non-alphabetic token filtering, and lemmatization.
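The aggregation-through-representation steps (the stages after clustering) can be sketched without the embedding pipeline, assuming documents arrive already tokenized and cluster-labeled (a hypothetical toy setup; the function name is illustrative):

```python
import math
from collections import Counter, defaultdict

def topic_words(docs, labels, k=3):
    """Aggregate term counts per cluster, compute the global statistics
    A and f_t, score W_{t,c}, and return the top-k terms per cluster."""
    tf = defaultdict(Counter)              # tf[c][t]: term counts per cluster
    size = Counter()                       # |c|: total tokens per cluster
    for tokens, c in zip(docs, labels):
        if c == -1:                        # skip HDBSCAN's outlier bucket
            continue
        tf[c].update(tokens)
        size[c] += len(tokens)
    A = sum(size.values()) / len(size)     # mean tokens per cluster
    f = Counter()                          # f_t: corpus-wide term counts
    for counts in tf.values():
        f.update(counts)
    top = {}
    for c, counts in tf.items():
        scored = {t: n * math.log(1 + A / f[t]) for t, n in counts.items()}
        top[c] = [t for t, _ in sorted(scored.items(), key=lambda x: -x[1])[:k]]
    return top

docs = ["goal match goal".split(), "referee match ball".split(),
        "stock price stock".split(), "bond price market".split()]
top = topic_words(docs, labels=[0, 0, 1, 1], k=2)
```

On this toy corpus, the cluster-exclusive, high-count terms ("goal"/"match" and "stock"/"price") surface as the topic descriptors.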
4. Integration in BERTopic and Practical Recommendations
Within BERTopic, c-TF-IDF serves as the final stage to derive representative keywords for each topic discovered via clustering of semantic embeddings. The procedure leverages:
- Pre-trained SBERT variants (e.g., all-mpnet-base-v2) for embeddings.
- UMAP for dimensionality reduction.
- HDBSCAN for density-based clustering with adjustable granularity (typical min_cluster_size: 5–15).
- Preprocessing steps to reduce noise and sparsity, including lemmatization and rare-term filtering.
- Post-processing methods such as merging small topics by c-TF-IDF vector proximity for more robust topic sets.
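The last bullet can be sketched as follows: an undersized topic is reassigned to whichever larger topic has the most similar c-TF-IDF vector, here using cosine similarity over sparse term-weight dicts (function names and data are illustrative, not BERTopic's internal API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_target(small_topic, big_topics):
    """Id of the larger topic whose c-TF-IDF vector is closest."""
    return max(big_topics, key=lambda c: cosine(small_topic, big_topics[c]))

small = {"goal": 2.0, "match": 1.0}            # undersized topic's vector
big = {0: {"goal": 1.5, "referee": 0.5},       # sports-like topic
       1: {"stock": 2.0, "price": 1.0}}        # finance-like topic
```

Here the undersized sports-flavored topic would be merged into topic 0, whose vector shares the dominant term "goal".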
In dynamic topic modeling, the global idf term $\log(1 + A/f_t)$ can be reused across time slices, while $\mathrm{tf}_{t,c}$ is recalculated per slice to reflect evolving clusters. Smoothing of topic vectors is accomplished by temporally averaging the normalized c-TF-IDF scores of adjacent slices.
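The temporal averaging can be sketched as follows, assuming each time slice yields one sparse c-TF-IDF vector per topic and each slice is averaged with its predecessor (a simplified version of the smoothing described above; names and data are illustrative):

```python
def l1_normalize(vec):
    """Scale a sparse vector (term -> weight) so its entries sum to 1."""
    s = sum(vec.values())
    return {t: w / s for t, w in vec.items()} if s else dict(vec)

def smooth(slices):
    """Average each slice's normalized vector with the previous slice's."""
    normed = [l1_normalize(v) for v in slices]
    out = [normed[0]]                       # first slice has no predecessor
    for prev, cur in zip(normed, normed[1:]):
        terms = set(prev) | set(cur)
        out.append({t: (prev.get(t, 0.0) + cur.get(t, 0.0)) / 2 for t in terms})
    return out

slices = [{"economy": 2.0, "market": 2.0},  # topic vector at time t-1
          {"economy": 4.0}]                 # topic vector at time t
smoothed = smooth(slices)
```

Terms that vanish between slices (here "market") decay gradually rather than disappearing, which stabilizes topic descriptions over time.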
5. Empirical Evaluation of c-TF-IDF
Empirical results reported for c-TF-IDF in BERTopic demonstrate strong topic coherence across multiple benchmarks:
- On 20 Newsgroups: NPMI = 0.166 (vs. LDA 0.058, NMF 0.089, Top2Vec–SBERT 0.068, CTM 0.096)
- On BBC News: 0.167 (vs. LDA 0.014, NMF 0.012, Top2Vec–SBERT –0.027, CTM 0.094)
- On Trump tweets: 0.066 (vs. LDA –0.011, NMF 0.009, Top2Vec–Doc2Vec –0.169, CTM 0.009)
Across a range of embedding models (USE, Doc2Vec, MiniLM, MPNET), c-TF-IDF maintains stability in both coherence and topic diversity. In dynamic topic modeling scenarios, BERTopic with c-TF-IDF outperforms LDA Sequence, achieving higher NPMI coherence (for example, 0.079 vs. 0.009 on Trump data).
A plausible implication is that clustering contextual embeddings followed by c-TF-IDF scoring produces interpretable topics that are more coherent than those produced by classical topic models, and that this framework is robust to changes in document embedding models (Grootendorst, 2022).
6. Applicability, Limitations, and Usage Recommendations
Practical usage of c-TF-IDF centers on topic modeling pipelines seeking fast, coherent, and semantically consistent topic representations from clustered embeddings. Key recommendations include:
- Selecting embedding models fine-tuned for semantic similarity.
- Keeping clustering hyperparameters fixed for comparability.
- Employing aggressive stopword, noise, and rare-term filtering for stability.
- Optionally merging undersized topics iteratively based on c-TF-IDF similarity.
- Leveraging c-TF-IDF in dynamic topic modeling by persisting the global idf term across time slices while recomputing term frequencies per slice.
No significant controversies or conceptual limitations specific to c-TF-IDF have been reported; however, its dependence on high-quality document embeddings and clustering underscores the importance of upstream components. Effectiveness may therefore be constrained by how well the chosen embedding model and clustering algorithm suit a given corpus or application context (Grootendorst, 2022).