
Class-Based TF-IDF (c-TF-IDF)

Updated 20 January 2026
  • Class-based TF-IDF (c-TF-IDF) is a term weighting scheme that aggregates term frequencies over clusters, generating coherent and distinctive topic representations.
  • It leverages aggregated counts and a log-scaled adjustment with average cluster length to emphasize cluster-specific, high-salience terms.
  • Empirical evaluations show improved topic coherence across various datasets, underscoring its robustness in dynamic and static topic modeling applications.

Class-based TF-IDF (c-TF-IDF) is a term weighting scheme for topic modeling that generalizes standard TF-IDF from single documents to clusters, or "classes," of semantically similar documents. This approach was introduced in the context of the BERTopic model, where documents are first embedded using transformer-based models, clustered based on semantic similarity, and then described by high-scoring terms extracted using c-TF-IDF. The central aim of c-TF-IDF is to create coherent and distinctive topic representations by measuring term salience within a cluster relative to its prevalence across all clusters (Grootendorst, 2022).

1. Formal Definition and Mathematical Formulation

c-TF-IDF operates over a collection of $D$ documents partitioned into $C$ clusters (topics), with $V$ the vocabulary of unique terms. Key definitions:

  • $tf_{t,d}$: raw count of term $t$ in document $d$
  • $tf_{t,c} = \sum_{d \in \mathcal{C}_c} tf_{t,d}$: aggregated frequency of $t$ in cluster $c$
  • $T_c = \sum_{t \in V} tf_{t,c}$: total tokens in cluster $c$
  • $A = \frac{1}{C}\sum_{c=1}^{C} T_c$: mean tokens per cluster
  • $tf_t = \sum_{c=1}^{C} tf_{t,c}$: total occurrences of $t$ over the corpus

The c-TF-IDF score for term $t$ in cluster $c$ is:

$$W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_t}\right)$$

No additional smoothing is applied, and the log ensures non-negativity. Optionally, each class vector $W_{:,c}$ may be $L_1$-normalized.
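The weighting above can be computed directly from a cluster-by-vocabulary count matrix. The following is a minimal NumPy sketch (the guard against zero global counts is an implementation choice, not part of the formula):

```python
import numpy as np

def c_tf_idf(tf):
    """Compute c-TF-IDF weights from a (C x V) matrix of
    per-cluster term counts tf[c, t]."""
    T_c = tf.sum(axis=1)                      # total tokens per cluster, T_c
    A = T_c.mean()                            # average cluster length, A
    tf_t = tf.sum(axis=0)                     # global term counts, tf_t
    idf = np.log1p(A / np.maximum(tf_t, 1))   # log(1 + A / tf_t)
    return tf * idf                           # idf broadcasts over clusters

# Toy example: 2 clusters, vocabulary of 3 terms
tf = np.array([[4, 0, 1],
               [0, 5, 1]], dtype=float)
W = c_tf_idf(tf)
```

Terms concentrated in a single cluster (columns 0 and 1 above) receive a higher weight than the term shared by both clusters (column 2), which is exactly the salience behavior the formula is designed for.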

2. Conceptual Motivation and Distinction from Standard TF-IDF

Standard TF-IDF quantifies how important a term is to a single document relative to the entire corpus, using the inverse document frequency $idf_t = \log(N/df_t)$, where $N$ is the number of documents and $df_t$ is the number of documents containing $t$. In c-TF-IDF, the focus shifts to measuring a term's prominence within a cluster of documents treated as a single pseudo-document, using the global frequency $tf_t$ in place of document frequency and scaling by the average class length $A$.

The table below summarizes the main contrasts:

Quantity | Standard TF-IDF | c-TF-IDF
Document unit | single document $d$ | cluster/pseudo-document $c$
Term frequency | $tf_{t,d}$ | $tf_{t,c} = \sum_{d \in c} tf_{t,d}$
Inverse frequency | $\log(N/df_t)$ | $\log(1 + A/tf_t)$
Normalization | $L_2$ or $L_1$ (post-weighting) | optional $L_1$ (before temporal smoothing in dynamic models)

c-TF-IDF thus emphasizes terms that are frequent in one cluster yet infrequent globally, supporting highly specific topic descriptions.

3. Algorithmic Outline

The extraction of topic representations using c-TF-IDF in the BERTopic pipeline follows these steps:

  1. Embedding: Each document $d$ is embedded with a pre-trained transformer model (e.g., SBERT).
  2. Dimensionality Reduction: Embeddings are reduced with UMAP.
  3. Clustering: HDBSCAN assigns cluster labels $\ell_d \in \{1, \ldots, C\}$.
  4. Term Aggregation: For each cluster $c$:
    • Compute $tf_{t,c}$ by summing term counts over all $d$ with $\ell_d = c$.
    • Compute the total token count $T_c$.
  5. Global Stats: Calculate $A = \frac{1}{C}\sum_{c=1}^{C} T_c$ and $tf_t = \sum_{c=1}^{C} tf_{t,c}$.
  6. Weight Calculation: Compute $W_{t,c} = tf_{t,c} \cdot \log(1 + A/tf_t)$ for all $t, c$.
  7. (Optional) Normalization: Apply $L_1$ normalization to each column $W_{:,c}$.
  8. Topic Representation: For each $c$, select the top $n$ terms by $W_{t,c}$ as topic descriptors.

Key hyperparameters include top_n (topic words per topic), min_cluster_size, and minimum term frequency thresholds (Grootendorst, 2022). Preprocessing recommendations include stopword removal, non-alphabetic token filtering, and lemmatization.
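Steps 4 through 8 can be sketched end to end given precomputed cluster labels. The sketch below uses naive whitespace tokenization for self-containment; a real pipeline would use a proper vectorizer with the preprocessing steps above, and the labels would come from the upstream embedding + UMAP + HDBSCAN stages:

```python
from collections import Counter
import numpy as np

def topic_words(docs, labels, top_n=3):
    """Aggregate term counts per cluster, apply c-TF-IDF
    weighting, and return top_n descriptors per topic."""
    # Step 4: per-cluster term aggregation (naive tokenization)
    counts = {}
    for doc, c in zip(docs, labels):
        counts.setdefault(c, Counter()).update(doc.lower().split())
    vocab = sorted({t for ctr in counts.values() for t in ctr})
    clusters = sorted(counts)
    tf = np.array([[counts[c][t] for t in vocab] for c in clusters], float)
    # Step 5: global statistics A and tf_t
    A = tf.sum(axis=1).mean()
    idf = np.log1p(A / np.maximum(tf.sum(axis=0), 1))
    # Step 6: weight matrix W
    W = tf * idf
    # Step 8: top-n terms per cluster
    return {c: [vocab[j] for j in np.argsort(W[i])[::-1][:top_n]]
            for i, c in enumerate(clusters)}

docs = ["cats purr and meow", "dogs bark loudly",
        "kittens meow softly", "puppies bark and play"]
topics = topic_words(docs, labels=[0, 1, 0, 1])
```

Terms repeated within one cluster but rare elsewhere ("meow", "bark") rank first, while the shared token "and" is down-weighted by its higher global count.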

4. Integration in BERTopic and Practical Recommendations

Within BERTopic, c-TF-IDF serves as the final stage to derive representative keywords for each topic discovered via clustering of semantic embeddings. The procedure leverages:

  • Pre-trained SBERT variants (e.g., all-mpnet-base-v2) for embeddings.
  • UMAP for dimensionality reduction.
  • HDBSCAN for density-based clustering with adjustable granularity (typical min_cluster_size: 5–15).
  • Preprocessing steps to reduce noise and sparsity, including lemmatization and rare-term filtering.
  • Post-processing methods such as merging small topics by c-TF-IDF vector proximity for more robust topic sets.

In dynamic topic modeling, the global idf term can be reused across time slices, while $tf_{t,c}$ is recalculated to reflect evolving clusters. Smoothing of topic vectors is accomplished by temporally averaging normalized c-TF-IDF scores.
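The temporal averaging can be sketched as follows, assuming one c-TF-IDF vector per time slice for a given topic (the exact averaging window is a modeling choice; here each slice is averaged with its immediate predecessor):

```python
import numpy as np

def smooth_over_time(W_slices):
    """L1-normalize each time slice's c-TF-IDF vector, then
    average each slice with its predecessor to smooth the
    topic's evolution."""
    norm = [w / max(np.abs(w).sum(), 1e-12) for w in W_slices]
    return [norm[0]] + [(prev + cur) / 2
                        for prev, cur in zip(norm, norm[1:])]

slices = [np.array([2.0, 2.0]), np.array([4.0, 0.0])]
smoothed = smooth_over_time(slices)
```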

5. Empirical Evaluation of c-TF-IDF

Empirical results reported for c-TF-IDF in BERTopic demonstrate strong topic coherence across multiple benchmarks:

  • On 20 Newsgroups: NPMI = 0.166 (vs. LDA 0.058, NMF 0.089, Top2Vec–SBERT 0.068, CTM 0.096)
  • On BBC News: 0.167 (vs. LDA 0.014, NMF 0.012, Top2Vec–SBERT –0.027, CTM 0.094)
  • On Trump tweets: 0.066 (vs. LDA –0.011, NMF 0.009, Top2Vec–Doc2Vec –0.169, CTM 0.009)

Across a range of embedding models (USE, Doc2Vec, MiniLM, MPNET), c-TF-IDF maintains stability in both coherence and topic diversity. In dynamic topic modeling scenarios, BERTopic with c-TF-IDF outperforms LDA Sequence, achieving higher NPMI coherence (for example, 0.079 vs. 0.009 on Trump data).

A plausible implication is that clustering contextual embeddings followed by c-TF-IDF scoring produces interpretable topics that are more coherent than those produced by classical topic models, and that this framework is robust to changes in document embedding models (Grootendorst, 2022).

6. Applicability, Limitations, and Usage Recommendations

Practical usage of c-TF-IDF centers on topic modeling pipelines seeking fast, coherent, and semantically consistent topic representations from clustered embeddings. Key recommendations include:

  • Selecting embedding models fine-tuned for semantic similarity.
  • Keeping clustering hyperparameters fixed for comparability.
  • Employing aggressive stopword, noise, and rare-term filtering for stability.
  • Optionally merging undersized topics iteratively based on c-TF-IDF similarity.
  • Leveraging c-TF-IDF in dynamic topic modeling by persisting the idf term across time slices and recomputing $tf_{t,c}$ per slice.
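The merge step for undersized topics can be sketched by selecting the smallest topic and its nearest neighbor under cosine similarity of c-TF-IDF vectors; the caller would then combine the two topics' document sets and recompute weights (the function name and interface here are illustrative, not from the source):

```python
import numpy as np

def merge_candidate(W, sizes):
    """Return (smallest topic index, index of the topic whose
    c-TF-IDF vector is most cosine-similar to it)."""
    small = int(np.argmin(sizes))
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit rows
    sims = Wn @ Wn[small]                              # cosine sims
    sims[small] = -np.inf                              # exclude self
    return small, int(np.argmax(sims))

# Topic 2 is smallest and points in nearly the same direction as topic 0
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
pair = merge_candidate(W, sizes=[10, 10, 2])
```

Iterating this until all topics exceed a minimum size yields the more robust topic sets mentioned above.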

No evidence is reported regarding controversies or significant conceptual limitations specific to the c-TF-IDF approach in topic modeling; however, its dependence on high-quality document embeddings and clustering quality underlines the importance of upstream components. This suggests that effectiveness may be constrained by the suitability of embedding and clustering choices for a given corpus or application context (Grootendorst, 2022).

References

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794.
