
Class-Based TF–IDF: Methods & Impact

Updated 2 March 2026
  • Class-based TF–IDF is a term-weighting scheme that builds on traditional TF–IDF by aggregating documents into classes to measure term importance based on local representativeness and global discriminativeness.
  • It aggregates texts into pseudo-documents per class and recalculates term frequencies to highlight keywords that are unique to each group.
  • Applied in topic modeling (e.g., BERTopic) and supervised sentiment analysis, it improves topic coherence and classification performance through efficient, class-aware weighting.

Class-based TF–IDF refers to a family of term-weighting schemes that extend the classical TF–IDF (Term Frequency–Inverse Document Frequency) paradigm by incorporating group or class structure when measuring the importance of terms. Unlike standard TF–IDF, which computes distinctiveness of terms at the level of individual documents, class-based variants operate at the level of clusters or classes: pseudo-documents formed by aggregation over topic, cluster, or label membership. This approach has proven especially effective in modern topic modeling (notably in BERTopic), as well as in supervised settings such as sentiment analysis, where class-conditional distributions of terms are exploited for greater discriminative power (Grootendorst, 2022; Carvalho et al., 2020).

1. Theoretical Formulation

Classical TF–IDF assigns to each term $t$ in document $d$ the weight

$$W_{t,d} = tf_{t,d} \times \log\left(\frac{N}{df_t}\right)$$

where $tf_{t,d}$ is the term frequency of $t$ in $d$, $df_t$ is the number of documents containing $t$, and $N$ is the total number of documents.
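As a point of reference for the class-based variants, the classical weight can be sketched in a few lines of Python. This is a minimal illustration only: the function name `tf_idf`, the whitespace tokenizer, and the natural logarithm are assumptions, not part of the cited work.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with W = tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]   # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)                    # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(["apple apple banana", "apple cucumber cucumber"])
```

A term that occurs in every document, such as "apple" here, receives weight zero because $\log(N/df_t) = \log 1 = 0$.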

Class-based TF–IDF methods generalize this by (i) aggregating documents within a class (cluster, topic, or label), thereby treating each class $c$ as a pseudo-document, and (ii) redefining the inverse-frequency component to capture between-class rather than global distinctiveness. Two major instantiations are prominent:

a) BERTopic’s c-TF-IDF:

$$W_{t,c} = tf_{t,c} \times \log\left(1+\frac{A}{tf_{t}}\right)$$

with $tf_{t,c}$ the frequency of $t$ in class $c$, $tf_t$ the global frequency of $t$ across all classes, and $A$ the average number of word tokens per class. This formulation replaces the per-document perspective with a per-cluster approach; the inverse class frequency (ICF) penalizes terms by their prevalence across all classes (Grootendorst, 2022).

b) TF-IDFC-RF for Sentiment Analysis:

$$W_{\mathrm{TF.IDFC.RF}}(t_i,d_j) = \sqrt{\mathrm{TF}(t_i,d_j)} \cdot \log_2\left(\frac{2+\max(A,C)}{\max(2, \min(A,C))}\right) \cdot \sqrt{B+D}$$

where $A$, $B$, $C$, $D$ represent document counts (per a term–class contingency table), and the weighting emphasizes terms that are both class-discriminative and globally rare (Carvalho et al., 2020).
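The TF-IDFC-RF formula can be evaluated directly once the four counts are known. The sketch below is illustrative: the text does not spell out which cell of the contingency table each letter denotes, so the assignment in the docstring is an assumption (chosen so that $B + D$ counts documents lacking the term), and the function name `tf_idfc_rf` is hypothetical.

```python
import math

def tf_idfc_rf(tf, a, b, c, d):
    """Weight of one term in one document, following the formula above.

    Assumed meaning of the contingency counts (not stated in the text):
      a: positive-class documents containing the term
      b: positive-class documents not containing the term
      c: negative-class documents containing the term
      d: negative-class documents not containing the term
    """
    contrast = math.log2((2 + max(a, c)) / max(2, min(a, c)))  # class-contrast log-ratio
    absence = math.sqrt(b + d)                                  # documents lacking the term
    return math.sqrt(tf) * contrast * absence

# Same term frequency and same number of documents lacking the term (64),
# but very different class balance.
discriminative = tf_idfc_rf(tf=4, a=30, b=16, c=2, d=48)   # mostly in one class
shared = tf_idfc_rf(tf=4, a=16, b=30, c=16, d=34)          # evenly spread
```

Under these assumptions the class-skewed term receives a much larger weight than the evenly spread one, and swapping the two classes leaves the weight unchanged because the log-ratio uses only $\max(A,C)$ and $\min(A,C)$.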

2. Methodological Principles and Implementation

The core procedure underlying class-based TF–IDF involves the following canonical workflow:

  1. Group aggregation: Documents are partitioned by cluster or class; all texts within a group are concatenated.
  2. Term frequency computation: For each class $c$, compute the frequency $tf_{t,c}$ of each term $t$.
  3. Global statistics: Compute for each term $t$ its cumulative frequency $tf_t$ across all classes; compute class-level global scalars such as average class length $A$ as needed.
  4. Scoring: Apply the class-based TF–IDF weighting equation to yield a term–class matrix $W$.
  5. Interpretation: For topic modeling, top-$n$ words per class are identified by sorting $W$ within each class.
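The five steps above can be sketched in plain Python using the c-TF-IDF weighting from Section 1. This is a minimal illustration with raw counts, a whitespace tokenizer, and the natural logarithm, all of which are assumptions; it is not BERTopic's actual implementation, and the function name `c_tf_idf` is hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(docs, labels, top_n=3):
    """Return {class: [(term, weight), ...]} with the top_n terms per class."""
    # 1. Group aggregation: pool the tokens of all documents sharing a label.
    class_tokens = {}
    for doc, label in zip(docs, labels):
        class_tokens.setdefault(label, []).extend(doc.split())
    # 2. Term frequency per class.
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    # 3. Global statistics: total frequency per term and average class length A.
    tf_global = Counter()
    for counts in tf.values():
        tf_global.update(counts)
    avg_len = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)
    # 4. Scoring: W[t, c] = tf[t, c] * log(1 + A / tf[t]).
    # 5. Interpretation: sort within each class and keep the top_n terms.
    top_terms = {}
    for c, counts in tf.items():
        scored = {t: n * math.log(1 + avg_len / tf_global[t]) for t, n in counts.items()}
        top_terms[c] = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return top_terms

topics = c_tf_idf(
    ["the cat sat", "the cat purred", "the dog barked", "the dog ran"],
    labels=[0, 0, 1, 1],
    top_n=4,
)
```

In this toy corpus "cat" ranks first for class 0 and "dog" for class 1, while "the", which is equally frequent in both classes, falls to the bottom of each list.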

Complexity: The dominant computational cost is in term counting over concatenated class texts, $O(\text{total tokens})$. Because the number of classes $K$ is typically much smaller than the number of documents $N$ ($K \ll N$), this method is highly efficient and scalable for large corpora (Grootendorst, 2022).

Illustrative Example:

For two clusters with texts ‘apple apple banana’ (cluster 1) and ‘apple cucumber cucumber’ (cluster 2), the c-TF-IDF formula scores ‘banana’ as highly as ‘apple’ in cluster 1, even though ‘apple’ occurs twice as often there, and ranks ‘cucumber’ well above ‘apple’ in cluster 2. This effect arises because ‘banana’ and ‘cucumber’ each occur in only one cluster, whereas ‘apple’ is shared by both, so its inverse-frequency factor is smaller.
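The weights for this example can be worked out from the c-TF-IDF formula in Section 1, assuming raw counts and the natural logarithm. Each cluster has 3 tokens, so $A = 3$, and the global counts are $tf_{\text{apple}} = 3$, $tf_{\text{banana}} = 1$, $tf_{\text{cucumber}} = 2$:

$$
\begin{aligned}
W_{\text{apple},1} &= 2 \cdot \log\left(1 + \tfrac{3}{3}\right) = 2\log 2 \approx 1.386 \\
W_{\text{banana},1} &= 1 \cdot \log\left(1 + \tfrac{3}{1}\right) = \log 4 \approx 1.386 \\
W_{\text{apple},2} &= 1 \cdot \log\left(1 + \tfrac{3}{3}\right) = \log 2 \approx 0.693 \\
W_{\text{cucumber},2} &= 2 \cdot \log\left(1 + \tfrac{3}{2}\right) = 2\log 2.5 \approx 1.833
\end{aligned}
$$

The equality of the two cluster 1 weights is exact, since $\log 4 = 2\log 2$ in any base.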

3. Motivation and Distinctiveness

The underlying motivation for class-based TF–IDF is to identify terms that are:

  • Frequent within a given class (local representativeness),
  • Rare across other classes (global discriminativeness).

Whereas standard TF–IDF discounts terms prevalent in the entire corpus, class-based variants specifically penalize terms frequent across multiple classes or clusters. This results in topic signatures—the sets of most strongly weighted terms per class—that are semantically coherent and human-interpretable without post hoc filtering or extra modeling steps (Grootendorst, 2022).

For supervised analysis (e.g., TF-IDFC-RF) the approach further incorporates label structure: class-conditional document frequencies (IDFC) and, in some cases, additional relevance or absence-based factors (RF). This increases the robustness and discriminative capacity of feature vectors for downstream classifiers (Carvalho et al., 2020).

4. Comparison to Standard and Other Supervised Weighting Schemes

The main contrast between class-based TF–IDF and classical approaches is summarized below:

| Feature | Standard TF–IDF | c-TF-IDF / TF-IDFC-RF |
|---|---|---|
| Unit of analysis | Document | Class/cluster (pseudo-document) |
| Inverse frequency | Global document frequency | Global class frequency or class contrast |
| Discriminativeness | Penalizes corpus-common terms | Penalizes class-common terms |
| Use case | Generic IR, unsupervised topics | Topic modeling, supervised tasks |

Relative to other supervised schemes, such as Delta TF–IDF or TF-RF, TF-IDFC-RF combines a symmetric class-contrast log-ratio with an absence-based relevance frequency. Together these introduce a two-pronged signal: the scheme prefers terms that are both specific to a class and globally rare (Carvalho et al., 2020).

5. Empirical Assessment and Observed Impact

Within BERTopic (Grootendorst, 2022), c-TF-IDF yields highly coherent topics as measured by normalized pointwise mutual information (NPMI) across datasets such as 20 NewsGroups (TC ≈ 0.166, TD ≈ 0.851 with MPNet embeddings), outperforming classical baselines like LDA and NMF (TC < 0.10). Topic coherence remains stable across embedding models (MPNet, MiniLM, USE, Doc2Vec), while topic diversity remains competitive.

For supervised sentiment tasks, TF-IDFC-RF is empirically validated on multiple benchmarks (Polarity, Amazon Sarcasm, Subjectivity, Movie Review Snippets) (Carvalho et al., 2020). It achieves the single highest weighted $F_1$ scores on two datasets (e.g., Polarity: 88.30% with SVM, 84.25% with NB), and is consistently within the top ranks on others. Its combination of class conditionality and global rarity factors supports robust feature selection across different corpora.

In dynamic or temporally evolving topic modeling scenarios, reusing the global class-based inverse-frequency vector in combination with local term statistics produces improved coherence relative to sequential LDA (NPMI: 0.079 vs. 0.009).
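One way to read this reuse is that the factor $\log(1 + A/tf_t)$ is computed once over the whole corpus, and only the term counts are recomputed per class and time slice. The sketch below follows that reading; the function names `class_idf` and `timestep_weights` and the toy corpus are hypothetical, and raw counts with the natural logarithm are assumed.

```python
import math
from collections import Counter

def class_idf(class_tokens):
    """Global factor log(1 + A / tf_t), computed once over the whole corpus."""
    tf_global = Counter(t for tokens in class_tokens.values() for t in tokens)
    avg_len = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)
    return {t: math.log(1 + avg_len / n) for t, n in tf_global.items()}

def timestep_weights(slice_tokens, idf):
    """Weights for one class in one time slice: local counts times the reused factor."""
    return {t: n * idf[t] for t, n in Counter(slice_tokens).items()}

# Hypothetical toy corpus: each class maps to all of its tokens across time.
corpus = {
    "fruit": "apple apple banana apple kiwi kiwi".split(),
    "veg": "apple cucumber cucumber leek".split(),
}
idf = class_idf(corpus)

# Tokens of the "fruit" class that fall into an early and a late time slice.
early = timestep_weights("apple apple banana".split(), idf)
late = timestep_weights("apple kiwi kiwi".split(), idf)
```

Because the global factor is fixed, updating a topic's representation for a new time slice only requires recounting the tokens in that slice.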

Computation of c-TF-IDF scores itself is not a runtime bottleneck; in neural topic modeling pipelines such as BERTopic, most wall-time is consumed by document embedding and clustering. Thus, class-based TF–IDF enables extremely fast topic signature extraction once clusters are known.

6. Applicability, Limitations, and Extensions

Class-based TF–IDF is directly applicable in:

  • Cluster-based topic modeling—turning embeddings-based clusters into topic–word distributions without generative modeling (e.g., LDA).
  • Supervised text classification—as in TF-IDFC-RF, where class contrast is essential for discriminative feature generation.
  • Dynamic topic tracking—supporting per-batch or streaming signature updates by reusing c-IDF statistics.
  • Flexible custom analysis—enabling topic splits along arbitrary metadata axes due to efficiency.

Limitations include the requirement for labeled data in supervised weighting schemes (TF-IDFC-RF), and potential instability with extremely low-frequency terms (mitigated by smoothing). While highly effective for binary sentiment analysis, supervised approaches may require further adjustment for multi-class settings, such as aggregating or normalizing class-based contrast factors.

A plausible implication is that class-based approaches can subsume some of the discriminative capabilities traditionally attributed to more complex probabilistic models, provided a high-quality clustering or class assignment is available.

7. Summary and Research Significance

Class-based TF–IDF schemes capitalize on group structure, treating each class or cluster as a pseudo-document and rescaling term weights with a variant of inverse frequency that emphasizes discriminativeness across classes. In neural topic modeling (BERTopic), this procedure translates directly from clustered embeddings to coherent topic representations without generative model overhead, yielding state-of-the-art coherence and diversity on public benchmarks (Grootendorst, 2022). In supervised learning (TF-IDFC-RF), class-conditional IDF and absence-based relevance factors significantly boost classification performance over unsupervised and several prior supervised alternatives (Carvalho et al., 2020). The result is a robust, computationally lightweight approach to extracting interpretable and discriminative representations for both unsupervised topic extraction and supervised text classification.
