
Class-Based TF-IDF Methods

Updated 4 April 2026
  • Class-based TF-IDF is a supervised term-weighting framework that leverages class or cluster information to emphasize discriminative terms.
  • It adapts traditional TF-IDF by aggregating term frequencies over classes, thereby improving application in text classification, topic modeling, and feature engineering.
  • Empirical studies report that variants such as TF-CR and TF-IDFC-RF improve metrics such as macro-F1 on most of the datasets tested.

Class-based TF-IDF refers to supervised term-weighting schemes that leverage class or cluster information to enhance term discriminativeness over purely document-centric metrics. Such methods are central to recent advances in text classification, topic modeling, and feature engineering, where class-aware weighting schemes—such as Term Frequency-Category Ratio (TF-CR), class-based TF-IDF (c-TF-IDF), and TF-IDFC-RF—serve to foreground terms that distinguish classes or clusters. These approaches modify the traditional TF-IDF framework by explicitly incorporating label or cluster assignments into term-weight computation, yielding improved interpretability, classifier performance, and feature relevance.

1. Formal Definitions of Class-based TF-IDF Variants

The earliest and most prevalent document-level weighting scheme is standard TF-IDF, defined as

$$W_{t,d} = \mathrm{tf}_{t,d} \cdot \log \left( \frac{N}{\mathrm{df}_t} \right)$$

with $\mathrm{tf}_{t,d}$ the count of term $t$ in document $d$, $N$ the document count, and $\mathrm{df}_t$ the number of documents containing $t$.
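As a baseline for the class-based variants that follow, the standard weighting can be sketched in a few lines. This is a minimal illustration that assumes pre-tokenized documents, raw counts for the term frequency, and a natural logarithm; the function name `tfidf` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (hypothetical data)
docs = [
    ["cheap", "flights", "cheap", "hotels"],
    ["cheap", "loans"],
    ["football", "scores"],
]
w = tfidf(docs)
```

Note that a term occurring in every document receives weight zero under this form, since $\log(N/\mathrm{df}_t) = \log 1 = 0$.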

Class-based TF-IDF procedures redefine the unit of analysis, aggregating counts over classes or clusters and adjusting the “inverse document frequency” to penalize terms recurrent in many classes or clusters:

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log \left( 1 + \frac{A}{\mathrm{tf}_t} \right)$$

where $\mathrm{tf}_{t,c}$ is the frequency of term $t$ in class (or cluster) $c$, $\mathrm{tf}_t$ its global frequency, and $A$ the average total number of tokens per class. This is the c-TF-IDF form (Grootendorst, 2022).
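The class-level formula above can be sketched as follows. The sketch assumes each class is treated as one concatenated pseudo-document and uses raw in-class counts for the term frequency; the function name `c_tf_idf` and the toy clusters are illustrative, and this is not the BERTopic implementation.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a class/cluster id to a list of tokenized documents."""
    # tf_c[c][t]: frequency of term t in class c (all documents in c pooled)
    tf_c = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in class_docs.items()
    }
    # tf_all[t]: frequency of term t across all classes
    tf_all = Counter()
    for counts in tf_c.values():
        tf_all.update(counts)
    # A: average number of tokens per class
    avg_tokens = sum(tf_all.values()) / len(tf_c)
    return {
        c: {t: n * math.log(1 + avg_tokens / tf_all[t]) for t, n in counts.items()}
        for c, counts in tf_c.items()
    }

# Toy clusters (hypothetical data)
clusters = {
    "sports": [["match", "goal", "team"], ["team", "win", "the"]],
    "finance": [["stock", "market", "the"], ["market", "rate", "the"]],
}
scores = c_tf_idf(clusters)
top_sports = max(scores["sports"], key=scores["sports"].get)
```

In this toy input, "the" appears in both clusters, so its large global frequency shrinks the logarithmic factor and it ranks below terms concentrated in a single cluster.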

TF-CR (Zubiaga, 2020) multiplies a within-class term frequency by a category ratio:

$$\mathrm{TF\text{-}CR}(t,c) = \frac{\mathrm{tf}_{t,c}}{N_c} \cdot \frac{\mathrm{tf}_{t,c}}{\mathrm{tf}_t}$$

where $\mathrm{tf}_{t,c}$ is the number of times $t$ occurs in class $c$, $N_c$ the token count in $c$, and $\mathrm{tf}_t$ the total number of occurrences of $t$ across all classes.
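A small sketch may make the TF-CR quantities concrete. It assumes the score is the product of two ratios, (occurrences of the term in the class / tokens in the class) and (occurrences of the term in the class / total occurrences of the term); the function name `tf_cr` and the toy data are illustrative only.

```python
from collections import Counter

def tf_cr(class_tokens):
    """class_tokens maps a class label to the list of all tokens seen in that class."""
    in_class = {c: Counter(toks) for c, toks in class_tokens.items()}
    # total[t]: occurrences of term t across all classes
    total = Counter()
    for counts in in_class.values():
        total.update(counts)
    weights = {}
    for c, counts in in_class.items():
        n_c = sum(counts.values())  # token count of class c
        weights[c] = {
            # (term frequency within the class) * (share of the term's uses in this class)
            t: (n / n_c) * (n / total[t])
            for t, n in counts.items()
        }
    return weights

# Toy labeled data (hypothetical)
data = {
    "spam": ["free", "offer", "free", "the"],
    "ham": ["meeting", "the", "the", "notes"],
}
w = tf_cr(data)
```

Here "free" is both frequent in and exclusive to the "spam" class, so it scores 0.5, while "the" is shared across classes and is pulled down by the second ratio.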

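TF-IDFC-RF (Carvalho et al., 2020) scores a term $t$ from four document counts: the numbers of documents containing $t$ in the target class and in the other classes, and their complements (the documents in each group that do not contain $t$). The final score combines class-skew and selectivity.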
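The four document counts that feed this scheme can be gathered as sketched below. The sketch only builds the contingency counts and deliberately does not compute the final TF-IDFC-RF score; the function name `contingency`, the dictionary keys, and the toy reviews are all hypothetical.

```python
def contingency(docs, labels, term, target):
    """Count documents with/without `term`, split by target class vs. all other classes."""
    counts = {
        "with_target": 0,     # documents in the target class that contain the term
        "with_other": 0,      # documents in other classes that contain the term
        "without_target": 0,  # documents in the target class that lack the term
        "without_other": 0,   # documents in other classes that lack the term
    }
    for doc, label in zip(docs, labels):
        presence = "with" if term in doc else "without"
        side = "target" if label == target else "other"
        counts[f"{presence}_{side}"] += 1
    return counts

# Toy sentiment data (hypothetical)
docs = [
    ["great", "movie"],
    ["great", "acting"],
    ["boring", "movie"],
    ["great", "waste"],
]
labels = ["pos", "pos", "neg", "neg"]
table = contingency(docs, labels, term="great", target="pos")
```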

These variations tailor the “informativeness” factor to class- or cluster-specific patterns—emphasizing intra-class frequency and inter-class exclusivity.

2. Comparison to Traditional TF-IDF

Traditional TF-IDF does not use any form of label or cluster information. Its weighting is based strictly on global document frequency. In contrast:

  • c-TF–IDF treats each cluster (or topic) as an aggregated “pseudo-document” and calculates importance based on frequency within a cluster versus its spread across clusters (Grootendorst, 2022).
  • TF-CR and TF-IDFC-RF employ label information; e.g., TF-CR exclusively rewards terms frequent in, and unique to, a class, while penalizing those distributed broadly.
  • Class-based schemes can employ class-level or cluster-level analogues of inverse document frequency, leading to distinctions such as “class-based idf prime,” document-in-class counts, or ratio terms.

These changes yield feature spaces aligned to discriminative axes set by class or cluster boundaries, improving downstream model performance and interpretability.

3. Applications: Clustering, Classification, and Topic Modeling

Class-based TF-IDF procedures have been deployed across several text processing pipelines:

  • Topic Modeling (c-TF–IDF): BERTopic clusters transformer-based document embeddings, then constructs topic representations via c-TF–IDF, yielding term rankings tailored per cluster. This diverges from centroid-based topic models by using cluster-level statistics, enhancing topic coherence and interpretability (Grootendorst, 2022).
  • Supervised Text Classification: TF-CR and TF-IDFC-RF generate features highly responsive to class label distributions. For embedding-based text classifiers, TF-CR is applied by computing, for each class, a weighted sum of pre-trained word vectors (e.g., Word2Vec, GloVe), using TF-CR as coefficients, and concatenating the class-specific vectors to form the input representation (Zubiaga, 2020).
  • Sentiment Analysis: TF-IDFC-RF combines term frequency in the class, relative document coverage, and “absence in others” to select discriminative terms for binary sentiment (positive/negative) or affective classification (Carvalho et al., 2020).

These mechanisms are “plug-and-play” with popular vector space models, requiring no additional parameters or external resources.
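The embedding-based use of TF-CR described above (one weighted sum of word vectors per class, concatenated into a single input vector) can be sketched as follows. The two-dimensional embeddings and the weight values are toy stand-ins chosen for the example, not real Word2Vec or GloVe vectors or real TF-CR scores, and details such as normalization are left out.

```python
def class_weighted_embedding(tokens, embeddings, weights, classes):
    """Concatenate one weighted sum of word vectors per class.

    embeddings: {term: list of floats} (toy stand-in for pre-trained vectors)
    weights:    {class: {term: class-based weight, e.g. TF-CR}}
    """
    dim = len(next(iter(embeddings.values())))
    features = []
    for c in classes:
        vec = [0.0] * dim
        for tok in tokens:
            if tok not in embeddings:
                continue  # skip out-of-vocabulary tokens
            coef = weights[c].get(tok, 0.0)  # unseen (term, class) pairs get weight 0
            for i, x in enumerate(embeddings[tok]):
                vec[i] += coef * x
        features.extend(vec)  # class-specific vectors are concatenated in class order
    return features

# Toy inputs (hypothetical values)
embeddings = {"free": [1.0, 0.0], "the": [0.0, 1.0], "notes": [1.0, 1.0]}
weights = {"spam": {"free": 0.5, "the": 0.25}, "ham": {"the": 0.5, "notes": 0.25}}
x = class_weighted_embedding(["free", "the", "unknown"], embeddings, weights, ["spam", "ham"])
```

The resulting vector has length (number of classes) × (embedding dimension), so the representation grows linearly with the number of classes.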

4. Experimental Findings and Quantitative Impact

Empirical comparisons confirm the effectiveness of class-based TF-IDF-type weighting:

TF-CR (macro-F₁, 90K training instances; “tw2v” embeddings):

| Dataset | No Weight | TF-IDF | KLD | TF-CR |
|---|---|---|---|---|
| 20NewsGroups | 0.705 | 0.893 | 0.860 | 0.930 |
| Hate Speech | 0.661 | 0.556 | 0.643 | 0.648 |
| Newsspace200 | 0.544 | 0.507 | 0.586 | 0.595 |
| ODPtweets | 0.325 | 0.354 | 0.362 | 0.458 |

TF-CR achieved the highest macro-F₁ in 7 out of 8 datasets at larger scales, with absolute gains over the next best strategy of up to 0.13. The advantage scales with dataset size, reflecting the improved exploitation of class-conditional statistics. TF-CR remains robust for datasets with as few as 5–10K instances (Zubiaga, 2020).
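The "gain over the next best strategy" can be read off the four rows shown in the table with a short calculation. The sketch below only re-uses the numbers printed above; the function name is illustrative, and the four remaining datasets from the study are not included because they do not appear here.

```python
# Macro-F1 values copied from the table above (four of the eight datasets)
rows = {
    "20NewsGroups": {"No Weight": 0.705, "TF-IDF": 0.893, "KLD": 0.860, "TF-CR": 0.930},
    "Hate Speech": {"No Weight": 0.661, "TF-IDF": 0.556, "KLD": 0.643, "TF-CR": 0.648},
    "Newsspace200": {"No Weight": 0.544, "TF-IDF": 0.507, "KLD": 0.586, "TF-CR": 0.595},
    "ODPtweets": {"No Weight": 0.325, "TF-IDF": 0.354, "KLD": 0.362, "TF-CR": 0.458},
}

def gain_over_best_alternative(scores, method="TF-CR"):
    """Difference between `method` and the best of the remaining strategies."""
    best_other = max(v for k, v in scores.items() if k != method)
    return round(scores[method] - best_other, 3)

gains = {name: gain_over_best_alternative(scores) for name, scores in rows.items()}
```

For these rows the gain is positive on three datasets and slightly negative on Hate Speech, where the unweighted baseline scores highest.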

TF-IDFC-RF achieves the highest weighted F₁ on the “Polarity” and “Sarcasm” datasets among ten schemes, outperforms TF-IDF and IGM-based alternatives at most feature-set sizes, and remains among the top two schemes across all tested classification datasets (Carvalho et al., 2020).

5. Methodological and Design Considerations

Practical aspects of implementing class-based TF-IDF schemes include:

  • Smoothing and Normalization: c-TF–IDF (Grootendorst, 2022) adopts $\log(1 + A/\mathrm{tf}_t)$ with $+1$ smoothing for numerical stability; L1 normalization is recommended for comparability across clusters.
  • Scalability: All examined schemes are parameter-free and rely only on labeled data frequency counts. They are directly applicable to large and moderate datasets but may offer limited gains over unweighted embeddings at very small scales.
  • Flexibility: These methods can operate over any vectorizer, including embedding summation, n-gram expansions, or VSMs.
  • Noise Control: Hapax legomena and very rare terms can be pruned based on minimum frequency thresholds pre- or post-weighting.

No external semantic resources, extensive hyperparameter tuning, or smoothing constants (outside c-TF–IDF’s $+1$) are required. This reduces risk of overfitting and facilitates rapid feature extraction.
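Two of the practical steps listed above, pruning rare terms and L1-normalizing a class's weight vector, can be sketched as small helpers. The threshold `min_count=2` and both function names are assumptions made for the example rather than values recommended by the cited papers.

```python
def prune_rare(term_counts, min_count=2):
    """Drop terms whose count is below min_count (hapax legomena when min_count=2)."""
    return {t: n for t, n in term_counts.items() if n >= min_count}

def l1_normalize(weights):
    """Scale a {term: weight} vector so that its absolute values sum to 1."""
    total = sum(abs(v) for v in weights.values())
    if total == 0:
        return dict(weights)  # nothing to scale; avoid division by zero
    return {t: v / total for t, v in weights.items()}

# Toy values (hypothetical)
counts = {"team": 4, "goal": 3, "zyzzyva": 1}
kept = prune_rare(counts)
normalized = l1_normalize({"team": 3.0, "goal": 1.0})
```

After normalization, weight vectors from clusters of very different sizes live on the same scale, which is what makes them comparable.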

6. Theoretical Motivation and Interpretative Value

Class-based TF-IDF variants are motivated by the need to identify terms most discriminative for class or cluster prediction:

  • Discriminativeness: These schemes assign maximal weight to terms both frequent in (cluster/class) $c$ and rare outside it, down-weighting ubiquitous or non-informative words, as seen in TF-CR’s multiplicative $\mathrm{TF} \times \mathrm{CR}$ form (Zubiaga, 2020).
  • Interpretability: Feature vectors constructed using these weighting schemes align with human intuition regarding class-distinctive “buzzwords” or topic markers.
  • Extension to Multi-class/Clustering Setups: Although some (e.g., TF-IDFC-RF) are derived from binary-class settings, their logic suggests potential for extension using entropy or variance-based generalizations to handle multi-class or clustering problems (Carvalho et al., 2020).

A plausible implication is that these mechanisms contribute to enhanced classification, topic coherence, and downstream model transparency by tightly coupling term statistics with class information.

7. Extensions and Open Directions

Recent work proposes:

  • Dynamic Smoothing: In dynamic topic modeling, c-TF–IDF vectors may be averaged temporally to impose time-wise smoothness in topic representations (Grootendorst, 2022).
  • Parameter Learning: Learning TF exponents or idf scaling factors via cross-validation could further boost discriminative capacity (Carvalho et al., 2020).
  • Integration with Embeddings: Combining class-aware weighting with neural embedding architectures—optionally incorporating attention—could enable richer feature representations.

Generalizing the underlying principles to diverse text understanding tasks and exploring alternative aggregations (e.g., entropy-driven class exclusivity) remain active research areas.
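The temporal averaging idea mentioned under Dynamic Smoothing can be sketched as below. This is one possible reading, in which each timestep's vector is averaged with the unsmoothed vector of the preceding timestep; it is not taken from the BERTopic code, and the function name and toy weights are hypothetical.

```python
def smooth_over_time(vectors):
    """vectors: chronological list of per-timestep {term: weight} dicts for one topic."""
    if not vectors:
        return []
    smoothed = [dict(vectors[0])]  # the first timestep has no predecessor
    for prev, cur in zip(vectors, vectors[1:]):
        terms = set(prev) | set(cur)
        # missing terms count as weight 0 in the timestep where they are absent
        smoothed.append({t: (prev.get(t, 0.0) + cur.get(t, 0.0)) / 2 for t in terms})
    return smoothed

# Toy topic representation at two timesteps (hypothetical weights)
timeline = [
    {"vaccine": 0.5, "lockdown": 0.5},
    {"vaccine": 0.25, "booster": 0.75},
]
result = smooth_over_time(timeline)
```

Terms that disappear at a timestep fade out gradually instead of vanishing, which is the smoothness effect the averaging is meant to provide.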

