Class-Based TF-IDF Methods

Updated 4 April 2026

Class-based TF-IDF is a family of term-weighting schemes that use class or cluster information to emphasize discriminative terms.
It adapts traditional TF-IDF by aggregating term frequencies over classes or clusters, which makes it better suited to text classification, topic modeling, and feature engineering.
Empirical studies report that variants such as TF-CR and TF-IDFC-RF improve metrics such as macro-F1 and weighted F1 over standard TF-IDF on most of the datasets tested.

Class-based TF-IDF refers to supervised term-weighting schemes that leverage class or cluster information to enhance term discriminativeness over purely document-centric metrics. Such methods are central to recent advances in text classification, topic modeling, and feature engineering, where class-aware weighting schemes—such as Term Frequency-Category Ratio (TF-CR), class-based TF-IDF (c-TF-IDF), and TF-IDFC-RF—serve to foreground terms that distinguish classes or clusters. These approaches modify the traditional TF-IDF framework by explicitly incorporating label or cluster assignments into term-weight computation, yielding improved interpretability, classifier performance, and feature relevance.

1. Formal Definitions of Class-based TF-IDF Variants

The earliest and most prevalent document-level weighting scheme is standard TF-IDF, defined as

$W_{t,d} = \mathrm{tf}_{t,d} \cdot \log \left( \frac{N}{\mathrm{df}_t} \right)$

with $\mathrm{tf}_{t,d}$ the count of term $t$ in document $d$, $N$ the document count, and $\mathrm{df}_t$ the number of documents containing $t$.
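
As a point of reference for the class-based variants, the document-level formula can be sketched in a few lines of Python. This is a minimal illustration on a toy corpus, assuming raw counts and the natural logarithm; the function name `tf_idf` and the example documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df_t: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_{t,d}: raw count of t in d
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

# Toy corpus (assumed for illustration only)
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "ran"]]
w = tf_idf(docs)
```

Note that no labels enter the computation: a term's weight depends only on how many documents contain it.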

Class-based TF-IDF procedures redefine the unit of analysis, aggregating counts over classes or clusters and adjusting the “inverse document frequency” to penalize terms recurrent in many classes or clusters:

BERTopic’s c-TF–IDF (Grootendorst, 2022):

$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log \left( 1 + \frac{A}{\mathrm{tf}_t} \right)$

where $\mathrm{tf}_{t,c}$ is the frequency of term $t$ in class (or cluster) $c$, $\mathrm{tf}_t$ its global frequency, and $A$ the average total number of tokens per class.
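
The formula can be sketched with NumPy as follows. This is a minimal illustration, not BERTopic's implementation: it assumes a dense class-by-term count matrix in which every term occurs at least once, and it uses raw counts for the term-frequency factor. The function name `c_tf_idf` and the toy counts are hypothetical.

```python
import numpy as np

def c_tf_idf(counts):
    """counts: (n_classes, n_terms) array of term counts per class."""
    counts = np.asarray(counts, dtype=float)
    tf_t = counts.sum(axis=0)                     # global frequency of each term
    avg_tokens = counts.sum() / counts.shape[0]   # A: average tokens per class
    idf = np.log(1.0 + avg_tokens / tf_t)         # class-level "idf" factor
    return counts * idf                           # W_{t,c}

# Toy counts: rows are classes, columns are the terms "goal", "the", "vote"
counts = np.array([[4, 1, 0],
                   [0, 1, 3]])
W = c_tf_idf(counts)
top_terms = W.argmax(axis=1)  # index of the highest-weighted term per class
```

In this toy case the shared term "the" receives the largest logarithmic factor because it is globally rarest, but its low within-class count keeps it below the class-specific terms.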

TF-CR (Zubiaga, 2020):

$\mathrm{TF\text{-}CR}_{t,c} = \frac{\mathrm{tf}_{t,c}}{N_c} \cdot \frac{\mathrm{tf}_{t,c}}{\mathrm{tf}_t}$

where $\mathrm{tf}_{t,c}$ is the number of times $t$ occurs in class $c$, $N_c$ the token count in $c$, and $\mathrm{tf}_t$ the total occurrence of $t$.
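
A minimal sketch of this weighting, assuming TF-CR is the product of a within-class term frequency (count in the class over the class's token count) and a category ratio (count in the class over the term's total count), as described in Zubiaga (2020). The function name `tf_cr` and the toy counts are hypothetical, and every term is assumed to occur at least once.

```python
import numpy as np

def tf_cr(counts):
    """counts: (n_classes, n_terms) array of term counts per class."""
    counts = np.asarray(counts, dtype=float)
    class_tokens = counts.sum(axis=1, keepdims=True)  # token count of each class
    term_totals = counts.sum(axis=0, keepdims=True)   # total occurrences of each term
    tf = counts / class_tokens   # how frequent the term is within the class
    cr = counts / term_totals    # how concentrated the term is in the class
    return tf * cr

# Toy counts: rows are classes, columns are the terms "goal", "the", "vote"
counts = np.array([[4, 1, 0],
                   [0, 1, 3]])
weights = tf_cr(counts)
```

A term that is frequent in a class but spread evenly over all classes is held down by the ratio factor, while a term that is exclusive to a class but rare in it is held down by the frequency factor.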

TF-IDFC-RF (Carvalho et al., 2020):

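The closed-form weight is given in Carvalho et al. (2020). It is built from four document counts: the number of documents containing $t$ in the target class and in the other classes, and the complements of those two counts (documents that do not contain $t$). The final score combines class-skew and selectivity.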

These variations tailor the “informativeness” factor to class- or cluster-specific patterns—emphasizing intra-class frequency and inter-class exclusivity.

2. Comparison to Traditional TF-IDF

Traditional TF-IDF does not use any form of label or cluster information. Its weighting is based strictly on global document frequency. In contrast:

c-TF–IDF treats each cluster (or topic) as an aggregated “pseudo-document” and calculates importance based on frequency within a cluster versus its spread across clusters (Grootendorst, 2022).
TF-CR and TF-IDFC-RF employ label information; e.g., TF-CR rewards terms that are both frequent in a class and concentrated in it, while penalizing those distributed broadly across classes.
Class-based schemes can employ class-level or cluster-level analogues of inverse document frequency, leading to distinctions such as “class-based idf prime,” document-in-class counts, or ratio terms.

These changes yield feature spaces aligned to discriminative axes set by class or cluster boundaries, improving downstream model performance and interpretability.
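
The shift in the unit of analysis can be illustrated by pooling documents into per-class pseudo-documents before any weighting is applied. The sketch below assumes tokenized documents with one label (or cluster id) each; the function name `class_term_counts` and the example data are hypothetical.

```python
from collections import Counter

def class_term_counts(docs, labels):
    """Pool tokenized documents by label into per-class term counts."""
    pooled = {}
    for tokens, label in zip(docs, labels):
        # Each class accumulates the counts of all of its documents
        pooled.setdefault(label, Counter()).update(tokens)
    return pooled

# Toy labeled corpus (assumed for illustration only)
docs = [["goal", "the", "goal"], ["the", "vote"], ["goal", "goal"], ["vote", "vote"]]
labels = ["sport", "politics", "sport", "politics"]
pooled = class_term_counts(docs, labels)
```

The resulting per-class counts are the input that class-based schemes weight, in place of the per-document counts used by traditional TF-IDF.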

3. Applications: Clustering, Classification, and Topic Modeling

Class-based TF-IDF procedures have been deployed across several text processing pipelines:

Topic Modeling (c-TF–IDF): BERTopic clusters transformer-based document embeddings, then constructs topic representations via c-TF–IDF, yielding term rankings tailored per cluster. This diverges from centroid-based topic models by using cluster-level statistics, enhancing topic coherence and interpretability (Grootendorst, 2022).
Supervised Text Classification: TF-CR and TF-IDFC-RF generate features highly responsive to class label distributions. For embedding-based text classifiers, TF-CR is applied by computing, for each class, a weighted sum of pre-trained word vectors (e.g., Word2Vec, GloVe), using TF-CR as coefficients, and concatenating the class-specific vectors to form the input representation (Zubiaga, 2020).
Sentiment Analysis: In TF-IDFC-RF, term frequency in class, relative document coverage, and “absence in others” are combined to capture selective terms for binary sentiment (positive/negative) or affective classification (Carvalho et al., 2020).

These mechanisms are “plug-and-play” with popular vector space models, requiring no additional parameters or external resources.
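
The embedding-based use of TF-CR described above (one weighted sum of word vectors per class, then concatenation) might be sketched as follows. This is an illustration under stated assumptions: the toy two-dimensional vectors stand in for pre-trained embeddings, the weight matrix is a precomputed class-by-term array, and the names `class_weighted_embedding`, `vocab`, and `embeddings` are hypothetical.

```python
import numpy as np

def class_weighted_embedding(tokens, vocab, embeddings, weights):
    """tokens: list of str; vocab: {term: index};
    embeddings: (n_terms, dim); weights: (n_classes, n_terms)."""
    n_classes, dim = weights.shape[0], embeddings.shape[1]
    parts = np.zeros((n_classes, dim))
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is None:  # skip out-of-vocabulary tokens
            continue
        # Add the word vector to every class row, scaled by that class's weight
        parts += weights[:, idx, None] * embeddings[idx]
    return parts.reshape(-1)  # concatenate the class-specific vectors

vocab = {"goal": 0, "the": 1, "vote": 2}
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy vectors
weights = np.array([[0.8, 0.1, 0.0], [0.0, 0.125, 0.75]])    # class-by-term weights
features = class_weighted_embedding(["goal", "the", "unknown"], vocab, embeddings, weights)
```

The output dimensionality is the embedding size times the number of classes, so the representation grows linearly with the label set.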

4. Experimental Findings and Quantitative Impact

Empirical comparisons confirm the effectiveness of class-based TF-IDF-type weighting:

TF-CR (macro-F₁, 90K training instances; “tw2v” embeddings):

Dataset	No Weight	TF-IDF	KLD	TF-CR
20NewsGroups	0.705	0.893	0.860	0.930
Hate Speech	0.661	0.556	0.643	0.648
Newsspace200	0.544	0.507	0.586	0.595
ODPtweets	0.325	0.354	0.362	0.458

TF-CR achieved the highest macro-F₁ in 7 out of 8 datasets at larger scales, with absolute gains over the next best strategy of up to 0.13. The advantage scales with dataset size, reflecting the improved exploitation of class-conditional statistics. TF-CR remains robust for datasets with as few as 5–10K instances (Zubiaga, 2020).

TF-IDFC-RF achieves the highest weighted F₁ on the “Polarity” and “Sarcasm” datasets among ten schemes, outperforming TF-IDF and IGM-based alternatives at most feature-set sizes, and remains among the top two schemes across all tested classification datasets (Carvalho et al., 2020).

5. Methodological and Design Considerations

Practical aspects of implementing class-based TF-IDF schemes include:

Smoothing and Normalization: c-TF-IDF (Grootendorst, 2022) adopts $\log \left( 1 + \frac{A}{\mathrm{tf}_t} \right)$ with $+1$ smoothing for numerical stability; L1 normalization is recommended for comparability across clusters.
Scalability: All examined schemes are parameter-free and rely only on frequency counts from labeled (or clustered) data. They are directly applicable to large and moderate datasets but may offer limited gains over unweighted embeddings at very small scales.
Flexibility: These methods can operate over any vectorizer, including embedding summation, n-gram expansions, or VSMs.
Noise Control: Hapax legomena and very rare terms can be pruned based on minimum frequency thresholds pre- or post-weighting.

No external semantic resources, extensive hyperparameter tuning, or smoothing constants (outside c-TF-IDF’s $+1$) are required. This reduces the risk of overfitting and facilitates rapid feature extraction.
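
The pruning and normalization steps listed above can be sketched together. This is one possible arrangement, assuming pruning is applied before normalization and that the threshold is on a term's total count; the function name `prune_and_normalize` and the default `min_count=2` are hypothetical choices.

```python
import numpy as np

def prune_and_normalize(matrix, min_count=2):
    """matrix: (n_classes, n_terms) counts or weights.
    Returns the L1-normalized pruned matrix and the boolean keep-mask."""
    matrix = np.asarray(matrix, dtype=float)
    keep = matrix.sum(axis=0) >= min_count        # drop hapax legomena and rare terms
    kept = matrix[:, keep]
    row_sums = np.abs(kept).sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                 # leave all-zero rows unchanged
    return kept / row_sums, keep                  # each non-empty row now sums to 1

# Toy counts: the last term occurs only once overall and is pruned
counts = np.array([[4, 1, 0, 1],
                   [0, 1, 3, 0]])
normalized, keep = prune_and_normalize(counts)
```

After L1 normalization each class row sums to one, so weights from a large class and a small class can be compared on the same scale.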

6. Theoretical Motivation and Interpretative Value

Class-based TF-IDF variants are motivated by the need to identify terms most discriminative for class or cluster prediction:

Discriminativeness: These schemes assign maximal weight to terms both frequent in (cluster/class) $c$ and rare outside it, down-weighting ubiquitous or non-informative words, as seen in TF-CR’s multiplicative $\mathrm{TF} \times \mathrm{CR}$ form (Zubiaga, 2020).
Interpretability: Feature vectors constructed using these weighting schemes align with human intuition regarding class-distinctive “buzzwords” or topic markers.
Extension to Multi-class/Clustering Setups: Although some (e.g., TF-IDFC-RF) are derived from binary-class settings, their logic suggests potential for extension using entropy or variance-based generalizations to handle multi-class or clustering problems (Carvalho et al., 2020).

A plausible implication is that these mechanisms contribute to enhanced classification, topic coherence, and downstream model transparency by tightly coupling term statistics with class information.

7. Extensions and Open Directions

Recent work proposes:

Dynamic Smoothing: In dynamic topic modeling, c-TF–IDF vectors may be averaged temporally to impose time-wise smoothness in topic representations (Grootendorst, 2022).
Parameter Learning: Learning TF exponents or idf scaling factors via cross-validation could further boost discriminative capacity (Carvalho et al., 2020).
Integration with Embeddings: Combining class-aware weighting with neural embedding architectures—optionally incorporating attention—could enable richer feature representations.
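
The temporal averaging mentioned under Dynamic Smoothing admits a simple reading: L1-normalize a topic's weight vector at each timestep, then average it with the normalized vector of the preceding timestep. The sketch below follows that reading as an assumption rather than as a description of any particular library; the function name `smooth_over_time` and the toy data are hypothetical.

```python
import numpy as np

def smooth_over_time(vectors):
    """vectors: (n_timesteps, n_terms) weights for one topic, in time order."""
    v = np.asarray(vectors, dtype=float)
    norms = np.abs(v).sum(axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave empty timesteps unchanged
    v = v / norms                    # L1-normalize each timestep
    smoothed = v.copy()
    # Average each timestep with its predecessor; the first has none
    smoothed[1:] = (v[1:] + v[:-1]) / 2.0
    return smoothed

# Toy topic with two terms observed over three timesteps
vectors = np.array([[2.0, 2.0],
                    [4.0, 0.0],
                    [0.0, 1.0]])
smoothed = smooth_over_time(vectors)
```

Averaging with the unsmoothed predecessor keeps the window at two timesteps; averaging with the already smoothed predecessor would instead produce an exponentially decaying memory.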

Generalizing the underlying principles to diverse text understanding tasks and exploring alternative aggregations (e.g., entropy-driven class exclusivity) remain active research areas.

References:

(Grootendorst, 2022) BERTopic: Neural topic modeling with a class-based TF-IDF procedure
(Zubiaga, 2020) Exploiting Class Labels to Boost Performance on Embedding-based Text Classification
(Carvalho et al., 2020) TF-IDFC-RF: A Novel Supervised Term Weighting Scheme
