Class-Based TF-IDF Methods
- Class-based TF-IDF is a supervised term-weighting framework that leverages class or cluster information to emphasize discriminative terms.
- It adapts traditional TF-IDF by aggregating term frequencies over classes, which makes it better suited to text classification, topic modeling, and feature engineering.
- Empirical studies report that variants such as TF-CR and TF-IDFC-RF improve F1 scores (macro-F1 for TF-CR, weighted F1 for TF-IDFC-RF) relative to standard TF-IDF in most reported comparisons.
Class-based TF-IDF refers to supervised term-weighting schemes that leverage class or cluster information to enhance term discriminativeness over purely document-centric metrics. Such methods are central to recent advances in text classification, topic modeling, and feature engineering, where class-aware weighting schemes—such as Term Frequency-Category Ratio (TF-CR), class-based TF-IDF (c-TF-IDF), and TF-IDFC-RF—serve to foreground terms that distinguish classes or clusters. These approaches modify the traditional TF-IDF framework by explicitly incorporating label or cluster assignments into term-weight computation, yielding improved interpretability, classifier performance, and feature relevance.
1. Formal Definitions of Class-based TF-IDF Variants
The earliest and most prevalent document-level weighting scheme is standard TF-IDF, defined as
$$\text{tf-idf}(t, d) = tf_{t,d} \cdot \log \frac{N}{df_t}$$
with $tf_{t,d}$ the count of term $t$ in document $d$, $N$ the document count, and $df_t$ the number of documents containing $t$.
Class-based TF-IDF procedures redefine the unit of analysis, aggregating counts over classes or clusters and adjusting the “inverse document frequency” to penalize terms recurrent in many classes or clusters:
- BERTopic’s c-TF–IDF (Grootendorst, 2022):
$$W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{tf_t}\right)$$
where $tf_{t,c}$ is the frequency of term $t$ in class (or cluster) $c$, $tf_t$ its global frequency, and $A$ the average total number of tokens per class.
- TF-CR (Zubiaga, 2020):
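$$\text{TF-CR}(w, c) = \frac{|w_c|}{N_c} \cdot \frac{|w_c|}{|w|}$$
where $|w_c|$ is the number of times $w$ occurs in class $c$, $N_c$ the token count in $c$, and $|w|$ the total number of occurrences of $w$ across all classes.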
- TF-IDFC-RF (Carvalho et al., 2020): the weight of a term $t$ is computed from four document counts: the numbers of documents containing $t$ in the target class and in the other classes, and their complements (the documents in each group that do not contain $t$). The final score combines class-skew and selectivity.
These variations tailor the “informativeness” factor to class- or cluster-specific patterns—emphasizing intra-class frequency and inter-class exclusivity.
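As a rough illustration of the c-TF-IDF and TF-CR definitions above, the following sketch computes both weights from tokenized, labeled documents. The function names (`class_term_counts`, `c_tf_idf`, `tf_cr`) and the toy corpus are hypothetical, and the formulas follow the descriptions in this section rather than any particular library's implementation.

```python
import math
from collections import Counter


def class_term_counts(docs, labels):
    """Aggregate token counts per class: {class: Counter(term -> count)}."""
    counts = {}
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return counts


def global_counts(counts):
    """Total count of every term across all classes."""
    total = Counter()
    for class_counter in counts.values():
        total.update(class_counter)
    return total


def c_tf_idf(counts):
    """W[c][t] = tf_{t,c} * log(1 + A / tf_t), with A the average tokens per class."""
    total = global_counts(counts)
    avg_tokens = sum(total.values()) / len(counts)  # A
    return {
        c: {t: tf * math.log(1 + avg_tokens / total[t]) for t, tf in ctr.items()}
        for c, ctr in counts.items()
    }


def tf_cr(counts):
    """TF-CR[c][w] = (|w_c| / N_c) * (|w_c| / |w|)."""
    total = global_counts(counts)
    weights = {}
    for c, ctr in counts.items():
        n_c = sum(ctr.values())  # token count of class c
        weights[c] = {w: (k / n_c) * (k / total[w]) for w, k in ctr.items()}
    return weights


# Hypothetical toy corpus: "the" occurs in both classes, "match" only in sport.
docs = [["goal", "match", "the"], ["match", "team", "the"],
        ["vote", "law", "the"], ["law", "senate", "the"]]
labels = ["sport", "sport", "politics", "politics"]

counts = class_term_counts(docs, labels)
ctfidf = c_tf_idf(counts)
tfcr = tf_cr(counts)
```

In this toy corpus both schemes score the class-exclusive term "match" above the shared term "the" within the sport class, even though the two terms have the same in-class frequency.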
2. Comparison to Traditional TF-IDF
Traditional TF-IDF does not use any form of label or cluster information. Its weighting is based strictly on global document frequency. In contrast:
- c-TF–IDF treats each cluster (or topic) as an aggregated “pseudo-document” and calculates importance based on frequency within a cluster versus its spread across clusters (Grootendorst, 2022).
- TF-CR and TF-IDFC-RF employ label information; e.g., TF-CR rewards terms that are both frequent in a class and largely exclusive to it, while penalizing those distributed broadly across classes.
- Class-based schemes can employ class-level or cluster-level analogues of inverse document frequency, leading to distinctions such as “class-based idf prime,” document-in-class counts, or ratio terms.
These changes yield feature spaces aligned to discriminative axes set by class or cluster boundaries, improving downstream model performance and interpretability.
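To make the contrast concrete, the sketch below compares label-blind document-level IDF with a cluster-level ranking on a hypothetical toy corpus. The class-level weight uses the $tf_{t,c} \cdot \log(1 + A / tf_t)$ form attributed to c-TF-IDF in this article; the corpus and variable names are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus: "match" appears in every sport document.
docs = [["match", "goal"], ["match", "team"], ["match", "coach"],
        ["match", "score"], ["senate", "vote"]]
labels = ["sport", "sport", "sport", "sport", "politics"]

# Document-level IDF ignores labels: a term found in most documents gets a low weight.
n_docs = len(docs)
doc_freq = Counter(t for tokens in docs for t in set(tokens))
idf = {t: math.log(n_docs / doc_freq[t]) for t in doc_freq}

# Class-level view: merge each class into one pseudo-document before weighting.
pseudo_docs = {}
for tokens, label in zip(docs, labels):
    pseudo_docs.setdefault(label, Counter()).update(tokens)

total = sum(pseudo_docs.values(), Counter())   # global term frequencies
avg_tokens = sum(total.values()) / len(pseudo_docs)
ctfidf = {
    c: {t: tf * math.log(1 + avg_tokens / total[t]) for t, tf in ctr.items()}
    for c, ctr in pseudo_docs.items()
}

lowest_idf_term = min(idf, key=idf.get)
top_sport_term = max(ctfidf["sport"], key=ctfidf["sport"].get)
```

Here "match" has the lowest document-level IDF because it occurs in four of the five documents, yet it is the top-ranked term for the sport pseudo-document, since all of its occurrences fall inside that class.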
3. Applications: Clustering, Classification, and Topic Modeling
Class-based TF-IDF procedures have been deployed across several text processing pipelines:
- Topic Modeling (c-TF–IDF): BERTopic clusters transformer-based document embeddings, then constructs topic representations via c-TF–IDF, yielding term rankings tailored per cluster. This diverges from centroid-based topic models by using cluster-level statistics, enhancing topic coherence and interpretability (Grootendorst, 2022).
- Supervised Text Classification: TF-CR and TF-IDFC-RF generate features highly responsive to class label distributions. For embedding-based text classifiers, TF-CR is applied by computing, for each class, a weighted sum of pre-trained word vectors (e.g., Word2Vec, GloVe), using TF-CR as coefficients, and concatenating the class-specific vectors to form the input representation (Zubiaga, 2020).
- Sentiment Analysis: In TF-IDFC-RF, term frequency in class, relative document coverage, and “absence in others” are combined to identify selective terms for binary sentiment (positive/negative) or affective classification (Carvalho et al., 2020).
These mechanisms are “plug-and-play” with popular vector space models, requiring no additional parameters or external resources.
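The embedding-based use of TF-CR described above can be sketched as follows: for each class, the document's word vectors are summed with that class's TF-CR weights as coefficients, and the per-class vectors are concatenated. This is a minimal pure-Python sketch under stated assumptions; `tfcr_document_vector`, the two-dimensional vectors, and the TF-CR values are hypothetical stand-ins for pre-trained embeddings and weights estimated from training data.

```python
def tfcr_document_vector(tokens, word_vectors, tfcr, classes, dim):
    """Concatenate, over classes, the TF-CR-weighted sum of the document's word vectors."""
    features = []
    for c in classes:
        acc = [0.0] * dim
        for tok in tokens:
            vec = word_vectors.get(tok)
            weight = tfcr[c].get(tok, 0.0)
            if vec is None or weight == 0.0:
                continue  # out-of-vocabulary token, or term never seen in class c
            for i, x in enumerate(vec):
                acc[i] += weight * x
        features.extend(acc)
    return features


# Hypothetical 2-d "embeddings" and TF-CR weights, for illustration only.
word_vectors = {"match": [1.0, 0.0], "law": [0.0, 1.0], "the": [0.5, 0.5]}
tfcr = {
    "sport": {"match": 0.5, "the": 0.1},
    "politics": {"law": 0.5, "the": 0.1},
}
classes = ["sport", "politics"]

doc_vector = tfcr_document_vector(["the", "match"], word_vectors, tfcr, classes, dim=2)
```

The resulting vector has `len(classes) * dim` entries, so its size grows linearly with the number of classes; a downstream classifier would take it as input in place of an unweighted average of word vectors.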
4. Experimental Findings and Quantitative Impact
Empirical comparisons confirm the effectiveness of class-based TF-IDF-type weighting:
TF-CR (macro-F₁, 90K training instances; “tw2v” embeddings):
| Dataset | No Weight | TF-IDF | KLD | TF-CR |
|---|---|---|---|---|
| 20NewsGroups | 0.705 | 0.893 | 0.860 | 0.930 |
| Hate Speech | 0.661 | 0.556 | 0.643 | 0.648 |
| Newsspace200 | 0.544 | 0.507 | 0.586 | 0.595 |
| ODPtweets | 0.325 | 0.354 | 0.362 | 0.458 |
TF-CR achieved the highest macro-F₁ in 7 out of 8 datasets at larger scales, with absolute gains over the next best strategy of up to 0.13. The advantage scales with dataset size, reflecting the improved exploitation of class-conditional statistics. TF-CR remains robust for datasets with as few as 5–10K instances (Zubiaga, 2020).
TF-IDFC-RF achieves the highest weighted F₁ among ten schemes on the “Polarity” and “Sarcasm” datasets, outperforming TF-IDF and IGM-based alternatives at most feature-set sizes, and it remains among the top two schemes across all tested classification datasets (Carvalho et al., 2020).
5. Methodological and Design Considerations
Practical aspects of implementing class-based TF-IDF schemes include:
- Smoothing and Normalization: c-TF–IDF (Grootendorst, 2022) adopts $\log(1 + A / tf_t)$ (average tokens per class $A$ over global term frequency $tf_t$) with $+1$ smoothing for numerical stability; L1 normalization is recommended for comparability across clusters.
- Scalability: All examined schemes are parameter-free and rely only on labeled data frequency counts. They are directly applicable to large and moderate datasets but may offer limited gains over unweighted embeddings at very small scales.
- Flexibility: These methods can operate over any vectorizer, including embedding summation, n-gram expansions, or VSMs.
- Noise Control: Hapax legomena and very rare terms can be pruned based on minimum frequency thresholds pre- or post-weighting.
No external semantic resources, extensive hyperparameter tuning, or smoothing constants (outside c-TF–IDF’s $+1$) are required. This reduces risk of overfitting and facilitates rapid feature extraction.
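Two of the practical steps listed above, pruning rare terms and L1-normalizing a weight vector, can be sketched as follows. The helper names (`prune_rare_terms`, `l1_normalize`), the `min_count` threshold of 2, and the toy counts are illustrative assumptions, not values prescribed by the cited papers.

```python
def prune_rare_terms(class_counts, min_count=2):
    """Drop terms whose total count across all classes is below min_count."""
    totals = {}
    for counts in class_counts.values():
        for term, k in counts.items():
            totals[term] = totals.get(term, 0) + k
    return {
        c: {t: k for t, k in counts.items() if totals[t] >= min_count}
        for c, counts in class_counts.items()
    }


def l1_normalize(weights):
    """Scale a {term: weight} mapping so the absolute weights sum to 1."""
    norm = sum(abs(v) for v in weights.values())
    if norm == 0:
        return dict(weights)  # nothing to scale
    return {t: v / norm for t, v in weights.items()}


# Hypothetical per-class term counts; "offside" and "filibuster" occur only once.
class_counts = {
    "sport": {"match": 3, "the": 4, "offside": 1},
    "politics": {"law": 2, "the": 4, "filibuster": 1},
}
pruned = prune_rare_terms(class_counts, min_count=2)

# Hypothetical class-level weights for one cluster.
normalized = l1_normalize({"match": 3.0, "the": 1.0})
```

Pruning is applied here before weighting; applying the same threshold after weighting is equally possible, as the list above notes.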
6. Theoretical Motivation and Interpretative Value
Class-based TF-IDF variants are motivated by the need to identify terms most discriminative for class or cluster prediction:
- Discriminativeness: These schemes assign maximal weight to terms both frequent in (cluster/class) $c$ and rare outside it, down-weighting ubiquitous or non-informative words, as seen in TF-CR’s multiplicative form $\frac{|w_c|}{N_c} \cdot \frac{|w_c|}{|w|}$, the in-class term frequency times the category ratio (Zubiaga, 2020).
- Interpretability: Feature vectors constructed using these weighting schemes align with human intuition regarding class-distinctive “buzzwords” or topic markers.
- Extension to Multi-class/Clustering Setups: Although some (e.g., TF-IDFC-RF) are derived from binary-class settings, their logic suggests potential for extension using entropy or variance-based generalizations to handle multi-class or clustering problems (Carvalho et al., 2020).
A plausible implication is that these mechanisms contribute to enhanced classification, topic coherence, and downstream model transparency by tightly coupling term statistics with class information.
7. Extensions and Open Directions
Recent work proposes:
- Dynamic Smoothing: In dynamic topic modeling, c-TF–IDF vectors may be averaged temporally to impose time-wise smoothness in topic representations (Grootendorst, 2022).
- Parameter Learning: Learning TF exponents or idf scaling factors via cross-validation could further boost discriminative capacity (Carvalho et al., 2020).
- Integration with Embeddings: Combining class-aware weighting with neural embedding architectures—optionally incorporating attention—could enable richer feature representations.
Generalizing the underlying principles to diverse text understanding tasks and exploring alternative aggregations (e.g., entropy-driven class exclusivity) remain active research areas.
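The temporal averaging mentioned under Dynamic Smoothing can be sketched as below. This is a simplified illustration that assumes each time step's vector is L1-normalized and then averaged with the normalized vector of the preceding step; the function names and toy vectors are hypothetical, and the sketch is not a reproduction of any library's code.

```python
def l1_normalize(vec):
    """Scale a list of weights so the absolute values sum to 1."""
    norm = sum(abs(x) for x in vec)
    return [x / norm for x in vec] if norm else list(vec)


def smooth_over_time(vectors_by_step):
    """Average each step's normalized vector with that of the previous step.

    vectors_by_step: time-ordered list of class-based weight vectors for one
    topic, all over the same vocabulary. The first step is left unsmoothed.
    """
    normalized = [l1_normalize(v) for v in vectors_by_step]
    smoothed = [normalized[0]]
    for prev, cur in zip(normalized, normalized[1:]):
        smoothed.append([(p + c) / 2 for p, c in zip(prev, cur)])
    return smoothed


# Hypothetical topic representation over a two-term vocabulary at two time steps.
smoothed = smooth_over_time([[2.0, 2.0], [4.0, 0.0]])
```

With these inputs the second step's representation keeps some weight on the second term (0.25) instead of dropping it to zero, which is the time-wise smoothness the averaging is meant to impose.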
References:
- (Grootendorst, 2022) BERTopic: Neural topic modeling with a class-based TF-IDF procedure
- (Zubiaga, 2020) Exploiting Class Labels to Boost Performance on Embedding-based Text Classification
- (Carvalho et al., 2020) TF-IDFC-RF: A Novel Supervised Term Weighting Scheme