Class-Based TF-IDF Procedure
- Class-based TF-IDF is a technique that extends traditional TF-IDF by computing term weights over document clusters to identify distinctive topic keywords.
- It forms a single ‘class-document’ per cluster, enabling enhanced keyword extraction and improved interpretation in models like BERTopic.
- Normalization and smoothing strategies in c-TF-IDF ensure robust and comparable term weights across varying cluster sizes.
A class-based TF-IDF procedure generalizes the classical TF–IDF (Term Frequency–Inverse Document Frequency) weighting scheme to the context of text clusters or classes. Rather than quantifying term distinctiveness at the individual-document level, class-based TF-IDF (c-TF-IDF) assigns weights to terms based on their ability to characterize entire clusters (“classes”) of documents. This methodology enables ranking of the most salient and distinctive words for each group of semantically related texts, with applications in neural topic modeling and beyond. c-TF-IDF features are central in modern topic modeling paradigms built atop transformer-based embeddings, such as BERTopic (Grootendorst, 2022).
1. Rationale for Class-Based TF–IDF
Standard TF–IDF produces a per-document, per-term weighting intended to highlight words that are both frequent in a specific document and rare in the corpus overall:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is the document frequency of term $t$. When documents have first been embedded and clustered, however, one seeks terms representative not of individual texts, but of entire clusters (interpreted as topics). Classical TF–IDF does not provide direct ranking of terms as "distinctive" for a cluster, as document-level rarity may not coincide with topic-level relevance.
Class-based TF–IDF addresses this by concatenating all documents in a cluster into a single "class-document" and adapting both term frequency and inverse document frequency to the cluster level. This enables extraction of words uniquely characteristic of each topic, facilitating coherent topic representation after embedding-based clustering as in BERTopic.
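As a concrete illustration of the class-document construction, the minimal Python sketch below pools toy documents by cluster label; the variable names and toy data are illustrative, not part of any particular library's API:

```python
from collections import defaultdict

# Toy documents and their cluster assignments (illustrative values).
documents = ["apple banana apple", "banana fruit banana",
             "dog cat dog", "cat animal cat"]
labels = [0, 0, 1, 1]  # cluster id per document

# Concatenate all documents sharing a label into one "class-document".
grouped = defaultdict(list)
for doc, label in zip(documents, labels):
    grouped[label].append(doc)
class_documents = {label: " ".join(docs) for label, docs in grouped.items()}

print(class_documents)
# {0: 'apple banana apple banana fruit banana', 1: 'dog cat dog cat animal cat'}
```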
2. Formalization and Mathematical Definitions
Let $N$ denote the number of original documents and $C$ the number of clusters found by a clustering algorithm (e.g. HDBSCAN). Let the global vocabulary be $V$, and let the clusters be $c_1, \dots, c_C$ with $D_c$ the set of documents assigned to cluster $c$.
Define:
- $|c|$: Total number of tokens in cluster $c$,
- $A = \frac{1}{C} \sum_{c} |c|$: Average cluster length,
For every $t \in V$ and $c \in \{c_1, \dots, c_C\}$:
- $\mathrm{tf}_{t,c}$: Raw count of term $t$ in cluster $c$,
- (optional) Normalized term frequency: $\widehat{\mathrm{tf}}_{t,c} = \mathrm{tf}_{t,c} / |c|$,
- $f_t = \sum_{c} \mathrm{tf}_{t,c}$: Term's total occurrence across all clusters.
The class-based inverse document frequency is defined as:

$$\mathrm{idf}_t = \log\!\left(1 + \frac{A}{f_t}\right),$$

where the $1$ inside the logarithm ensures positivity.
Finally, the class-based TF-IDF weight is:

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{f_t}\right),$$

or, using normalized term frequency,

$$W_{t,c} = \widehat{\mathrm{tf}}_{t,c} \cdot \log\!\left(1 + \frac{A}{f_t}\right).$$
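These definitions translate directly into a few lines of NumPy. The following is a minimal sketch, assuming a raw term-frequency matrix `tf` of shape (C, |V|) has already been built; the function name and signature are illustrative, not BERTopic's API:

```python
import numpy as np

def class_tfidf(tf: np.ndarray, normalize_tf: bool = False) -> np.ndarray:
    """Compute c-TF-IDF weights from a (C x |V|) raw count matrix.

    tf[c, t] is the raw count of term t in cluster c; the result has the
    same shape, with W[c, t] = tf_used[c, t] * log(1 + A / f_t).
    """
    cluster_sizes = tf.sum(axis=1, keepdims=True)   # |c|: tokens per cluster
    A = cluster_sizes.mean()                        # average cluster length
    f_t = tf.sum(axis=0, keepdims=True)             # total occurrences of each term
    idf = np.log(1.0 + A / np.maximum(f_t, 1))      # guard against all-zero columns
    tf_used = tf / np.maximum(cluster_sizes, 1) if normalize_tf else tf
    return tf_used * idf
```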
3. Algorithmic Workflow
A step-wise outline for generating class-based TF-IDF weights is as follows:
- Text preprocessing: Tokenization, stop-word removal, and possible n-gram extraction.
- Document embedding: Each document $d_i$ is mapped to an embedding $\mathbf{e}_i \in \mathbb{R}^{m}$ (commonly via a pre-trained transformer such as SBERT).
- Optional dimensionality reduction: Often UMAP is applied to reduce embedding dimensionality for clustering.
- Clustering: Reduced embeddings are clustered (e.g., using HDBSCAN), yielding assignments $d_i \mapsto c(d_i)$.
- Cluster construction: All documents with the same assignment are pooled into clusters $c_1, \dots, c_C$.
- Term frequency computation: For all $t \in V$ and $c \in \{c_1, \dots, c_C\}$, compute $\mathrm{tf}_{t,c}$.
- Average cluster length and total term frequency: Calculate $A$ and $f_t$.
- Cluster-level IDF computation: Calculate $\mathrm{idf}_t = \log(1 + A/f_t)$ for each term.
- Weight calculation: Compute $W_{t,c} = \mathrm{tf}_{t,c} \cdot \mathrm{idf}_t$.
- Topic representation: For each cluster $c$, select the top-$n$ terms (often $n = 10$) by $W_{t,c}$ for interpretability.
This process is computationally dominated by the embedding stage, with all subsequent c-TF-IDF calculations performed efficiently via term counting and array operations once clusters are defined.
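The counting and weighting steps of this workflow can be sketched compactly with scikit-learn's CountVectorizer. In the sketch below, the embedding, UMAP, and HDBSCAN stages are assumed to have already produced the cluster labels, and the function name is illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_terms_per_cluster(documents, labels, top_n=10):
    """Return the top_n terms with the highest c-TF-IDF weight per cluster."""
    clusters = sorted(set(labels))
    # Pool all documents sharing a label into one class-document per cluster.
    class_docs = [" ".join(d for d, l in zip(documents, labels) if l == c)
                  for c in clusters]
    vectorizer = CountVectorizer()
    tf = vectorizer.fit_transform(class_docs).toarray()     # shape (C, |V|)
    vocab = np.array(vectorizer.get_feature_names_out())
    # Average cluster length, total term frequency, cluster-level IDF, weights.
    A = tf.sum(axis=1).mean()
    f_t = tf.sum(axis=0)
    W = tf * np.log(1.0 + A / np.maximum(f_t, 1))
    # Select the highest-weighted terms for each cluster.
    return {c: list(vocab[np.argsort(W[i])[::-1][:top_n]])
            for i, c in enumerate(clusters)}

docs = ["apple banana apple", "banana fruit banana",
        "dog cat dog", "cat animal cat"]
print(top_terms_per_cluster(docs, labels=[0, 0, 1, 1], top_n=3))
# e.g. {0: ['banana', 'apple', 'fruit'], 1: ['cat', 'dog', 'animal']}
```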
4. Normalization, Smoothing, and Hyperparameters
Several considerations ensure robust and meaningful c-TF-IDF representations:
- IDF Smoothing: The $+1$ shift inside the logarithm, $\log(1 + A/f_t)$ rather than $\log(A/f_t)$, keeps the argument above one, ensuring all IDF values are defined and positive.
- TF Normalization: Optionally, dividing $\mathrm{tf}_{t,c}$ by the cluster length $|c|$ accommodates variation in cluster size, making term weights comparable across topics.
- n-gram Range: The vocabulary may include unigrams, bigrams, or both, tailored to the task.
- Dynamic-topic Smoothing: For temporal topic modeling, L1-normalize each time-slice vector and enforce gradual drift via the convex combination $\mathbf{w}^{(\tau)} \leftarrow \lambda\,\mathbf{w}^{(\tau)} + (1-\lambda)\,\mathbf{w}^{(\tau-1)}$ with $\lambda \in (0,1)$ to smooth topic dynamics.
- Top-n Terms per Topic: Common practice is to retain n=10–20 keywords per topic, though this is task-dependent.
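The normalization and smoothing options above can be sketched as follows; the convex-combination form and the default `lam` value are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

def l1_normalize(W: np.ndarray) -> np.ndarray:
    """L1-normalize each cluster's weight vector (rows of a C x |V| matrix)."""
    norms = np.abs(W).sum(axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)

def smooth_time_slices(W_slices, lam=0.5):
    """Smooth a sequence of per-time-slice c-TF-IDF matrices.

    Each slice is L1-normalized, then blended with the previous smoothed
    slice via lam * current + (1 - lam) * previous. The value of lam is an
    illustrative choice, not a prescribed default.
    """
    smoothed = [l1_normalize(W_slices[0])]
    for W in W_slices[1:]:
        current = l1_normalize(W)
        smoothed.append(lam * current + (1.0 - lam) * smoothed[-1])
    return smoothed
```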
5. Computational Profile and Scaling
- Embedding: The most computationally intensive phase is document embedding, scaling roughly as $O(N \cdot m)$ for $N$ documents and embedding dimension $m$. GPU acceleration is strongly advised.
- UMAP and clustering: Both steps scale roughly as $O(N \log N)$ in the number of documents.
- c-TF-IDF Calculation: Once cluster assignments and counts are in RAM, computing all term-cluster weights requires roughly $O(N + C \cdot |V|)$ time (one pass over the documents to count terms, then one pass over all term-cluster pairs).
- Memory Considerations: Major storage is taken by the document embeddings ($O(N \cdot m)$), the term-frequency array ($O(C \cdot |V|)$), and the cluster assignments ($O(N)$).
- Practicality: For large corpora and large vocabularies, the bottleneck is the embedding stage; the c-TF-IDF construction itself is lightweight.
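Since the term-frequency array is usually very sparse, keeping it in a compressed sparse format bounds its memory by the number of non-zero counts rather than by $C \cdot |V|$. A minimal sketch with toy class-documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One concatenated class-document per cluster (toy example).
class_docs = ["apple banana apple banana fruit banana",
              "dog cat dog cat animal cat"]

tf = CountVectorizer().fit_transform(class_docs)  # SciPy CSR sparse matrix of shape (C, |V|)
print(tf.shape, tf.nnz)  # memory scales with the number of stored non-zeros, not C * |V|
```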
6. Illustrative Example
A minimal scenario illustrates the use of c-TF-IDF:
- Documents: “apple banana apple”, “banana fruit banana”, “dog cat dog”, “cat animal cat”
- Clusters: $c_1 = \{d_1, d_2\}$ (fruit), $c_2 = \{d_3, d_4\}$ (animal)
- Vocabulary: $V = \{\text{apple}, \text{banana}, \text{fruit}, \text{dog}, \text{cat}, \text{animal}\}$
- Raw term frequencies: $\mathrm{tf}_{\text{apple},c_1} = 2$, $\mathrm{tf}_{\text{banana},c_1} = 3$, $\mathrm{tf}_{\text{fruit},c_1} = 1$; $\mathrm{tf}_{\text{dog},c_2} = 2$, $\mathrm{tf}_{\text{cat},c_2} = 3$, $\mathrm{tf}_{\text{animal},c_2} = 1$
- Token totals: $|c_1| = 6$, $|c_2| = 6$, $A = 6$
- Total term frequencies: e.g., $f_{\text{banana}} = 3$, $f_{\text{apple}} = 2$, $f_{\text{fruit}} = 1$
- IDF computation (example): $\mathrm{idf}_{\text{banana}} = \log(1 + 6/3) = \log 3 \approx 1.10$
- c-TF-IDF weights (for $c_1$): $W_{\text{banana},c_1} = 3 \cdot 1.10 \approx 3.30$, $W_{\text{apple},c_1} = 2 \cdot \log 4 \approx 2.77$, $W_{\text{fruit},c_1} = 1 \cdot \log 7 \approx 1.95$
- Interpretation: "banana" and "apple" emerge as the most distinctive terms for the fruit cluster $c_1$.
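The arithmetic above can be checked directly with a few lines (a standalone sketch; values match up to rounding):

```python
import math

A = 6                                            # average cluster length (|c1| = |c2| = 6)
tf_c1 = {"apple": 2, "banana": 3, "fruit": 1}    # raw counts in the fruit cluster c1
f_t   = {"apple": 2, "banana": 3, "fruit": 1}    # totals across both clusters

for term, tf in tf_c1.items():
    weight = tf * math.log(1 + A / f_t[term])
    print(term, round(weight, 2))
# apple 2.77, banana 3.3, fruit 1.95
```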
7. Comparison to Classical Document-Level TF–IDF
| Aspect | Classical TF–IDF | Class-based TF–IDF (c-TF-IDF) |
|---|---|---|
| Rarity Measured By | Documents | Clusters ("mega-documents") |
| IDF Formula | $\log\frac{N}{\mathrm{df}(t)}$ | $\log\left(1 + \frac{A}{f_t}\right)$ |
| Use-case | Information retrieval, per-document analysis | Topic modeling and keyword extraction for document clusters |
| Discriminative Focus | Uncommon per document | Uncommon per cluster/topic |
| Interpretability | Rare terms per document | Most distinctive topic or cluster-specific terms |
Classical TF-IDF’s document-oriented rarity may overweight words appearing in many members of a topic but spread thinly across documents. In contrast, c-TF-IDF aggregates at the cluster level, highlighting terms unique to each topic and down-weighting those broadly spread across clusters. Empirically, c-TF-IDF leads to higher topic coherence and interpretability, as the salient words of a given cluster correspond more closely to shared semantic content (Grootendorst, 2022).
In summary, class-based TF-IDF adapts fundamental weighting principles to settings where the atomic unit shifts from individual documents to clusters of semantically similar texts. The approach is computationally efficient after the initial embedding and clustering, and it is well-suited to modern pipelines where topically coherent groupings of text are essential.