Class-Based TF-IDF Procedure
- Class-based TF-IDF is a technique that extends traditional TF-IDF by computing term weights over document clusters to identify distinctive topic keywords.
- It forms a single ‘class-document’ per cluster, enabling enhanced keyword extraction and improved interpretation in models like BERTopic.
- Normalization and smoothing strategies in c-TF-IDF ensure robust and comparable term weights across varying cluster sizes.
A class-based TF-IDF procedure generalizes the classical TF–IDF (Term Frequency–Inverse Document Frequency) weighting scheme to the context of text clusters or classes. Rather than quantifying term distinctiveness at the individual-document level, class-based TF-IDF (c-TF-IDF) assigns weights to terms based on their ability to characterize entire clusters (“classes”) of documents. This methodology enables ranking of the most salient and distinctive words for each group of semantically related texts, with applications in neural topic modeling and beyond. c-TF-IDF features are central in modern topic modeling paradigms built atop transformer-based embeddings, such as BERTopic (Grootendorst, 2022).
1. Rationale for Class-Based TF–IDF
Standard TF–IDF produces a per-document, per-term weighting intended to highlight words that are both frequent in a specific document and rare in the corpus overall:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)},$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is the document frequency of term $t$. When documents have first been embedded and clustered, however, one seeks terms representative not of individual texts, but of entire clusters (interpreted as topics). Classical TF–IDF does not provide direct ranking of terms as "distinctive" for a cluster, as document-level rarity may not coincide with topic-level relevance.
Class-based TF–IDF addresses this by concatenating all documents in a cluster into a single "class-document" and adapting both term frequency and inverse document frequency to the cluster level. This enables extraction of words uniquely characteristic of each topic, facilitating coherent topic representation after embedding-based clustering as in BERTopic.
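As a concrete illustration of the class-document construction, the minimal Python sketch below pools toy documents by cluster label; the variable names and toy data are illustrative, not part of any particular library's API:

```python
from collections import defaultdict

# Toy documents and their cluster assignments (illustrative values).
documents = ["apple banana apple", "banana fruit banana",
             "dog cat dog", "cat animal cat"]
labels = [0, 0, 1, 1]  # cluster id per document

# Concatenate all documents sharing a label into one "class-document".
grouped = defaultdict(list)
for doc, label in zip(documents, labels):
    grouped[label].append(doc)
class_documents = {label: " ".join(docs) for label, docs in grouped.items()}

print(class_documents)
# {0: 'apple banana apple banana fruit banana', 1: 'dog cat dog cat animal cat'}
```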
2. Formalization and Mathematical Definitions
Let $N$ denote the number of original documents and $C$ the number of clusters found by a clustering algorithm (e.g. HDBSCAN). Let the global vocabulary be $V$, and let the clusters be $c_1, \dots, c_C$ with $D_c$ the set of documents assigned to cluster $c$.
Define:
- $|c|$: Total number of tokens in cluster $c$,
- $A = \frac{1}{C} \sum_{c} |c|$: Average cluster length,
For every $t \in V$ and $c \in \{c_1, \dots, c_C\}$:
- $\mathrm{tf}_{t,c}$: Raw count of term $t$ in cluster $c$,
- (optional) Normalized term frequency: $\widehat{\mathrm{tf}}_{t,c} = \mathrm{tf}_{t,c} / |c|$,
- $f_t = \sum_{c} \mathrm{tf}_{t,c}$: Term's total occurrence across all clusters.
The class-based inverse document frequency is defined as:

$$\mathrm{idf}_t = \log\!\left(1 + \frac{A}{f_t}\right),$$

where the $1$ inside the logarithm ensures positivity.
Finally, the class-based TF-IDF weight is:

$$W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{f_t}\right),$$

or, using normalized term frequency,

$$W_{t,c} = \widehat{\mathrm{tf}}_{t,c} \cdot \log\!\left(1 + \frac{A}{f_t}\right).$$
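These definitions translate directly into a few lines of NumPy. The following is a minimal sketch, assuming a raw term-frequency matrix `tf` of shape (C, |V|) has already been built; the function name and signature are illustrative, not BERTopic's API:

```python
import numpy as np

def class_tfidf(tf: np.ndarray, normalize_tf: bool = False) -> np.ndarray:
    """Compute c-TF-IDF weights from a (C x |V|) raw count matrix.

    tf[c, t] is the raw count of term t in cluster c; the result has the
    same shape, with W[c, t] = tf_used[c, t] * log(1 + A / f_t).
    """
    cluster_sizes = tf.sum(axis=1, keepdims=True)   # |c|: tokens per cluster
    A = cluster_sizes.mean()                        # average cluster length
    f_t = tf.sum(axis=0, keepdims=True)             # total occurrences of each term
    idf = np.log(1.0 + A / np.maximum(f_t, 1))      # guard against all-zero columns
    tf_used = tf / np.maximum(cluster_sizes, 1) if normalize_tf else tf
    return tf_used * idf
```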
3. Algorithmic Workflow
A step-wise outline for generating class-based TF-IDF weights is as follows:
- Text preprocessing: Tokenization, stop-word removal, and possible n-gram extraction.
- Document embedding: Each document $d_i$ is mapped to an embedding $\mathbf{e}_i \in \mathbb{R}^{m}$ (commonly via a pre-trained transformer such as SBERT).
- Optional dimensionality reduction: Often UMAP is applied to reduce embedding dimensionality for clustering.
- Clustering: Reduced embeddings are clustered (e.g., using HDBSCAN), yielding assignments $d_i \mapsto c(d_i)$.
- Cluster construction: All documents with the same assignment are pooled into clusters $c_1, \dots, c_C$.
- Term frequency computation: For all $t \in V$ and $c \in \{c_1, \dots, c_C\}$, compute $\mathrm{tf}_{t,c}$.
- Average cluster length and total term frequency: Calculate $A$ and $f_t$.
- Cluster-level IDF computation: Calculate $\mathrm{idf}_t = \log(1 + A/f_t)$ for each term.
- Weight calculation: Compute $W_{t,c} = \mathrm{tf}_{t,c} \cdot \mathrm{idf}_t$.
- Topic representation: For each cluster $c$, select the top-$n$ terms (often $n = 10$) by $W_{t,c}$ for interpretability.
This process is computationally dominated by the embedding stage, with all subsequent c-TF-IDF calculations performed efficiently via term counting and array operations once clusters are defined.
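The counting and weighting steps of this workflow can be sketched compactly with scikit-learn's CountVectorizer. In the sketch below, the embedding, UMAP, and HDBSCAN stages are assumed to have already produced the cluster labels, and the function name is illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_terms_per_cluster(documents, labels, top_n=10):
    """Return the top_n terms with the highest c-TF-IDF weight per cluster."""
    clusters = sorted(set(labels))
    # Pool all documents sharing a label into one class-document per cluster.
    class_docs = [" ".join(d for d, l in zip(documents, labels) if l == c)
                  for c in clusters]
    vectorizer = CountVectorizer()
    tf = vectorizer.fit_transform(class_docs).toarray()     # shape (C, |V|)
    vocab = np.array(vectorizer.get_feature_names_out())
    # Average cluster length, total term frequency, cluster-level IDF, weights.
    A = tf.sum(axis=1).mean()
    f_t = tf.sum(axis=0)
    W = tf * np.log(1.0 + A / np.maximum(f_t, 1))
    # Select the highest-weighted terms for each cluster.
    return {c: list(vocab[np.argsort(W[i])[::-1][:top_n]])
            for i, c in enumerate(clusters)}

docs = ["apple banana apple", "banana fruit banana",
        "dog cat dog", "cat animal cat"]
print(top_terms_per_cluster(docs, labels=[0, 0, 1, 1], top_n=3))
# e.g. {0: ['banana', 'apple', 'fruit'], 1: ['cat', 'dog', 'animal']}
```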
4. Normalization, Smoothing, and Hyperparameters
Several considerations ensure robust and meaningful c-TF-IDF representations:
- IDF Smoothing: The $+1$ shift inside the logarithm, $\log(1 + A/f_t)$ rather than $\log(A/f_t)$, keeps the argument above one, ensuring all IDF values are defined and positive.
- TF Normalization: Optionally, dividing $\mathrm{tf}_{t,c}$ by the cluster length $|c|$ accommodates variation in cluster size, making term weights comparable across topics.
- n-gram Range: The vocabulary may include unigrams, bigrams, or both, tailored to the task.
- Dynamic-topic Smoothing: For temporal topic modeling, L1-normalize each time-slice vector and enforce gradual drift via the convex combination $\mathbf{w}^{(\tau)} \leftarrow \lambda\,\mathbf{w}^{(\tau)} + (1-\lambda)\,\mathbf{w}^{(\tau-1)}$ with $\lambda \in (0,1)$ to smooth topic dynamics.
- Top-n Terms per Topic: Common practice is to retain n=10–20 keywords per topic, though this is task-dependent.
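The normalization and smoothing options above can be sketched as follows; the convex-combination form and the default `lam` value are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

def l1_normalize(W: np.ndarray) -> np.ndarray:
    """L1-normalize each cluster's weight vector (rows of a C x |V| matrix)."""
    norms = np.abs(W).sum(axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)

def smooth_time_slices(W_slices, lam=0.5):
    """Smooth a sequence of per-time-slice c-TF-IDF matrices.

    Each slice is L1-normalized, then blended with the previous smoothed
    slice via lam * current + (1 - lam) * previous. The value of lam is an
    illustrative choice, not a prescribed default.
    """
    smoothed = [l1_normalize(W_slices[0])]
    for W in W_slices[1:]:
        current = l1_normalize(W)
        smoothed.append(lam * current + (1.0 - lam) * smoothed[-1])
    return smoothed
```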
5. Computational Profile and Scaling
- Embedding: The most computationally intensive phase is document embedding, scaling roughly as $O(N \cdot m)$ for $N$ documents and embedding dimension $m$. GPU acceleration is strongly advised.
- UMAP and clustering: Both steps scale roughly as $O(N \log N)$ in the number of documents.
- c-TF-IDF Calculation: Once cluster assignments and counts are in RAM, computing all term-cluster weights requires roughly $O(N + C \cdot |V|)$ time (one pass over the documents to count terms, then one pass over all term-cluster pairs).
- Memory Considerations: Major storage is taken by the document embeddings ($O(N \cdot m)$), the term-frequency array ($O(C \cdot |V|)$), and the cluster assignments ($O(N)$).
- Practicality: For large corpora and large vocabularies, the bottleneck is the embedding stage; the c-TF-IDF construction itself is lightweight.
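Since the term-frequency array is usually very sparse, keeping it in a compressed sparse format bounds its memory by the number of non-zero counts rather than by $C \cdot |V|$. A minimal sketch with toy class-documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One concatenated class-document per cluster (toy example).
class_docs = ["apple banana apple banana fruit banana",
              "dog cat dog cat animal cat"]

tf = CountVectorizer().fit_transform(class_docs)  # SciPy CSR sparse matrix of shape (C, |V|)
print(tf.shape, tf.nnz)  # memory scales with the number of stored non-zeros, not C * |V|
```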
6. Illustrative Example
A minimal scenario illustrates the use of c-TF-IDF:
- Documents: “apple banana apple”, “banana fruit banana”, “dog cat dog”, “cat animal cat”
- Clusters: $c_1 = \{d_1, d_2\}$ (fruit), $c_2 = \{d_3, d_4\}$ (animal)
- Vocabulary: $V = \{\text{apple}, \text{banana}, \text{fruit}, \text{dog}, \text{cat}, \text{animal}\}$
- Raw term frequencies: $\mathrm{tf}_{\text{apple},c_1} = 2$, $\mathrm{tf}_{\text{banana},c_1} = 3$, $\mathrm{tf}_{\text{fruit},c_1} = 1$; $\mathrm{tf}_{\text{dog},c_2} = 2$, $\mathrm{tf}_{\text{cat},c_2} = 3$, $\mathrm{tf}_{\text{animal},c_2} = 1$
- Token totals: $|c_1| = 6$, $|c_2| = 6$, $A = 6$
- Total term frequencies: e.g., $f_{\text{banana}} = 3$, $f_{\text{apple}} = 2$, $f_{\text{fruit}} = 1$
- IDF computation (example): $\mathrm{idf}_{\text{banana}} = \log(1 + 6/3) = \log 3 \approx 1.10$
- c-TF-IDF weights (for $c_1$): $W_{\text{banana},c_1} = 3 \cdot 1.10 \approx 3.30$, $W_{\text{apple},c_1} = 2 \cdot \log 4 \approx 2.77$, $W_{\text{fruit},c_1} = 1 \cdot \log 7 \approx 1.95$
- Interpretation: "banana" and "apple" emerge as the most distinctive terms for the fruit cluster $c_1$.
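The arithmetic above can be checked directly with a few lines (a standalone sketch; values match up to rounding):

```python
import math

A = 6                                            # average cluster length (|c1| = |c2| = 6)
tf_c1 = {"apple": 2, "banana": 3, "fruit": 1}    # raw counts in the fruit cluster c1
f_t   = {"apple": 2, "banana": 3, "fruit": 1}    # totals across both clusters

for term, tf in tf_c1.items():
    weight = tf * math.log(1 + A / f_t[term])
    print(term, round(weight, 2))
# apple 2.77, banana 3.3, fruit 1.95
```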
7. Comparison to Classical Document-Level TF–IDF
| Aspect | Classical TF–IDF | Class-based TF–IDF (c-TF-IDF) |
|---|---|---|
| Rarity Measured By | Documents | Clusters ("mega-documents") |
| IDF Formula | $\log\frac{N}{\mathrm{df}(t)}$ | $\log\left(1 + \frac{A}{f_t}\right)$ |
| Use-case | Information retrieval, per-document analysis | Topic modeling and keyword extraction for document clusters |
| Discriminative Focus | Uncommon per document | Uncommon per cluster/topic |
| Interpretability | Rare terms per document | Most distinctive topic or cluster-specific terms |
Classical TF-IDF’s document-oriented rarity may overweight words appearing in many members of a topic but spread thinly across documents. In contrast, c-TF-IDF aggregates at the cluster level, highlighting terms unique to each topic and down-weighting those broadly spread across clusters. Empirically, c-TF-IDF leads to higher topic coherence and interpretability, as the salient words of a given cluster correspond more closely to shared semantic content (Grootendorst, 2022).
In summary, class-based TF-IDF adapts fundamental weighting principles to settings where the atomic unit shifts from individual documents to clusters of semantically similar texts. The approach is computationally efficient after the initial embedding and clustering, and it is well-suited to modern pipelines where topically coherent groupings of text are essential.