Class-Based TF-IDF Procedure

Updated 13 November 2025
  • Class-based TF-IDF is a technique that extends traditional TF-IDF by computing term weights over document clusters to identify distinctive topic keywords.
  • It forms a single ‘class-document’ per cluster, enabling enhanced keyword extraction and improved interpretation in models like BERTopic.
  • Normalization and smoothing strategies in c-TF-IDF ensure robust and comparable term weights across varying cluster sizes.

A class-based TF-IDF procedure generalizes the classical TF–IDF (Term Frequency–Inverse Document Frequency) weighting scheme to the context of text clusters or classes. Rather than quantifying term distinctiveness at the individual-document level, class-based TF-IDF (c-TF-IDF) assigns weights to terms based on their ability to characterize entire clusters (“classes”) of documents. This methodology enables ranking of the most salient and distinctive words for each group of semantically related texts, with applications in neural topic modeling and beyond. c-TF-IDF features are central in modern topic modeling paradigms built atop transformer-based embeddings, such as BERTopic (Grootendorst, 2022).

1. Rationale for Class-Based TF–IDF

Standard TF–IDF produces a per-document, per-term weighting intended to highlight words that are both frequent in a specific document and rare in the corpus overall:

$$\text{TF-IDF}_{t,d} = \text{tf}_{t,d} \cdot \log\left(\frac{N}{\text{df}_t}\right)$$

where $N$ is the number of documents and $\text{df}_t$ is the document frequency of term $t$. When documents have first been embedded and clustered, however, one seeks terms representative not of individual texts but of entire clusters (interpreted as topics). Classical TF–IDF does not directly rank terms by how distinctive they are for a cluster, since document-level rarity need not coincide with topic-level relevance.
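
As a quick numerical illustration with hypothetical counts: a term occurring 3 times in a document and appearing in 10 of $N = 1000$ documents receives (using natural logarithms)

$$\text{TF-IDF}_{t,d} = 3 \cdot \log\left(\frac{1000}{10}\right) \approx 3 \cdot 4.605 \approx 13.8,$$

whereas the same in-document count for a term appearing in 500 documents yields only $3 \cdot \log 2 \approx 2.08$.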

Class-based TF–IDF addresses this by concatenating all documents in a cluster into a single "class-document" and adapting both term frequency and inverse document frequency to the cluster level. This enables extraction of words uniquely characteristic of each topic, facilitating coherent topic representation after embedding-based clustering as in BERTopic.

2. Formalization and Mathematical Definitions

Let $N$ denote the number of original documents and $K$ the number of clusters found by a clustering algorithm (e.g., HDBSCAN). Let the global vocabulary be $V$, and let the clusters be $C_1, \ldots, C_K$, with $C_k$ the set of documents assigned to cluster $k$.

Define:

  • $N_k$: total number of tokens in cluster $k$, $N_k = \sum_{d \in C_k} |d|$
  • $A$: average cluster length, $A = (1/K)\sum_{k=1}^K N_k$

For every $t \in V$ and $k \in \{1, \ldots, K\}$:

  • $\text{tf}_{k,t}$: raw count of term $t$ in cluster $k$, $\text{tf}_{k,t} = \sum_{d \in C_k} \text{count}(t, d)$
  • (Optional) normalized term frequency: $\overline{\text{tf}}_{k,t} = \text{tf}_{k,t} / N_k$
  • $\text{TF}_t = \sum_{k=1}^K \text{tf}_{k,t}$: the term's total occurrence across all clusters

The class-based inverse document frequency is defined as:

$$\text{idf}_t = \log\left(1 + \frac{A}{\text{TF}_t}\right)$$

where the $+1$ ensures positivity.

Finally, the class-based TF-IDF weight is:

$$\text{c-TF-IDF}_{k,t} = \text{tf}_{k,t} \cdot \text{idf}_t$$

or, using normalized term frequency,

$$\text{c-TF-IDF}_{k,t} = \left(\frac{\text{tf}_{k,t}}{N_k}\right) \cdot \text{idf}_t$$
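
These formulas translate directly into a few array operations. The sketch below is a minimal illustration under the definitions above, not the BERTopic implementation; the names `class_tf_idf`, `counts`, and `normalize` are chosen here for exposition, and it assumes every vocabulary term occurs at least once, so column sums are positive.

```python
import numpy as np

def class_tf_idf(counts: np.ndarray, normalize: bool = False) -> np.ndarray:
    """Compute c-TF-IDF weights from a (K x |V|) matrix of raw term counts.

    counts[k, t] is tf_{k,t}, the raw count of term t in cluster k.
    Assumes each term occurs at least once somewhere (positive column sums).
    """
    N_k = counts.sum(axis=1, keepdims=True)   # tokens per cluster, shape (K, 1)
    A = N_k.mean()                            # average cluster length
    TF_t = counts.sum(axis=0, keepdims=True)  # total count of each term, shape (1, |V|)
    idf = np.log(1.0 + A / TF_t)              # class-based IDF, strictly positive
    tf = counts / N_k if normalize else counts
    return tf * idf                           # c-TF-IDF_{k,t}, shape (K, |V|)
```

Passing `normalize=True` yields the normalized variant, which makes weights comparable across clusters of different sizes.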

3. Algorithmic Workflow

A step-wise outline for generating class-based TF-IDF weights is as follows:

  1. Text preprocessing: Tokenization, stop-word removal, and possible n-gram extraction.
  2. Document embedding: Each document $d_i$ is mapped to $x_i \in \mathbb{R}^d$ (commonly via a pre-trained transformer such as SBERT).
  3. Optional dimensionality reduction: Often UMAP is applied to reduce embedding dimensionality for clustering.
  4. Clustering: Reduced embeddings are clustered (e.g., using HDBSCAN), yielding assignments $c_i \in \{1, \ldots, K\}$.
  5. Cluster construction: All documents with the same assignment are pooled into clusters $C_1, \ldots, C_K$.
  6. Term frequency computation: For all $k$ and $t$, compute $\text{tf}_{k,t}$.
  7. Average cluster length and total term frequency: Calculate $A$ and $\text{TF}_t$.
  8. Cluster-level IDF computation: Calculate $\text{idf}_t$ for each term.
  9. Weight calculation: Compute $\text{c-TF-IDF}_{k,t}$.
  10. Topic representation: For each $k$, select the top-$n$ terms (often $n = 10$–$20$) by $\text{c-TF-IDF}_{k,t}$ for interpretability.

This process is computationally dominated by the embedding stage, with all subsequent c-TF-IDF calculations performed efficiently via term counting and array operations once clusters are defined.
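
As a concrete sketch of this workflow (not a prescription, and not BERTopic's internal code), the fragment below chains widely used open-source components: sentence-transformers for embedding, umap-learn and hdbscan for steps 3–4, and scikit-learn's CountVectorizer for counting. The model name `all-MiniLM-L6-v2` and hyperparameters such as `n_neighbors=15` and `min_cluster_size=10` are illustrative choices, and `docs` stands in for your corpus.

```python
# pip install sentence-transformers umap-learn hdbscan scikit-learn
import numpy as np
import umap
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["..."]  # placeholder: list of raw documents

# Steps 1-2: embed documents with a pre-trained sentence transformer.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Step 3: reduce dimensionality before density-based clustering.
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)

# Step 4: cluster; HDBSCAN labels outliers as -1, which are dropped here.
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)

# Step 5: pool each cluster into a single "class-document".
clusters = sorted(set(labels) - {-1})
class_docs = [" ".join(d for d, l in zip(docs, labels) if l == k) for k in clusters]

# Steps 6-9: count terms per class-document and apply the c-TF-IDF weighting.
vectorizer = CountVectorizer(stop_words="english")
tf = vectorizer.fit_transform(class_docs).toarray()   # shape (K, |V|)
A = tf.sum(axis=1).mean()                             # average cluster length
idf = np.log(1.0 + A / tf.sum(axis=0))                # class-based IDF
weights = tf * idf

# Step 10: top-10 keywords per topic.
vocab = vectorizer.get_feature_names_out()
for k, row in zip(clusters, weights):
    print(k, [vocab[i] for i in row.argsort()[::-1][:10]])
```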

4. Normalization, Smoothing, and Hyperparameters

Several considerations ensure robust and meaningful c-TF-IDF representations:

  • IDF Smoothing: The shift inside the logarithm ($\text{idf}_t = \log(1 + A/\text{TF}_t)$) keeps the argument above 1, so all IDF values are defined and strictly positive.
  • TF Normalization: Optionally dividing by $N_k$ accommodates variation in cluster size, making term weights comparable across topics.
  • n-gram Range: The vocabulary $V$ may include unigrams, bigrams, or both, tailored to the task.
  • Dynamic-topic Smoothing: For temporal topic modeling, L1-normalize each time-slice vector $w^{(t)}$ and smooth drift across adjacent slices via $w^{(t)} \leftarrow \alpha\, w^{(t)} + (1-\alpha)\, w^{(t-1)}$ with $\alpha \in [0, 1]$ (typically $\alpha = 0.5$); a sketch follows this list.
  • Top-n Terms per Topic: Common practice is to retain $n = 10$–$20$ keywords per topic, though this is task-dependent.
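
The dynamic-topic smoothing above is an exponential moving average over L1-normalized per-slice topic vectors. A minimal sketch under that reading (function and variable names are hypothetical; $\alpha = 0.5$ as a typical choice):

```python
import numpy as np

def smooth_topic_over_time(weights_per_slice, alpha: float = 0.5):
    """Smooth one topic's c-TF-IDF vectors across consecutive time slices.

    weights_per_slice: list of 1-D arrays (one c-TF-IDF vector per slice,
    all over the same vocabulary). Each vector is L1-normalized, then
    blended with the smoothed vector of the previous slice.
    """
    smoothed, prev = [], None
    for w in weights_per_slice:
        w = np.asarray(w, dtype=float)
        w = w / w.sum()                           # L1-normalize the slice
        if prev is not None:
            w = alpha * w + (1.0 - alpha) * prev  # blend with previous slice
        smoothed.append(w)
        prev = w
    return smoothed
```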

5. Computational Profile and Scaling

  • Embedding: The most computationally intensive phase is document embedding, scaling as $O(N \cdot d \cdot \text{cost}_{\text{model}})$ for $N$ documents and embedding dimension $d$; GPU acceleration is strongly advised.
  • UMAP and clustering: Both steps scale roughly as $O(N \log N)$.
  • c-TF-IDF Calculation: Once cluster assignments and counts are in memory, computing all term-cluster weights requires $O(\sum_k |C_k| + K|V|)$ time (one pass over documents plus one over the $K \times |V|$ weight array).
  • Memory Considerations: The dominant storage costs are the document embeddings ($N \times d$), the term-frequency array ($K \times |V|$), and the cluster assignments ($N$).
  • Practicality: For corpora of $10^4$–$10^5$ documents and large vocabularies, the bottleneck is embedding; the c-TF-IDF construction itself is lightweight.

6. Illustrative Example

A minimal scenario illustrates the use of c-TF-IDF:

  • Documents: $d_1 =$ “apple banana apple”, $d_2 =$ “banana fruit banana”, $d_3 =$ “dog cat dog”, $d_4 =$ “cat animal cat”
  • Clusters: $C_1 = \{d_1, d_2\}$ (fruit), $C_2 = \{d_3, d_4\}$ (animal)
  • Vocabulary: $V = \{\text{apple}, \text{banana}, \text{fruit}, \text{dog}, \text{cat}, \text{animal}\}$
  • Raw term frequencies:
    • $\text{tf}_{1,\text{apple}} = 2$
    • $\text{tf}_{1,\text{banana}} = 3$
    • $\text{tf}_{1,\text{fruit}} = 1$
    • $\text{tf}_{2,\text{dog}} = 2$
    • $\text{tf}_{2,\text{cat}} = 3$
    • $\text{tf}_{2,\text{animal}} = 1$
  • Token totals: $N_1 = 6$, $N_2 = 6$, $A = 6$
  • Total term frequencies: e.g., $\text{TF}_{\text{apple}} = 2$, $\text{TF}_{\text{banana}} = 3$, $\text{TF}_{\text{fruit}} = 1$
  • IDF computation (example, natural logarithm):
    • $\text{idf}_{\text{apple}} = \log(1 + 6/2) = \log 4 \approx 1.386$
  • c-TF-IDF weights (for $C_1$):
    • $w_{1,\text{banana}} = 3 \cdot 1.099 \approx 3.296$
    • $w_{1,\text{apple}} = 2 \cdot 1.386 \approx 2.773$
    • $w_{1,\text{fruit}} = 1 \cdot 1.946 \approx 1.946$
  • Interpretation: “banana” and “apple” emerge as the most distinctive terms for topic 1 (the short script after this list reproduces these numbers).
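
The arithmetic in this example can be checked with a few lines of Python (using natural logarithms, as above):

```python
import math

# Raw term counts per cluster from the example above.
tf = {
    1: {"apple": 2, "banana": 3, "fruit": 1},
    2: {"dog": 2, "cat": 3, "animal": 1},
}

N = {k: sum(c.values()) for k, c in tf.items()}   # N_1 = N_2 = 6
A = sum(N.values()) / len(N)                      # A = 6
TF = {}
for counts in tf.values():
    for t, c in counts.items():
        TF[t] = TF.get(t, 0) + c                  # total count per term

for k, counts in tf.items():
    weights = {t: c * math.log(1 + A / TF[t]) for t, c in counts.items()}
    print(k, sorted(weights.items(), key=lambda kv: -kv[1]))
# Cluster 1: banana ≈ 3.296, apple ≈ 2.773, fruit ≈ 1.946
```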

7. Comparison to Classical Document-Level TF–IDF

| Aspect | Classical TF–IDF | Class-based TF–IDF (c-TF-IDF) |
| --- | --- | --- |
| Rarity measured over | Documents $d$ | Clusters $k$ (“mega-documents”) |
| IDF formula | $\log(N/\text{df}_t)$ | $\log(1 + A/\text{TF}_t)$ |
| Use case | Information retrieval, per-document analysis | Topic modeling and keyword extraction for document clusters |
| Discriminative focus | Uncommon per document | Uncommon per cluster/topic |
| Interpretability | Rare terms per document | Most distinctive topic- or cluster-specific terms |

Classical TF-IDF’s document-level rarity can penalize terms that occur in many documents of the same topic: their high document frequency lowers the IDF even though they are precisely the words that characterize the topic. In contrast, c-TF-IDF aggregates at the cluster level, highlighting terms unique to each topic and down-weighting those spread broadly across clusters. Empirically, c-TF-IDF leads to higher topic coherence and interpretability, as the salient words of a given cluster correspond more closely to shared semantic content (Grootendorst, 2022).

In summary, class-based TF-IDF adapts fundamental weighting principles to settings where the atomic unit shifts from individual documents to clusters of semantically similar texts. The approach is computationally efficient after the initial embedding and clustering, and it is well-suited to modern pipelines where topically coherent groupings of text are essential.

References

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794.