
Class-Based TF-IDF Procedure

Updated 13 November 2025
  • Class-based TF-IDF is a technique that extends traditional TF-IDF by computing term weights over document clusters to identify distinctive topic keywords.
  • It forms a single ‘class-document’ per cluster, enabling enhanced keyword extraction and improved interpretation in models like BERTopic.
  • Normalization and smoothing strategies in c-TF-IDF ensure robust and comparable term weights across varying cluster sizes.

A class-based TF-IDF procedure generalizes the classical TF-IDF (Term Frequency–Inverse Document Frequency) weighting scheme to the context of text clusters or classes. Rather than quantifying term distinctiveness at the individual-document level, class-based TF-IDF (c-TF-IDF) assigns weights to terms based on their ability to characterize entire clusters (“classes”) of documents. This methodology enables ranking of the most salient and distinctive words for each group of semantically related texts, with applications in neural topic modeling and beyond. c-TF-IDF features are central in modern topic modeling paradigms built atop transformer-based embeddings, such as BERTopic (Grootendorst, 2022).

1. Rationale for Class-Based TF–IDF

Standard TF–IDF produces a per-document, per-term weighting intended to highlight words that are both frequent in a specific document and rare in the corpus overall:

$$\text{TF-IDF}_{t,d} = \text{tf}_{t,d} \cdot \log\left(\frac{N}{\text{df}_t}\right)$$

where $N$ is the number of documents and $\text{df}_t$ is the document frequency of term $t$. When documents have first been embedded and clustered, however, one seeks terms representative not of individual texts, but of entire clusters (interpreted as topics). Classical TF–IDF does not provide direct ranking of terms as "distinctive" for a cluster, as document-level rarity may not coincide with topic-level relevance.
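As a concrete illustration, the classical per-document weighting can be computed with a few lines of standard-library Python; the `tfidf` helper and the toy corpus below are illustrative, not from any particular library:

```python
import math
from collections import Counter

def tfidf(docs):
    """Classical per-document TF-IDF: tf_{t,d} * log(N / df_t)."""
    N = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: the number of documents containing each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    return [{t: n * math.log(N / df[t]) for t, n in Counter(tokens).items()}
            for tokens in tokenized]

weights = tfidf(["apple banana apple", "banana fruit banana"])
# "apple" occurs twice in doc 0 and nowhere else: 2 * log(2/1)
print(f"{weights[0]['apple']:.2f}")  # 1.39
# "banana" appears in every document, so log(N / df) = log(1) = 0
print(weights[0]["banana"])          # 0.0
```

Note how "banana", though frequent, receives zero weight once it appears in every document, which is exactly the document-level rarity behavior the cluster-level variant revisits.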

Class-based TF–IDF addresses this by concatenating all documents in a cluster into a single "class-document" and adapting both term frequency and inverse document frequency to the cluster level. This enables extraction of words uniquely characteristic of each topic, facilitating coherent topic representation after embedding-based clustering as in BERTopic.
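The pooling step described here can be sketched in plain Python; `build_class_documents` is a hypothetical helper name:

```python
from collections import defaultdict

def build_class_documents(docs, assignments):
    """Pool all documents sharing a cluster label into one 'class-document'."""
    grouped = defaultdict(list)
    for doc, label in zip(docs, assignments):
        grouped[label].append(doc)
    return {label: " ".join(members) for label, members in grouped.items()}

class_docs = build_class_documents(
    ["apple banana apple", "banana fruit banana", "dog cat dog"], [0, 0, 1])
print(class_docs)  # {0: 'apple banana apple banana fruit banana', 1: 'dog cat dog'}
```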

2. Formalization and Mathematical Definitions

Let $N$ denote the number of original documents and $K$ the number of clusters found by a clustering algorithm (e.g. HDBSCAN). Let the global vocabulary be $V$, and let the clusters be $C_1, \ldots, C_K$, with $C_k$ the set of documents assigned to cluster $k$.

Define:

  • $n_k$: Total number of tokens in cluster $C_k$, $n_k = \sum_{t \in V} \text{tf}_{t,k}$
  • $A$: Average cluster length, $A = \frac{1}{K} \sum_{k=1}^{K} n_k$

For every term $t \in V$ and cluster $k \in \{1, \ldots, K\}$:

  • $\text{tf}_{t,k}$: Raw count of term $t$ in cluster $C_k$
  • (optional) Normalized term frequency: $\widehat{\text{tf}}_{t,k} = \text{tf}_{t,k} / n_k$
  • $f_t$: Term's total occurrence across all clusters, $f_t = \sum_{k=1}^{K} \text{tf}_{t,k}$

The class-based inverse document frequency is defined as:

$$\text{idf}_t = \log\left(1 + \frac{A}{f_t}\right)$$

where the $+1$ inside the logarithm ensures positivity.

Finally, the class-based TF-IDF weight is:

$$W_{t,k} = \text{tf}_{t,k} \cdot \log\left(1 + \frac{A}{f_t}\right)$$

or, using normalized term frequency,

$$W_{t,k} = \widehat{\text{tf}}_{t,k} \cdot \log\left(1 + \frac{A}{f_t}\right)$$
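The definitions above translate directly into code. The following standard-library sketch (the `ctfidf` helper name is illustrative) computes $W_{t,k}$ from whitespace-tokenized class-documents:

```python
import math
from collections import Counter

def ctfidf(class_docs):
    """Compute W_{t,k} = tf_{t,k} * log(1 + A / f_t) per class-document."""
    tf = {k: Counter(doc.split()) for k, doc in class_docs.items()}
    # A: average number of tokens per cluster.
    A = sum(sum(counts.values()) for counts in tf.values()) / len(tf)
    # f_t: total occurrence of each term across all clusters.
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    return {k: {t: n * math.log(1 + A / f[t]) for t, n in counts.items()}
            for k, counts in tf.items()}

W = ctfidf({1: "apple banana apple banana fruit banana",
            2: "dog cat dog cat animal cat"})
print(f"{W[1]['banana']:.2f}")  # 3 * log(1 + 6/3) = 3.30
```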

3. Algorithmic Workflow

A step-wise outline for generating class-based TF-IDF weights is as follows:

  1. Text preprocessing: Tokenization, stop-word removal, and possible n-gram extraction.
  2. Document embedding: Each document $d_i$ is mapped to an embedding $\mathbf{e}_i \in \mathbb{R}^D$ (commonly via a pre-trained transformer such as SBERT).
  3. Optional dimensionality reduction: Often UMAP is applied to reduce embedding dimensionality for clustering.
  4. Clustering: Reduced embeddings are clustered (e.g., using HDBSCAN), yielding assignments $c_i \in \{1, \ldots, K\}$.
  5. Cluster construction: All documents with the same assignment are pooled into clusters $C_1, \ldots, C_K$.
  6. Term frequency computation: For all terms $t \in V$ and clusters $k$, compute $\text{tf}_{t,k}$.
  7. Average cluster length and total term frequency: Calculate $A$ and $f_t$.
  8. Cluster-level IDF computation: Calculate $\text{idf}_t = \log(1 + A/f_t)$ for each term.
  9. Weight calculation: Compute $W_{t,k} = \text{tf}_{t,k} \cdot \text{idf}_t$.
  10. Topic representation: For each cluster $k$, select the top-$n$ terms (often $n = 10$) by $W_{t,k}$ for interpretability.

This process is computationally dominated by the embedding stage, with all subsequent c-TF-IDF calculations performed efficiently via term counting and array operations once clusters are defined.
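Steps 5 through 10 can be sketched end-to-end once cluster assignments exist; the embedding and clustering stages (steps 2–4) are assumed to have produced the `assignments` list, and `topic_keywords` is a hypothetical helper name:

```python
import math
from collections import Counter, defaultdict

def topic_keywords(docs, assignments, top_n=3):
    """Steps 5-10: pool clusters, compute c-TF-IDF, rank top-n terms.

    Assumes documents were already embedded, reduced, and clustered
    (steps 2-4), so `assignments` holds one cluster label per document.
    """
    pooled = defaultdict(list)
    for doc, k in zip(docs, assignments):
        pooled[k].extend(doc.split())          # step 5: cluster construction
    tf = {k: Counter(tokens) for k, tokens in pooled.items()}  # step 6
    A = sum(len(tokens) for tokens in pooled.values()) / len(pooled)  # step 7
    f = Counter()
    for counts in tf.values():
        f.update(counts)                       # step 7: total term frequency
    return {                                   # steps 8-10: idf, weights, top-n
        k: [t for t, _ in sorted(
                ((t, n * math.log(1 + A / f[t])) for t, n in counts.items()),
                key=lambda pair: -pair[1])[:top_n]]
        for k, counts in tf.items()
    }

docs = ["apple banana apple", "banana fruit banana",
        "dog cat dog", "cat animal cat"]
keywords = topic_keywords(docs, [0, 0, 1, 1], top_n=2)
print(keywords)  # {0: ['banana', 'apple'], 1: ['cat', 'dog']}
```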

4. Normalization, Smoothing, and Hyperparameters

Several considerations ensure robust and meaningful c-TF-IDF representations:

  • IDF Smoothing: The shift inside the logarithm ($\log(1 + A/f_t)$ rather than $\log(A/f_t)$) keeps the argument above one, so all IDFs are defined and positive.
  • TF Normalization: Optionally, dividing by $n_k$ accommodates variation in cluster size, making term weights comparable across topics.
  • n-gram Range: The vocabulary $V$ may include unigrams, bigrams, or both, tailored to the task.
  • Dynamic-topic Smoothing: For temporal topic modeling, L1-normalize each time-slice weight vector $W^{(i)}$ and enforce gradual drift via $W^{(i)} \leftarrow \alpha W^{(i)} + (1 - \alpha) W^{(i-1)}$ with $\alpha \in (0, 1)$ (typically $\alpha = 0.5$) to smooth topic dynamics.
  • Top-n Terms per Topic: Common practice is to retain n=10–20 keywords per topic, though this is task-dependent.
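A small numerical check illustrates why the optional TF normalization matters; the counts below are invented for illustration:

```python
import math

# Hypothetical raw counts for the same term in a small and a large cluster.
tf_small, n_small = 3, 6       # term count and token total, cluster 1
tf_large, n_large = 30, 60     # cluster 2 has 10x more text, same proportion
A = (n_small + n_large) / 2    # average cluster length
f_t = tf_small + tf_large      # total occurrences across clusters
idf = math.log(1 + A / f_t)

# Raw weights differ by a factor of 10 purely because of cluster size...
raw_small, raw_large = tf_small * idf, tf_large * idf
# ...while tf / n_k normalization makes the weights comparable across topics.
norm_small, norm_large = tf_small / n_small * idf, tf_large / n_large * idf
print(norm_small == norm_large)  # True
```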

5. Computational Profile and Scaling

  • Embedding: The most computationally intensive phase is document embedding, scaling as $O(N \cdot D)$, with $N$ documents and $D$ the embedding dimension. GPU acceleration is strongly advised.
  • UMAP and clustering: Both steps scale roughly as $O(N \log N)$.
  • c-TF-IDF Calculation: Once cluster assignments and counts are in RAM, computing all term-cluster weights requires $O(N + K \cdot |V|)$ time (document and term iterations).
  • Memory Considerations: Major storage is taken by the document embeddings ($O(N \cdot D)$), the term-frequency array ($O(K \cdot |V|)$), and the cluster assignments ($O(N)$).
  • Practicality: For corpora with $10^5$–$10^6$ documents and large vocabularies, the bottleneck is embedding; the c-TF-IDF construction itself is lightweight.
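A back-of-envelope estimate of these memory figures, under assumed sizes (one million documents, 384-dimensional float32 embeddings, 200 clusters, a 50,000-term vocabulary — all illustrative):

```python
# Rough memory estimate under assumed sizes (float32 = 4 bytes each).
N, D = 1_000_000, 384       # documents and embedding dimension (assumed)
K, V = 200, 50_000          # clusters and vocabulary size (assumed)

embeddings_gb = N * D * 4 / 1e9   # O(N*D) float32 embedding matrix
tf_matrix_gb = K * V * 4 / 1e9    # O(K*|V|) term-frequency array
assignments_mb = N * 4 / 1e6      # O(N) int32 cluster labels

print(f"embeddings: {embeddings_gb:.2f} GB")   # 1.54 GB
print(f"tf matrix:  {tf_matrix_gb:.2f} GB")    # 0.04 GB
print(f"labels:     {assignments_mb:.1f} MB")  # 4.0 MB
```

The embeddings dominate by two orders of magnitude, consistent with embedding being the practical bottleneck.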

6. Illustrative Example

A minimal scenario illustrates the use of c-TF-IDF:

  • Documents: $d_1$ = “apple banana apple”, $d_2$ = “banana fruit banana”, $d_3$ = “dog cat dog”, $d_4$ = “cat animal cat”
  • Clusters: $C_1 = \{d_1, d_2\}$ (fruit), $C_2 = \{d_3, d_4\}$ (animal)
  • Vocabulary: $V = \{\text{apple}, \text{banana}, \text{fruit}, \text{dog}, \text{cat}, \text{animal}\}$
  • Raw term frequencies:
    • $\text{tf}_{\text{apple},1} = 2$
    • $\text{tf}_{\text{banana},1} = 3$
    • $\text{tf}_{\text{fruit},1} = 1$
    • $\text{tf}_{\text{dog},2} = 2$
    • $\text{tf}_{\text{cat},2} = 3$
    • $\text{tf}_{\text{animal},2} = 1$
  • Token totals: $n_1 = 6$, $n_2 = 6$, $A = 6$
  • Total term frequencies: e.g., $f_{\text{apple}} = 2$, $f_{\text{banana}} = 3$, $f_{\text{fruit}} = 1$
  • IDF computation (example):
    • $\text{idf}_{\text{apple}} = \log(1 + 6/2) = \log 4 \approx 1.39$
  • c-TF-IDF weights (for $C_1$):
    • $W_{\text{apple},1} = 2 \cdot \log(1 + 6/2) \approx 2.77$
    • $W_{\text{banana},1} = 3 \cdot \log(1 + 6/3) \approx 3.30$
    • $W_{\text{fruit},1} = 1 \cdot \log(1 + 6/1) \approx 1.95$
  • Interpretation: "banana" and "apple" are found as most distinctive for topic 1.
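The example's weights can be reproduced directly from the formula $W_{t,k} = \text{tf}_{t,k} \cdot \log(1 + A/f_t)$:

```python
import math

A = 6.0                                       # average cluster length: (6 + 6) / 2
tf_1 = {"apple": 2, "banana": 3, "fruit": 1}  # raw counts in cluster C_1
f = {"apple": 2, "banana": 3, "fruit": 1}     # corpus-wide totals (each term
                                              # occurs in only one cluster here)
W_1 = {t: n * math.log(1 + A / f[t]) for t, n in tf_1.items()}
for t in sorted(W_1, key=W_1.get, reverse=True):
    print(f"{t}: {W_1[t]:.2f}")
# banana: 3.30, apple: 2.77, fruit: 1.95
```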

7. Comparison to Classical Document-Level TF–IDF

| Aspect | Classical TF–IDF | Class-based TF–IDF (c-TF-IDF) |
|---|---|---|
| Rarity measured by | Documents $d$ | Clusters $C_k$ ("mega-documents") |
| IDF formula | $\log(N / \text{df}_t)$ | $\log(1 + A / f_t)$ |
| Use-case | Information retrieval, per-document analysis | Topic modeling and keyword extraction for document clusters |
| Discriminative focus | Uncommon per document | Uncommon per cluster/topic |
| Interpretability | Rare terms per document | Most distinctive topic- or cluster-specific terms |

Classical TF-IDF’s document-oriented rarity may penalize words that appear in many members of a topic but are spread thinly across individual documents, since their high document frequency drives the IDF toward zero. In contrast, c-TF-IDF aggregates at the cluster level, highlighting terms unique to each topic and down-weighting those broadly spread across clusters. Empirically, c-TF-IDF leads to higher topic coherence and interpretability, as the salient words of a given cluster correspond more closely to shared semantic content (Grootendorst, 2022).
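A minimal numerical contrast makes this concrete; the four-document corpus below is invented for illustration:

```python
import math

# Four documents, two topics; "fresh" appears once in each fruit document.
N, df = 4, 2
per_doc = 1 * math.log(N / df)         # classical: 1 * log(4/2) per document

# Pooling "fresh apple" and "fresh banana" into one 4-token cluster
# aggregates the evidence (the animal cluster also has 4 tokens, so A = 4):
tf_c, A, f_t = 2, 4.0, 2
pooled = tf_c * math.log(1 + A / f_t)  # c-TF-IDF: 2 * log(3)
print(f"{per_doc:.2f} vs {pooled:.2f}")  # 0.69 vs 2.20
```

The topic-spread term that classical TF-IDF scores modestly in each document receives a much larger cluster-level weight once its occurrences are pooled.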

In summary, class-based TF-IDF adapts fundamental weighting principles to settings where the atomic unit shifts from individual documents to clusters of semantically similar texts. The approach is computationally efficient after the initial embedding and clustering, and it is well-suited to modern pipelines where topically coherent groupings of text are essential.

References

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
