Recursive Thematic Partitioning (RTP)

Updated 25 June 2026
  • Recursive Thematic Partitioning (RTP) is an unsupervised hierarchical clustering framework that uses LLM-generated yes/no questions to split document sets into interpretable thematic taxonomies.
  • It recursively generates a full binary tree by partitioning documents based on semantic queries that optimize information gain and promote human-readable splits.
  • RTP offers transparent taxonomy construction and controllable text synthesis, demonstrating improved interpretability over traditional statistical topic models.

Recursive Thematic Partitioning (RTP) is a question-driven, interpretable hierarchical clustering framework designed for the unsupervised analysis and controllable synthesis of text corpora. Built atop large language models (LLMs), RTP interactively constructs a full binary tree by recursively partitioning documents according to explicit yes/no questions in natural language. This shifts the emphasis from traditional statistical pattern mining (such as co-occurrence-based topic models) toward knowledge-driven, transparent thematic taxonomies, in which each branching decision is encoded as a natural language query (Tavares, 26 Sep 2025).

1. Mathematical Formalism

Let $D = \{d_1, d_2, \dots, d_{|D|}\}$ denote a collection of documents and $\mathcal{Q}$ the universe of possible yes/no questions that the LLM can pose about text. The partitioning process is governed by a mapping

$$\pi : D \times \mathcal{Q} \to \{0,1\}$$

where, for each document $d$ and question $q$, $\pi(d,q) = 1$ if the answer is "yes" and $\pi(d,q) = 0$ if "no".

RTP recursively builds a rooted, full binary tree $T = (V, E)$ whose nodes correspond to document subsets $D_n \subseteq D$. Each internal node $n$ is labeled by a question $q_n \in \mathcal{Q}$ and has two child nodes, $n_1$ ("yes") and $n_0$ ("no"). The recursive partition at each node is defined as:

$$D_{n_1} = \{d \in D_n : \pi(d, q_n) = 1\}, \qquad D_{n_0} = \{d \in D_n : \pi(d, q_n) = 0\}.$$

Recursion terminates if the node depth reaches a pre-specified maximum depth or the subset size $|D_n|$ falls below the minimum leaf size. The result is a taxonomy in which the logic of every split and cluster is fully explicit (Tavares, 26 Sep 2025).
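As a minimal sketch of the partition step, the mapping $\pi$ can be treated as any callable that returns 1 or 0. In RTP this role is played by an LLM; the keyword-based `toy_pi` below and the function name `partition` are illustrative assumptions, not part of the source.

```python
from typing import Callable, List, Tuple

# pi(d, q) -> 1 for "yes", 0 for "no". Any callable with this shape works here.
Pi = Callable[[str, str], int]


def partition(docs: List[str], question: str, pi: Pi) -> Tuple[List[str], List[str]]:
    """Split a document subset into (yes-subset, no-subset) under one question."""
    yes_docs: List[str] = []
    no_docs: List[str] = []
    for d in docs:
        # Each document is evaluated once and lands in exactly one child.
        (yes_docs if pi(d, question) == 1 else no_docs).append(d)
    return yes_docs, no_docs


def toy_pi(doc: str, question: str) -> int:
    """Hypothetical stand-in for the LLM: a plain keyword test."""
    return 1 if "refund" in doc.lower() else 0


docs = ["I want a refund", "Great service", "Refund was quick"]
yes_docs, no_docs = partition(docs, "Does the text mention a refund?", toy_pi)
```

Because every document goes to exactly one side, the two children always form a disjoint cover of the parent subset.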

2. Algorithmic Structure and Complexity

RTP’s central algorithm relies on repeated, LLM-powered question generation and voting-based partition assignment. At each node $n$:

  1. Question Generation: A representative sample $S_n \subseteq D_n$ is fed to the LLM, which produces a binary, semantically informative question $q_n$ aimed at dividing $D_n$ into balanced, coherent subsets. While no explicit quantitative criterion is imposed in the prompt, the process can be formalized via an information gain objective:

$$q_n = \arg\max_{q \in \mathcal{Q}} \left[ H(D_n) - \sum_{a \in \{0,1\}} \frac{|D_n^{a}(q)|}{|D_n|} \, H\big(D_n^{a}(q)\big) \right], \qquad D_n^{a}(q) = \{d \in D_n : \pi(d, q) = a\},$$

where $H(\cdot)$ denotes a semantic “entropy” metric.

  2. Answer Assignment: For each $d \in D_n$, the LLM is queried $k$ times at non-zero temperature; majority voting determines $\pi(d, q_n)$.
  3. Recursive Descent: Each branch invokes the same process on the yes-subset $D_{n_1}$ and the no-subset $D_{n_0}$ at incremented depth.
  4. Stopping Criteria: As outlined, recursion halts at maximum depth or when leaf size is too small.
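The voting and information-gain steps above can be illustrated with a small sketch. The source does not define the semantic entropy metric, so Shannon entropy over proxy labels stands in for it here as an explicit assumption. The names `majority_vote`, `entropy`, `information_gain`, and `scripted` are hypothetical, and `ask` represents one sampled LLM answer.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def majority_vote(doc: str, question: str, ask: Callable[[str, str], int], k: int = 5) -> int:
    """Query `ask` k times and return the majority answer (1 = yes, 0 = no)."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so the vote cannot tie")
    yes_votes = sum(ask(doc, question) for _ in range(k))
    return 1 if 2 * yes_votes > k else 0


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy in bits. Assumption: a proxy for the 'semantic entropy'."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels: Sequence[str], answers: Sequence[int]) -> float:
    """Parent entropy minus the size-weighted entropy of the yes/no children."""
    n = len(labels)
    if n == 0:
        return 0.0
    yes = [l for l, a in zip(labels, answers) if a == 1]
    no = [l for l, a in zip(labels, answers) if a == 0]
    return entropy(labels) - (len(yes) / n) * entropy(yes) - (len(no) / n) * entropy(no)


def scripted(answers: List[int]) -> Callable[[str, str], int]:
    """Test helper: replays a fixed answer sequence instead of calling an LLM."""
    it = iter(answers)
    return lambda doc, question: next(it)


vote = majority_vote("some review", "Is it positive?", scripted([1, 0, 1, 1, 0]), k=5)
clean_split = information_gain(["a", "a", "b", "b"], [1, 1, 0, 0])
useless_split = information_gain(["a", "a", "b", "b"], [1, 0, 1, 0])
```

A question that separates the proxy labels perfectly recovers the full parent entropy, while one that mixes them evenly gains nothing, which is the behavior the objective is meant to reward and penalize.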

The original pseudocode listing and the worst-case complexity expression are not reproduced here. The complexity bound is stated in terms of quantities that include the sample size per node and the number of LLM answer calls per document. In practice, the cost is moderated via global sampling and early stopping strategies (Tavares, 26 Sep 2025).
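The recursive procedure described in this section can be sketched in Python as follows. This is an illustrative reading of the four steps, not the author's listing: `gen_question` and `answer` are hypothetical callables standing in for the LLM, the parameter names are assumptions, and treating a one-sided split as a leaf is an added safeguard that the source does not mention.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    docs: List[str]
    depth: int
    question: Optional[str] = None   # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def build_tree(docs: List[str],
               gen_question: Callable[[List[str]], str],
               answer: Callable[[str, str], int],
               max_depth: int = 3, min_leaf: int = 2,
               sample_size: int = 8, k: int = 3,
               depth: int = 0, rng: Optional[random.Random] = None) -> Node:
    rng = rng or random.Random(0)
    node = Node(docs=docs, depth=depth)
    # Stopping criteria: maximum depth reached or subset below minimum leaf size.
    if depth >= max_depth or len(docs) < min_leaf:
        return node
    # Question generation from a sample of the node's documents.
    sample = rng.sample(docs, min(sample_size, len(docs)))
    question = gen_question(sample)
    # Answer assignment: k votes per document, majority wins.
    yes_docs, no_docs = [], []
    for d in docs:
        yes_votes = sum(answer(d, question) for _ in range(k))
        (yes_docs if 2 * yes_votes > k else no_docs).append(d)
    # Assumption: a split that sends every document one way is kept as a leaf.
    if not yes_docs or not no_docs:
        return node
    # Recursive descent on both children at incremented depth.
    node.question = question
    common = dict(max_depth=max_depth, min_leaf=min_leaf, sample_size=sample_size,
                  k=k, depth=depth + 1, rng=rng)
    node.yes = build_tree(yes_docs, gen_question, answer, **common)
    node.no = build_tree(no_docs, gen_question, answer, **common)
    return node


# Toy stand-ins for the LLM, used only to exercise the control flow.
def toy_question(sample: List[str]) -> str:
    return "Does the text contain the word 'good'?"


def toy_answer(doc: str, question: str) -> int:
    return 1 if "good" in doc else 0


corpus = ["good plot", "good acting", "bad plot", "bad acting"]
tree = build_tree(corpus, toy_question, toy_answer, max_depth=1)
```

Every internal node ends up with exactly two children, so the result is a full binary tree as the formalism requires.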

3. Interpretability and Empirical Evaluation

RTP’s core innovation is in maximizing cluster interpretability by centering each partition on an explicit, human-readable question, as opposed to traditional topic models which rely on latent distributions over keywords. Interpretability is assessed both by automatic metrics and expert human judgment.

  • Semantic Coherence: While conventional metrics use pointwise mutual information (PMI) over top keywords, RTP’s questions themselves serve as cluster descriptors, with interpretive transparency surpassing flat keyword lists.
  • Human Judgments: In controlled studies (IMDB, Yelp), human raters scored cluster descriptions on a 1–5 Likert scale for both RTP and BERTopic; the specific scores are not preserved here.
  • Downstream Classification: When cluster leaves are treated as classes, and new documents routed by traversing node questions, RTP attains competitive to superior performance compared to small-sample DistilBERT baselines and BERTopic-derived feature classifiers, especially when underlying themes align well with task labels. Example summary is shown below.

| Model | IMDB | Yelp | AG-News |
|---|---|---|---|
| SOTA (full-data) | 0.94 | 0.65 | 0.94 |
| DistilBERT (baseline) | 0.81 (0.03) | 0.39 (0.07) | 0.87 (0.01) |
| RTP | 0.96 (0.02) | 0.40 (0.12) | 0.64 (0.04) |

RTP’s improved interpretability is most pronounced when formal task labels have direct semantic correlation with the discovered themes (Tavares, 26 Sep 2025).
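Routing a new document through the tree for downstream classification might look like the sketch below. The nested-dictionary tree layout, the function names, and the rule that a leaf predicts the majority label of its training documents are all assumptions made for illustration; the source only states that leaves are treated as classes and that documents are routed by answering node questions.

```python
from collections import Counter
from typing import Callable, Dict


def route(tree: Dict, doc: str, answer: Callable[[str, str], int]) -> Dict:
    """Follow yes/no answers from the root until a leaf (no 'question' key)."""
    node = tree
    while "question" in node:
        node = node["yes"] if answer(doc, node["question"]) == 1 else node["no"]
    return node


def classify(tree: Dict, doc: str, answer: Callable[[str, str], int]) -> str:
    """Assumption: a leaf predicts the most common label among its documents."""
    leaf = route(tree, doc, answer)
    return Counter(leaf["labels"]).most_common(1)[0][0]


# Hypothetical one-level tree with labeled training documents at the leaves.
toy_tree = {
    "question": "Is the tone positive?",
    "yes": {"labels": ["pos", "pos", "neg"]},
    "no": {"labels": ["neg", "neg"]},
}


def toy_answer(doc: str, question: str) -> int:
    """Keyword stand-in for the LLM answerer."""
    return 1 if "love" in doc else 0


pred_a = classify(toy_tree, "I love this film", toy_answer)
pred_b = classify(toy_tree, "dull and slow", toy_answer)
```

Under this reading, accuracy depends on how well the discovered themes line up with the task labels, which matches the pattern in the table where sentiment datasets fare differently from AG-News.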

4. Controlled Thematic Generation

A completed RTP tree encodes each thematic cluster (leaf) by a path $(q_1, a_1), \dots, (q_n, a_n)$ of question-answer pairs leading from the root. This signature is leveraged to create controllable generation prompts, instructing LLMs to produce new text expressing precisely the set of attributes encoded by the path:

"Given the following constraints, produce a coherent review: 1. $(q_1, a_1)$ ... n. $(q_n, a_n)$ Write in the style of the original corpus, focusing on these attributes."

Evaluation of this CTG (Controllable Thematic Generation) approach, compared to uncontrolled and few-shot prompting, uses both embedding similarity (Sentence-BERT cosine, favoring style) and classification node-accuracy (favoring semantic attributes). CTG yields higher node-accuracy (e.g., 0.60 vs 0.04 for uncontrolled generation on IMDB), indicating reliable control over generated semantic substance (Tavares, 26 Sep 2025).
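A prompt of this shape can be assembled mechanically from a leaf's path, and node-accuracy can be computed once generated texts have been routed back through the tree. In the sketch below, the per-line format `question Answer: yes/no` and the definition of node-accuracy as the fraction of generated texts that land in their target leaf are assumptions; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_ctg_prompt(path: Sequence[Tuple[str, str]]) -> str:
    """Turn a root-to-leaf path of (question, answer) pairs into a prompt."""
    lines: List[str] = ["Given the following constraints, produce a coherent review:"]
    for i, (question, ans) in enumerate(path, start=1):
        # Assumed line format; the source elides the exact rendering of each pair.
        lines.append(f"{i}. {question} Answer: {ans}")
    lines.append("Write in the style of the original corpus, focusing on these attributes.")
    return "\n".join(lines)


def node_accuracy(routed_leaves: Sequence[str], target_leaves: Sequence[str]) -> float:
    """Fraction of generated texts whose routed leaf equals the intended leaf."""
    if len(routed_leaves) != len(target_leaves) or not target_leaves:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(r == t for r, t in zip(routed_leaves, target_leaves))
    return hits / len(target_leaves)


prompt = build_ctg_prompt([
    ("Is the review positive?", "yes"),
    ("Does it discuss the acting?", "no"),
])
acc = node_accuracy(["leaf_a", "leaf_b", "leaf_c"], ["leaf_a", "leaf_b", "leaf_d"])
```

Measuring control this way checks semantic attributes rather than surface style, which is why it complements the embedding-similarity metric.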

5. Connections to Broader Recursive Partitioning

The general principle of recursive partitioning has extensive precedent in hierarchical community detection for graphs, where model-free, top-down bipartitioning (e.g., spectral clustering, sign-splitting) recursively constructs binary trees of communities. Under the binary tree stochastic block model (BTSBM), such algorithms admit strong theoretical guarantees for exact recovery and are computationally efficient (see Li et al., 2018).

In RTP, the bipartition operation is not spectral but semantic—driven by LLM-formulated natural language queries and their evaluation on document subsets. Both frameworks, however, share a recursive, binary, top-down construction and similar stopping-rule logic.

6. Limitations, Emergent Phenomena, and Extensions

  • Model Dependence: RTP fundamentally depends on the generative and discriminative qualities of the underlying LLM. Pre-training biases and finite input context limitations may impact both question selection and answer reliability.
  • Recursion Artifacts: Emergent behaviors include unbalanced splits (due to highly specific questions isolating “minority” clusters) and question redundancy (similar queries at different nodes, owing to localized, memoryless question selection).
  • Computational Cost: Multiple LLM calls per document/question induce significant cost at larger scales; this is mitigated by fixed-size sampling and early stopping.
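To make the cost concern concrete, the sketch below computes a rough upper bound on LLM calls when no sampling or early stopping is applied. It is a back-of-the-envelope estimate derived from the procedure as described, not the complexity expression from the source; the function name and parameters are assumptions. It relies on two observations: subsets at any one depth are disjoint, so at most every document is answered `k` times per level, and a node is only split if it holds at least `min_leaf` documents.

```python
from typing import Dict


def estimate_llm_calls(n_docs: int, max_depth: int, k: int, min_leaf: int = 1) -> Dict[str, int]:
    """Upper bound on LLM calls for a tree built without sampling or early stopping."""
    if min_leaf < 1:
        raise ValueError("min_leaf must be at least 1")
    # Per level, each document is answered at most k times.
    answer_calls = max_depth * n_docs * k
    # One question per internal node: at most 2**depth nodes at a given depth,
    # and at most n_docs // min_leaf disjoint subsets large enough to split.
    question_calls = sum(min(2 ** depth, n_docs // min_leaf) for depth in range(max_depth))
    return {
        "answer_calls": answer_calls,
        "question_calls": question_calls,
        "total": answer_calls + question_calls,
    }


cost = estimate_llm_calls(n_docs=1000, max_depth=4, k=5, min_leaf=10)
```

In this estimate the answer-assignment calls dominate, which suggests why sampling documents and stopping early are the natural levers for reducing cost.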

Extensible Directions:

  • Human-in-the-loop workflows for pruning, overriding, or paraphrasing node questions.
  • Path-dependent question generation (conditioning new queries on the sequence history) to reduce redundancy.
  • Quantitative information-theoretic objectives, e.g., directly integrating coherence or information gain into the LLM prompt to optimize for balanced and semantically meaningful splits (Tavares, 26 Sep 2025).

7. Significance and Paradigmatic Distinction

Recursive Thematic Partitioning constitutes a paradigm shift in unsupervised text analysis, reorienting the focus from statistical distributional structure to transparent, question-based semantic logic. The end-to-end process yields hierarchies that are both interpretable for direct human consumption and directly actionable for both discriminative (classification, clustering) and generative (controllable text synthesis) applications. RTP thereby operationalizes a bridge between knowledge-driven human taxonomies and LLM-enabled data-driven analysis (Tavares, 26 Sep 2025).
