Recursive Thematic Partitioning (RTP)
- Recursive Thematic Partitioning (RTP) is an unsupervised hierarchical clustering framework that uses LLM-generated yes/no questions to split document sets into interpretable thematic taxonomies.
- It recursively generates a full binary tree by partitioning documents based on semantic queries that optimize information gain and promote human-readable splits.
- RTP offers transparent taxonomy construction and controllable text synthesis, demonstrating improved interpretability over traditional statistical topic models.
Recursive Thematic Partitioning (RTP) is a question-driven, interpretable hierarchical clustering framework designed for the unsupervised analysis and controllable synthesis of text corpora. Built atop LLMs, RTP interactively constructs a full binary tree by recursively partitioning documents according to explicit yes/no questions in natural language. This paradigm enables a shift from traditional statistical pattern mining (such as co-occurrence-based topic models) toward knowledge-driven, fully transparent thematic taxonomies, where each branching decision is semantically interpretable and explicitly encoded as a natural language query (Tavares, 26 Sep 2025).
1. Mathematical Formalism
Let $D$ denote a collection of documents and $Q$ the universe of possible yes/no questions that the LLM can pose about text. The partitioning process is governed by a mapping
$$f : D \times Q \to \{0, 1\},$$
where, for each document $d \in D$ and question $q \in Q$, $f(d, q) = 1$ if the answer is "yes" and $f(d, q) = 0$ if "no".
RTP recursively builds a rooted, full binary tree whose nodes correspond to document subsets ($D_v \subseteq D$), with each internal node $v$ labeled by a question $q_v$ and two child nodes: $D_v^{\text{yes}}$ and $D_v^{\text{no}}$. The recursive partition at each node is defined as:
$$D_v^{\text{yes}} = \{d \in D_v : f(d, q_v) = 1\}, \qquad D_v^{\text{no}} = \{d \in D_v : f(d, q_v) = 0\}.$$
Recursion terminates if the node depth reaches a pre-specified $d_{\max}$ or the subset size $|D_v|$ falls below the minimum leaf size $n_{\min}$. The result is a taxonomy where every split and cluster’s logic is fully explicit (Tavares, 26 Sep 2025).
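A single partition step can be illustrated with a minimal Python sketch. The function `keyword_answer` is a hypothetical stand-in for the LLM-backed mapping and is not part of RTP itself; the example documents and the question wording are likewise invented for illustration.

```python
from typing import Callable, List, Tuple

# f(d, q) -> 1 for "yes", 0 for "no"; in RTP this role is played by the LLM.
AnswerFn = Callable[[str, str], int]


def partition(docs: List[str], question: str, f: AnswerFn) -> Tuple[List[str], List[str]]:
    """Return (D_yes, D_no) for one node, preserving document order."""
    d_yes: List[str] = []
    d_no: List[str] = []
    for d in docs:
        # One call to f per document, then route it to the matching child.
        (d_yes if f(d, question) == 1 else d_no).append(d)
    return d_yes, d_no


def keyword_answer(doc: str, question: str) -> int:
    """Hypothetical stand-in for the LLM: 'yes' iff the quoted keyword occurs in doc."""
    keyword = question.split("'")[1]
    return int(keyword in doc.lower())


docs = ["Great acting and plot", "Terrible pacing", "The acting was wooden"]
d_yes, d_no = partition(docs, "Does the text mention 'acting'?", keyword_answer)
```

Every document lands in exactly one child, so the two subsets always form a partition of the node's documents.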
2. Algorithmic Structure and Complexity
RTP’s central algorithm relies on repeated, LLM-powered question generation and voting-based partition assignment. At each node $v$ with document subset $D_v$:
- Question Generation: A representative sample $S_v \subseteq D_v$ is fed to the LLM, which produces a binary, semantically informative question $q_v$ aimed at dividing $D_v$ into balanced, coherent subsets. While no explicit quantitative criterion is imposed in the prompt, the process can be formalized via an information gain objective:
$$q_v = \arg\max_{q \in Q} \left[ H(D_v) - \sum_{a \in \{\text{yes}, \text{no}\}} \frac{|D_v^{a}(q)|}{|D_v|} \, H\!\left(D_v^{a}(q)\right) \right]$$
where $D_v^{a}(q)$ are the subsets induced by $q$ and $H$ denotes a semantic “entropy” metric.
- Answer Assignment: For each $d \in D_v$, the LLM is queried $k$ times at non-zero temperature; majority voting determines $f(d, q_v)$.
- Recursive Descent: Each branch invokes the same process on $D_v^{\text{yes}}$ and $D_v^{\text{no}}$ at incremented depth.
- Stopping Criteria: As outlined, recursion halts at maximum depth or when leaf size is too small.
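The information gain objective in the question-generation step can be made concrete with a small sketch. The source does not specify the semantic entropy metric, so this example assumes Shannon entropy over hypothetical per-document theme labels as a proxy; the labels and the two candidate answer vectors are invented for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (bits) of a label multiset; an assumed proxy for 'semantic entropy'."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels: Sequence[str], answers: Sequence[int]) -> float:
    """Entropy of the node minus the size-weighted entropy of its yes/no children."""
    n = len(labels)
    if n == 0:
        return 0.0
    yes = [lab for lab, a in zip(labels, answers) if a == 1]
    no = [lab for lab, a in zip(labels, answers) if a == 0]
    weighted = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - weighted


# Hypothetical theme labels for four documents at one node.
labels = ["plot", "plot", "acting", "acting"]
gain_informative = information_gain(labels, [1, 1, 0, 0])    # separates the themes
gain_uninformative = information_gain(labels, [1, 0, 1, 0])  # mixes them evenly
```

Under this proxy, a question that separates the two themes recovers the full one bit of entropy, while a question whose answers are independent of the themes gains nothing.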
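Counted in documents processed by the LLM, the worst-case cost of the steps above is bounded by
$$O\!\left(d_{\max} \cdot N \cdot k + 2^{d_{\max}} \cdot m\right),$$
where $N = |D|$, $m$ is the sample size per node, and $k$ is the number of LLM answer calls per document. The first term arises because each document is answered $k$ times at each of at most $d_{\max}$ levels; the second because each of at most $2^{d_{\max}} - 1$ internal nodes reads a sample of $m$ documents for question generation. This is practically moderated via global sampling and early stopping strategies (Tavares, 26 Sep 2025).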
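The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `generate_question` and `answer` are caller-supplied callables standing in for LLM calls, the toy stubs and corpus are hypothetical, and the rule of discarding a split that leaves one side empty is an added assumption that keeps the tree full.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    docs: List[str]
    depth: int
    question: Optional[str] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def majority_vote(answer: Callable[[str, str], int], doc: str, question: str, k: int) -> int:
    """Query the answerer k times; return 1 only on a strict 'yes' majority."""
    yes_votes = sum(answer(doc, question) for _ in range(k))
    return int(2 * yes_votes > k)


def build_rtp(docs, generate_question, answer, max_depth, min_leaf, m, k,
              depth=0, rng=None) -> Node:
    if rng is None:
        rng = random.Random(0)
    node = Node(docs=docs, depth=depth)
    # Stopping criteria: maximum depth reached or subset below minimum leaf size.
    if depth >= max_depth or len(docs) < min_leaf:
        return node
    sample = rng.sample(docs, min(m, len(docs)))  # representative sample of size <= m
    question = generate_question(sample)
    d_yes, d_no = [], []
    for d in docs:
        (d_yes if majority_vote(answer, d, question, k) else d_no).append(d)
    # Assumption: a split that leaves one side empty is discarded (node stays a leaf).
    if not d_yes or not d_no:
        return node
    node.question = question
    node.yes = build_rtp(d_yes, generate_question, answer, max_depth, min_leaf, m, k, depth + 1, rng)
    node.no = build_rtp(d_no, generate_question, answer, max_depth, min_leaf, m, k, depth + 1, rng)
    return node


def leaves(node: Node) -> List[Node]:
    """Collect leaves left to right, visiting the 'yes' branch first."""
    if node.question is None:
        return [node]
    return leaves(node.yes) + leaves(node.no)


# Hypothetical stubs replacing the LLM: questions are bare keywords.
CANDIDATES = ["good", "film"]


def toy_generate_question(sample: List[str]) -> str:
    for kw in CANDIDATES:
        hits = sum(kw in d for d in sample)
        if 0 < hits < len(sample):  # keyword splits the sample non-trivially
            return kw
    return CANDIDATES[-1]


def toy_answer(doc: str, question: str) -> int:
    return int(question in doc)


corpus = ["good film", "good food", "bad film", "bad food"]
tree = build_rtp(corpus, toy_generate_question, toy_answer, max_depth=2, min_leaf=2, m=4, k=3)
```

An odd `k` avoids ties in the vote; with an even `k`, this sketch resolves a tie as "no". On the toy corpus the answerer is called 24 times, which equals $d_{\max} \cdot N \cdot k = 2 \cdot 4 \cdot 3$.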
3. Interpretability and Empirical Evaluation
RTP’s core innovation is to maximize cluster interpretability by centering each partition on an explicit, human-readable question, unlike traditional topic models, which rely on latent distributions over keywords. Interpretability is assessed both by automatic metrics and by expert human judgment.
- Semantic Coherence: While conventional metrics use pointwise mutual information (PMI) over top keywords, RTP’s questions themselves serve as cluster descriptors, with interpretive transparency surpassing flat keyword lists.
- Human Judgments: In controlled studies (IMDB, Yelp), RTP’s cluster descriptions were rated higher than BERTopic’s on a 1–5 Likert scale.
- Downstream Classification: When leaves are treated as classes and new documents are routed by answering the node questions along a root-to-leaf path, RTP is competitive with or superior to small-sample DistilBERT baselines and BERTopic-derived feature classifiers when the underlying themes align well with the task labels (IMDB, Yelp), but trails the DistilBERT baseline on AG-News. A summary of the reported scores is shown below.
| Model | IMDB | Yelp | AG-News |
|---|---|---|---|
| SOTA (full-data) | 0.94 | 0.65 | 0.94 |
| DistilBERT (baseline) | 0.81 (0.03) | 0.39 (0.07) | 0.87 (0.01) |
| RTP | 0.96 (0.02) | 0.40 (0.12) | 0.64 (0.04) |
RTP’s improved interpretability is most pronounced when formal task labels have direct semantic correlation with the discovered themes (Tavares, 26 Sep 2025).
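The routing step used for downstream classification, in which a new document is assigned to a leaf by answering node questions from the root down, can be sketched as below. The tree, its questions, the leaf names, and the keyword-based `keyword_answer` function are all hypothetical placeholders for an RTP tree and an LLM answerer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TreeNode:
    question: Optional[str] = None  # None marks a leaf
    yes: Optional["TreeNode"] = None
    no: Optional["TreeNode"] = None
    label: Optional[str] = None     # identifier of the leaf, used as the class


def route(node: TreeNode, doc: str, answer: Callable[[str, str], int]) -> Optional[str]:
    """Follow yes/no answers from the root until a leaf is reached; return its label."""
    while node.question is not None:
        node = node.yes if answer(doc, node.question) == 1 else node.no
    return node.label


# Hypothetical questions, each mapped to a keyword that the stub answerer looks for.
KEYWORDS = {
    "Does the review praise the film?": "loved",
    "Does the review discuss acting?": "acting",
}


def keyword_answer(doc: str, question: str) -> int:
    return int(KEYWORDS[question] in doc.lower())


toy_tree = TreeNode(
    question="Does the review praise the film?",
    yes=TreeNode(label="praise"),
    no=TreeNode(
        question="Does the review discuss acting?",
        yes=TreeNode(label="criticism/acting"),
        no=TreeNode(label="criticism/other"),
    ),
)

predicted = route(toy_tree, "The acting was flat", keyword_answer)
```

Each prediction costs at most one answer per level of the tree, and the sequence of answers doubles as an explanation of why the document was assigned to that leaf.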
4. Controlled Thematic Generation
A completed RTP tree encodes each thematic cluster (leaf) by a path $P = ((q_1, a_1), \ldots, (q_n, a_n))$ of question-answer pairs leading from the root. This signature is leveraged to create controllable generation prompts, instructing LLMs to produce new text expressing precisely the set of attributes encoded by the path:
"Given the following constraints, produce a coherent review: 1. $(q_1, a_1)$ ... n. $(q_n, a_n)$ Write in the style of the original corpus, focusing on these attributes."
Evaluation of this CTG (Controllable Thematic Generation) approach, compared to uncontrolled and few-shot prompting, uses both embedding similarity (Sentence-BERT cosine, favoring style) and classification node-accuracy (favoring semantic attributes). CTG yields higher node-accuracy (e.g., 0.60 vs 0.04 for uncontrolled generation on IMDB), indicating reliable control over generated semantic substance (Tavares, 26 Sep 2025).
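Assembling a generation prompt from a leaf's path might look like the following sketch. The opening and closing sentences follow the template quoted above; how each question-answer pair is rendered as a numbered constraint is an assumption, as are the example questions.

```python
from typing import List, Tuple


def build_ctg_prompt(path: List[Tuple[str, bool]]) -> str:
    """Turn a root-to-leaf path of (question, answer) pairs into a generation prompt."""
    lines = ["Given the following constraints, produce a coherent review:"]
    for i, (question, is_yes) in enumerate(path, start=1):
        # Assumed rendering: the question followed by its yes/no answer.
        lines.append(f"{i}. {question} {'yes' if is_yes else 'no'}")
    lines.append("Write in the style of the original corpus, focusing on these attributes.")
    return "\n".join(lines)


# Hypothetical path for one leaf.
example_path = [
    ("Is the review positive?", True),
    ("Does the review discuss acting?", False),
]
prompt = build_ctg_prompt(example_path)
```

Because the prompt is derived mechanically from the path, the same tree can later be used to check whether generated text routes back to the intended leaf, which is the idea behind the node-accuracy measure.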
5. Connections to Broader Recursive Partitioning
The general principle of recursive partitioning has extensive precedent in hierarchical community detection for graphs, where model-free, top-down bipartitioning (e.g., spectral clustering, sign-splitting) recursively constructs binary trees of communities. Under the binary tree stochastic block model (BTSBM), such algorithms admit strong theoretical guarantees for exact recovery and exhibit computational efficiency (see Li et al., 2018).
In RTP, the bipartition operation is not spectral but semantic: it is driven by LLM-formulated natural language queries and their evaluation on document subsets. Both frameworks, however, share a recursive, binary, top-down construction and similar stopping-rule logic.
6. Limitations, Emergent Phenomena, and Extensions
- Model Dependence: RTP fundamentally depends on the generative and discriminative qualities of the underlying LLM. Pre-training biases and finite input context limitations may impact both question selection and answer reliability.
- Recursion Artifacts: Emergent behaviors include unbalanced splits (due to highly specific questions isolating “minority” clusters) and question redundancy (similar queries at different nodes, owing to localized, memoryless question selection).
- Computational Cost: Multiple LLM calls per document/question induce significant cost at larger scales; this is mitigated by fixed-size sampling and early stopping.
Extensible Directions:
- Human-in-the-loop workflows for pruning, overriding, or paraphrasing node questions.
- Path-dependent question generation (conditioning new queries on the sequence history) to reduce redundancy.
- Quantitative information-theoretic objectives, e.g., directly integrating coherence or information gain into the LLM prompt to optimize for balanced and semantically meaningful splits (Tavares, 26 Sep 2025).
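As an illustration of the path-dependent extension listed above, a question-generation prompt could carry the ancestor questions so that the LLM is steered away from repeating them. The prompt wording and the function name `build_question_prompt` are hypothetical; the source proposes the idea but does not give a prompt.

```python
from typing import List, Tuple


def build_question_prompt(sample: List[str], history: List[Tuple[str, str]]) -> str:
    """Compose a question-generation prompt that lists questions already asked on the path."""
    lines = ["Propose one yes/no question that splits the documents below into two coherent groups."]
    if history:
        lines.append("These questions were already asked on the path to this node; "
                     "do not repeat or paraphrase them:")
        lines += [f"- {q} (answer: {a})" for q, a in history]
    lines.append("Documents:")
    lines += [f"* {d}" for d in sample]
    return "\n".join(lines)


history_prompt = build_question_prompt(
    ["The pacing dragged in the second act."],
    [("Is the review positive?", "no")],
)
root_prompt = build_question_prompt(["The pacing dragged in the second act."], [])
```

At the root the history is empty and the prompt reduces to the memoryless form, so the same builder serves both the current behavior and the proposed extension.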
7. Significance and Paradigmatic Distinction
Recursive Thematic Partitioning constitutes a paradigm shift in unsupervised text analysis, reorienting the focus from statistical distributional structure to transparent, question-based semantic logic. The end-to-end process yields hierarchies that are interpretable for direct human consumption and actionable for discriminative (classification, clustering) as well as generative (controllable text synthesis) applications. RTP thereby bridges knowledge-driven human taxonomies and LLM-enabled data-driven analysis (Tavares, 26 Sep 2025).