CST: Recursive Query Tree Generation
- Context-Split-Tree (CST) is a recursive, LLM-driven algorithm that constructs binary trees over text, generating query–context pairs across multiple granularity levels.
- It integrates context splitting with LLM-based question generation at each node to automatically create high-quality supervised fine-tuning data.
- By employing contrastive filtering and tunable parameters, CST optimizes query diversity and fidelity, significantly improving downstream model performance.
The Context-Split-Tree (CST) is a recursive, LLM-driven algorithm for constructing binary trees over a textual context to generate multi-granularity, context-driven query–context pairs. Central to the AugCon framework, CST enables automated large-scale generation of supervised fine-tuning (SFT) data for LLMs, spanning a range of granularities from macro-level to fine-grained queries. By integrating LLM-based context splitting and question generation at each node, CST achieves structural coverage of all meaningful context scales and forms the foundation for downstream SFT data quality improvements (Quan, 2024).
1. Formal Structure and Definition
Let $C$ be an initial textual context, segmented into atomic sentences $s_1, s_2, \ldots, s_n$. CST constructs a binary tree $T(C)$, where each node comprises:
- a contiguous sub-context $c \subseteq C$
- a query $q$ tailored to $c$’s semantic granularity
The split operation invokes an LLM with a specially designed prompt to:
- Generate a question $q$ directly answerable from $c$
- Partition $c$ into semantically coherent, minimally overlapping sub-contexts $c_1$ and $c_2$
For a sub-context $c$ of $m$ sentences, a split index $k$ is selected, forming sub-contexts $c_1 = (s_1, \ldots, s_k)$ and $c_2 = (s_{k+1}, \ldots, s_m)$ (indices taken within $c$). The process recurses on $c_1$ and $c_2$ while $|c| > \ell_{\min}$ (a tunable minimum-length threshold) and the split is non-degenerate ($0 < k < m$), terminating in leaf nodes when either condition fails. Each node’s depth corresponds to its granularity: root-level (macro), intermediate (conceptual), and deeper (detail).
A key structural property is that, when splitting proceeds down to single sentences, a context of $n$ sentences yields a CST with exactly $2n - 1$ nodes, each corresponding to a query–context pair, i.e., the candidate set $\{(c_v, q_v) : v \in T(C)\}$. This determinism ensures an exhaustive yet non-redundant traversal of granularities.
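To make the recursion concrete, the following is a minimal Python sketch of the node structure and construction procedure described above; it is an illustration under stated assumptions, not the reference implementation. The `llm_split` callable is a hypothetical stand-in for the LLM prompt that returns a query for the block plus a split index; its name, the `min_len` parameter, and the exact termination test are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class CSTNode:
    context: List[str]                 # contiguous block of sentences (a sub-context c)
    query: str                         # question answerable from `context`
    left: Optional["CSTNode"] = None
    right: Optional["CSTNode"] = None

def build_cst(
    sentences: List[str],
    llm_split: Callable[[List[str]], Tuple[str, int]],  # hypothetical LLM wrapper
    min_len: int = 1,                                   # plays the role of l_min
) -> CSTNode:
    """Recursively build a Context-Split-Tree over a list of sentences.

    `llm_split(sentences)` is assumed to return (query, k): a question answerable
    from the whole block and a split index with 0 < k < len(sentences).
    """
    query, k = llm_split(sentences)
    node = CSTNode(context=sentences, query=query)
    # Recurse only while the block exceeds the minimum length and the split is non-degenerate.
    if len(sentences) > min_len and 0 < k < len(sentences):
        node.left = build_cst(sentences[:k], llm_split, min_len)
        node.right = build_cst(sentences[k:], llm_split, min_len)
    return node
```

A real `llm_split` would wrap a chat-completion call with the CST prompt and parse the generated question and split point from the model’s response.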
2. Recursive Construction and Algorithmic Workflow
The CST algorithm proceeds recursively: each call generates a query for the current sub-context, asks the LLM for a split point, and recurses on the two resulting halves until the termination conditions of Section 1 are met. The process is interleaved with a filtering mechanism that enforces an upper bound $K$ on queries retained per context. If fewer than $K$ high-quality, diverse queries survive post-scoring, CST is re-invoked until the requirement is met. This approach guarantees coverage at multiple context scales.
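Continuing the sketch from Section 1, one plausible shape for this outer loop (not the paper's code) is shown below; `score_pair` and `select_fn` are hypothetical callables standing in for the contrastive scorer and the score-sort plus ROUGE-L deduplication step described in Section 4.

```python
from typing import Callable, List, Tuple

def collect_pairs(node) -> List[Tuple[List[str], str]]:
    """Depth-first traversal collecting every (context, query) pair in the tree."""
    if node is None:
        return []
    return [(node.context, node.query)] + collect_pairs(node.left) + collect_pairs(node.right)

def generate_sft_pairs(
    sentences: List[str],
    llm_split: Callable[[List[str]], Tuple[str, int]],
    score_pair: Callable[[List[str], str], float],   # hypothetical scorer wrapper
    select_fn: Callable[[list, int], list],          # filtering step (see the Section 4 sketch)
    k_max: int,
    max_rounds: int = 3,                             # illustrative re-invocation budget
) -> list:
    """Re-invoke CST until k_max high-quality, diverse queries survive filtering,
    or the round budget is exhausted."""
    kept: list = []
    candidates: list = []
    for _ in range(max_rounds):
        root = build_cst(sentences, llm_split)       # from the Section 1 sketch
        candidates += collect_pairs(root)            # grow the candidate pool each round
        scored = [(ctx, q, score_pair(ctx, q)) for ctx, q in candidates]
        kept = select_fn(scored, k_max)
        if len(kept) >= k_max:
            break
    return kept
```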
3. Complexity and Resource Considerations
Let $n$ be the sentence count of the initial context $C$. In the degenerate worst case (each split partitions off only one sentence), tree depth is $O(n)$, and node count remains $O(n)$. For balanced splits, depth is $O(\log n)$; node count is still linear, $2n - 1 = O(n)$. Each node entails one LLM inference for question and sub-context generation, incurring total cost:
- Time: $O(n)$ times the cost of a single LLM inference over a context of size at most $|C|$
- Space: $O(n)$ storage for (context, query) pairs
The parameter $\ell_{\min}$ governs granularity: lower values of $\ell_{\min}$ yield deeper trees and more fine-grained queries, so tuning $\ell_{\min}$ enables precise control over output diversity and data volume.
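As a rough sanity check on these bounds, the short self-contained snippet below counts nodes and depth of an idealized, perfectly balanced CST; `min_len` plays the role of $\ell_{\min}$, and the printed figures are purely illustrative, not measurements from the paper.

```python
def cst_stats(m: int, min_len: int = 1) -> tuple:
    """(node count, depth) of an idealized, perfectly balanced CST over m sentences."""
    if m <= min_len or m < 2:
        return 1, 1                        # leaf node
    left, right = m // 2, m - m // 2       # balanced split
    ln, ld = cst_stats(left, min_len)
    rn, rd = cst_stats(right, min_len)
    return 1 + ln + rn, 1 + max(ld, rd)

print(cst_stats(64, min_len=1))   # (127, 7): 2n - 1 nodes, depth ~ log2(n) + 1
print(cst_stats(64, min_len=8))   # (15, 4): a coarser tree with fewer, broader queries
```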
4. CST in Data Filtering and Contrastive Scoring
Once a candidate pool $Q$ of (context, query) pairs has been generated for a context, CST integrates with a lightweight contrastive-learning-based scorer $S$. This scorer is trained as follows:
- Positive samples $q^{+}$: queries generated by CST with the full few-shot prompt
- Negative samples $q^{-}$: queries generated from degraded prompts (weakened instructions or fewer shots)
Contrastive loss: $\mathcal{L} = -\log \sigma\big(S(q^{+}) - S(q^{-})\big)$, with $\sigma$ the sigmoid function.
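A minimal PyTorch sketch of this pairwise form of the loss follows; `score_pos` and `score_neg` are assumed to be batches of scalar scores $S(q^{+})$ and $S(q^{-})$, and the exact formulation in the original work may differ in detail.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    # L = -log sigma(S(q+) - S(q-)), averaged over the batch:
    # the scorer is pushed to rank full-prompt CST queries above degraded ones.
    return -F.logsigmoid(score_pos - score_neg).mean()

# Example with dummy scores for a batch of 8 (positive, negative) pairs
loss = pairwise_contrastive_loss(torch.randn(8), torch.randn(8))
```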
During inference, all candidates in $Q$ are scored and sorted. The top-ranked queries are selected greedily, skipping any whose ROUGE-L F1 overlap with a previously selected query exceeds a threshold $\tau$ (to ensure diversity), until $K$ slots are filled. If insufficient diverse queries remain, CST is rerun.
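The selection step can be sketched as a greedy loop over score-sorted candidates with a plain LCS-based ROUGE-L F1 check; the helper names and the default threshold `tau = 0.7` are illustrative assumptions rather than values from the paper.

```python
from typing import List, Tuple

def _lcs_len(a: List[str], b: List[str]) -> int:
    # rolling-row dynamic program for longest-common-subsequence length
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(hyp: str, ref: str) -> float:
    """ROUGE-L F1 between two strings under whitespace tokenization."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    lcs = _lcs_len(h, r)
    p, rec = lcs / len(h), lcs / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

def select_diverse_top_k(scored: List[Tuple[List[str], str, float]], k: int, tau: float = 0.7):
    """Walk candidates in descending score order; keep a (context, query) pair only
    if its query overlaps every already-kept query by at most tau ROUGE-L F1."""
    kept: List[Tuple[List[str], str]] = []
    for ctx, query, _score in sorted(scored, key=lambda t: t[2], reverse=True):
        if all(rouge_l_f1(query, q) <= tau for _, q in kept):
            kept.append((ctx, query))
        if len(kept) == k:
            break
    return kept
```

A function of this shape could serve as the `select_fn` passed to the workflow sketch in Section 2.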
5. Illustrative Example on a Toy Text
For the passage: "The profits of the contemporary global value chains (GVC) form a V-shape, also known as the ‘smile curve’. At one end are R & D and design; at the other end are services and marketing; processing sits in the middle. Profits at the ends are 20–25%; in the middle only 5%."
CST constructs the following (abbreviated) tree:
- Node 1 (whole passage): "Why do entrepreneurs worldwide strive to move up the value chain?"
  - C₁ (first half): "What are the key components of the contemporary global value chains?"
    - C₁₁: "What does the global value curve look like?" (leaf; no further split)
    - C₁₂: "What is the structure of the smile curve?"
      - C₁₂₁: "What lies in the middle of the smile curve?" (leaf)
  - C₂ (second half): Questions about low-profit-margin industries
In total, CST produces 8 binary-tree nodes and corresponding context–query pairs. The top 4 are retained post-filtering, covering granularities from macro-level down to specific details.
6. Empirical Evaluation and Metrics
CST yields substantial improvements in SFT data quality and downstream model performance. Key empirical findings:
- Human evaluation on the DailyM test set:
  - Query Realism: 4.37 (CST) vs 4.05 (best prior)
  - Query Diversity: 4.68 (CST) vs 4.13
- Automatic QA benchmarks (fine-tuning Llama3-70B):
| Benchmark | Acc (CST, AugCon) | Acc (Context-Instruct) |
|-------------------|-------------------|------------------------|
| SQuAD1.1 | 0.336 | 0.314 |
| TriviaQA | 0.849 | 0.825 |
| DROP | 0.350 | 0.334 |
| WebGLM-QA BS | 0.924 | 0.885 |
- Granularity distribution (CST vs Context-Instruct):
| Type | CST (%) | Context-Instruct (%) |
|-------------|---------|----------------------|
| Detail | 37.8 | 17.9 |
| Concept | 35.3 | 63.4 |
| Macro | 26.9 | 18.7 |
- Ablations (on TriviaQA):
  - Removing CST: accuracy drops from 0.849 to 0.793
  - Removing contrastive filter: 0.828
  - Removing fidelity module: 0.833
- Compute-matched comparison (80 A100 GPU hours):
  - AugCon wins 64.5% of GPT-4-refereed comparisons against ETRC and 60.3% against Context-Instruct
These results demonstrate that CST’s recursive, tree-structured coverage of context enables generation of highly diverse, realistic, and fidelity-aligned SFT data, with significant impact on the quality of LLM fine-tuning outcomes (Quan, 2024).