CST: Recursive Query Tree Generation

Updated 17 March 2026
  • Context-Split-Tree (CST) is a recursive, LLM-driven algorithm that constructs binary trees over text, generating query–context pairs across multiple granularity levels.
  • It integrates context splitting with LLM-based question generation at each node to automatically create high-quality supervised fine-tuning data.
  • By employing contrastive filtering and tunable parameters, CST optimizes query diversity and fidelity, significantly improving downstream model performance.

The Context-Split-Tree (CST) is a recursive, LLM-driven algorithm for constructing binary trees over a textual context to generate multi-granularity, context-driven query–context pairs. Central to the AugCon framework, CST enables automated large-scale generation of supervised fine-tuning (SFT) data for LLMs, spanning a range of granularities from macro-level to fine-grained queries. By integrating LLM-based context splitting and question generation at each node, CST achieves structural coverage of all meaningful context scales and forms the foundation for downstream SFT data quality improvements (Quan, 2024).

1. Formal Structure and Definition

Let C be an initial textual context, segmented into atomic sentences S_1, …, S_n. CST constructs a binary tree T = (V, E), where each node v ∈ V comprises:

  • a contiguous sub-context C_v ⊆ C
  • a query q_v tailored to C_v’s semantic granularity

The split operation Split(C) = (C_1, C_2, q) invokes an LLM with a specially designed prompt to:

  • Generate a question q directly answerable from C
  • Partition S_1, …, S_n into semantically coherent, minimally overlapping sub-contexts C_1 and C_2

For a context C of n sentences, a split index k with 0 < k < n is selected, forming sub-contexts C_1 = (S_1, …, S_k) and C_2 = (S_{k+1}, …, S_n). The process recurses on C_1 and C_2 while each sub-context exceeds a tunable minimum-length threshold L_min and the split is non-degenerate (both halves non-empty), terminating with leaf nodes when either condition fails. Each node’s depth coheres with its granularity: root-level (macro), intermediate (conceptual), and deeper (detail).

A key structural property is that, when recursion proceeds down to single-sentence leaves, a CST over a context of n sentences contains exactly |V| = 2n − 1 nodes (n leaves plus n − 1 internal nodes), each corresponding to a query–context pair. This determinism ensures an exhaustive yet non-redundant traversal of granularity.
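When every leaf is a single sentence, the CST is a full binary tree with n leaves, so the node count is 2n − 1 regardless of where each split falls. A minimal counting sketch (the function name is illustrative, not from the paper):

```python
def cst_node_count(n: int) -> int:
    # A CST over n sentences whose leaves are single sentences is a
    # full binary tree: every internal node has exactly two children,
    # so n leaves imply n - 1 internal nodes and 2n - 1 nodes total.
    if n == 1:
        return 1                      # a single sentence is a leaf
    # any non-degenerate split k (0 < k < n) yields the same total
    k = n // 2
    return 1 + cst_node_count(k) + cst_node_count(n - k)

for n in (1, 2, 3, 10):
    assert cst_node_count(n) == 2 * n - 1
```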

2. Recursive Construction and Algorithmic Workflow

The CST algorithm proceeds recursively: each call generates a query for the current context, splits the context in two, and recurses on both halves until the termination conditions hold. The process is interleaved with a filtering mechanism that enforces an upper bound m on queries retained per context. If fewer than m high-quality, diverse queries survive post-scoring, CST is re-invoked until the requirement is met. This approach guarantees coverage at multiple context scales.
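The recursive construction can be sketched as follows. Here `llm_split` is a stand-in for the real LLM call (a production implementation would prompt the model to choose a semantically coherent split point and write the question), and all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    context: list[str]                      # contiguous run of sentences
    query: str
    children: list["Node"] = field(default_factory=list)

def llm_split(context: list[str]) -> tuple[int, str]:
    """Stand-in for the LLM: return a split index k and a query.
    A real implementation would ask the model for a semantically
    coherent split and a question answerable from `context`."""
    return len(context) // 2, f"Question about: {' '.join(context)[:40]}"

def build_cst(context: list[str], min_len: int = 1) -> Node:
    k, query = llm_split(context)
    node = Node(context, query)
    # recurse only while the split is non-degenerate and both halves
    # respect the minimum-length threshold
    if 0 < k < len(context) and k >= min_len and len(context) - k >= min_len:
        node.children = [build_cst(context[:k], min_len),
                         build_cst(context[k:], min_len)]
    return node

def count(node: Node) -> int:
    return 1 + sum(count(c) for c in node.children)

sentences = ["S1.", "S2.", "S3.", "S4."]
tree = build_cst(sentences)
assert count(tree) == 2 * len(sentences) - 1   # |V| = 2n - 1
```

Raising `min_len` prunes the recursion earlier, which is how the granularity threshold discussed below trades tree depth against data volume.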

3. Complexity and Resource Considerations

Let n be the sentence count of the initial context C. In the degenerate worst case (each split partitions off only one sentence), tree depth is O(n), and node count remains 2n − 1. For balanced splits, depth is O(log n); node count is still linear in n. Each node entails one LLM inference for question and sub-context generation, incurring total cost:

  • Time: O(n) times the cost of a single LLM inference with context size at most |C|
  • Space: O(n) storage for (context, query) pairs

The minimum-length threshold L_min governs granularity, with lower L_min yielding deeper trees and more fine-grained queries. Tuning L_min thus enables precise control over output diversity and data volume.

4. CST in Data Filtering and Contrastive Scoring

Upon generating a candidate pool Q of (context, query) pairs per context, CST integrates with a lightweight contrastive-learning-based scorer s. This scorer is trained as follows:

  • Positive samples q⁺: generated with the full CST + few-shot prompt
  • Negative samples q⁻: generated from degraded prompts (weaker instructions, fewer shots)

Contrastive loss: L = −log σ(s(q⁺) − s(q⁻)), with σ the sigmoid function.
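A minimal numeric sketch of this pairwise objective, assuming the common logistic form −log σ(s(q⁺) − s(q⁻)) (the paper’s exact formulation may differ; function names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def contrastive_loss(score_pos: float, score_neg: float) -> float:
    # Pairwise logistic loss: pushes the scorer to rank the positive
    # (well-prompted) query above the negative (degraded-prompt) one.
    return -math.log(sigmoid(score_pos - score_neg))

# the loss shrinks as the positive-negative margin grows
assert contrastive_loss(2.0, -1.0) < contrastive_loss(0.5, 0.0)
```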

During inference, all candidate queries are scored and sorted. The top-ranked queries are selected, skipping any whose ROUGE-L F1 overlap with a previously selected query exceeds a threshold τ (to ensure diversity), until m slots are filled. If insufficient diverse queries remain, CST is rerun.
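The score–sort–deduplicate selection can be sketched as follows. The ROUGE-L F1 here is a simple token-level LCS implementation, and the threshold and slot-count names (`tau`, `m`) are illustrative:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(cand: str, ref: str) -> float:
    a, b = cand.split(), ref.split()
    if not a or not b:
        return 0.0
    l = lcs_len(a, b)
    if l == 0:
        return 0.0
    p, r = l / len(a), l / len(b)
    return 2 * p * r / (p + r)

def select_diverse(scored: list[tuple[float, str]], m: int, tau: float) -> list[str]:
    """Greedy top-m selection, skipping queries whose ROUGE-L F1 with
    any already-selected query exceeds tau."""
    chosen: list[str] = []
    for _, q in sorted(scored, reverse=True):
        if all(rouge_l_f1(q, c) <= tau for c in chosen):
            chosen.append(q)
        if len(chosen) == m:
            break
    return chosen

pool = [(0.9, "what is the smile curve"),
        (0.8, "what is the smile curve shape"),   # near-duplicate, filtered
        (0.7, "which industries have low margins")]
assert select_diverse(pool, 2, 0.6) == ["what is the smile curve",
                                        "which industries have low margins"]
```

If the loop exhausts the pool before `m` slots are filled, the caller would re-invoke CST to replenish candidates, as described above.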

5. Illustrative Example on a Toy Text

For the passage: "The profits of the contemporary global value chains (GVC) form a V-shape, also known as the ‘smile curve’. At one end are R & D and design; at the other end are services and marketing; processing sits in the middle. Profits at the ends are 20–25%; in the middle only 5%."

CST constructs the following (abbreviated) tree:

  • Node 1 (whole passage): "Why do entrepreneurs worldwide strive to move up the value chain?"
    • C₁ (first half): "What are the key components of the contemporary global value chains?"
      • C₁₁: "What does the global value curve look like?" (leaf; no further split)
      • C₁₂: "What is the structure of the smile curve?"
        • C₁₂₁: "What lies in the middle of the smile curve?" (leaf)
    • C₂ (second half): questions about low-profit-margin industries

In total, CST produces 8 binary-tree nodes and corresponding context–query pairs. The top 4 are retained post-filtering, covering macro to specific details.

6. Empirical Evaluation and Metrics

CST yields substantial improvements in SFT data quality and downstream model performance. Key empirical findings:

  • Human evaluation on the DailyM test set:
    • Query Realism: 4.37 (CST) vs 4.05 (best prior)
    • Query Diversity: 4.68 (CST) vs 4.13
  • Automatic QA benchmarks (fine-tuning Llama3-70B):

| Benchmark    | Acc (CST, AugCon) | Acc (Context-Instruct) |
|--------------|-------------------|------------------------|
| SQuAD1.1     | 0.336             | 0.314                  |
| TriviaQA     | 0.849             | 0.825                  |
| DROP         | 0.350             | 0.334                  |
| WebGLM-QA BS | 0.924             | 0.885                  |

  • Granularity distribution (CST vs Context-Instruct):

| Type    | CST (%) | Context-Instruct (%) |
|---------|---------|----------------------|
| Detail  | 37.8    | 17.9                 |
| Concept | 35.3    | 63.4                 |
| Macro   | 26.9    | 18.7                 |

  • Ablations (on TriviaQA):
    • Removing CST: Accuracy drops from 0.849 to 0.793
    • Removing contrastive filter: 0.828
    • Removing fidelity module: 0.833
  • Compute-matched comparison (80 A100 GPU hours):
    • AugCon wins over ETRC 64.5% and over Context-Instruct 60.3% under GPT-4 refereeing

These results demonstrate that CST’s recursive, tree-structured coverage of context enables generation of highly diverse, realistic, and fidelity-aligned SFT data, with significant impact on the quality of LLM fine-tuning outcomes (Quan, 2024).

References (1)