CST: Recursive Query Tree Generation
- Context-Split-Tree (CST) is a recursive, LLM-driven algorithm that constructs binary trees over text, generating query–context pairs across multiple granularity levels.
- It integrates context splitting with LLM-based question generation at each node to automatically create high-quality supervised fine-tuning data.
- By employing contrastive filtering and tunable parameters, CST optimizes query diversity and fidelity, significantly improving downstream model performance.
The Context-Split-Tree (CST) is a recursive, LLM-driven algorithm for constructing binary trees over a textual context to generate multi-granularity, context-driven query–context pairs. Central to the AugCon framework, CST enables automated large-scale generation of supervised fine-tuning (SFT) data for LLMs, spanning a range of granularities from macro-level to fine-grained queries. By integrating LLM-based context splitting and question generation at each node, CST achieves structural coverage of all meaningful context scales and forms the foundation for downstream SFT data quality improvements (Quan, 2024).
1. Formal Structure and Definition
Let $C$ be an initial textual context, segmented into atomic sentences $s_1, s_2, \ldots, s_n$. CST constructs a binary tree $T(C)$, where each node comprises:
- a contiguous sub-context $c \subseteq C$
- a query $q$ tailored to $c$’s semantic granularity
The split operation invokes an LLM with a specially designed prompt to:
- Generate a question $q$ directly answerable from $c$
- Partition $c$ into semantically coherent, minimally overlapping sub-contexts $c_1$ and $c_2$
For a sub-context $c$ of $m$ sentences, a split index $k$ is selected, forming sub-contexts $c_1 = (s_1, \ldots, s_k)$ and $c_2 = (s_{k+1}, \ldots, s_m)$ (indices taken within $c$). The process recurses on $c_1$ and $c_2$ while $|c| > \ell_{\min}$ (a tunable minimum-length threshold) and the split is non-degenerate ($0 < k < m$), terminating in leaf nodes when either condition fails. Each node’s depth corresponds to its granularity: root-level (macro), intermediate (conceptual), and deeper (detail).
A key structural property is that, when splitting proceeds down to single sentences, a context of $n$ sentences yields a CST with exactly $2n - 1$ nodes, each corresponding to a query–context pair, i.e., the candidate set $\{(c_v, q_v) : v \in T(C)\}$. This determinism ensures an exhaustive yet non-redundant traversal of granularities.
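To make the recursion concrete, the following is a minimal Python sketch of the node structure and construction procedure described above; it is an illustration under stated assumptions, not the reference implementation. The `llm_split` callable is a hypothetical stand-in for the LLM prompt that returns a query for the block plus a split index; its name, the `min_len` parameter, and the exact termination test are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class CSTNode:
    context: List[str]                 # contiguous block of sentences (a sub-context c)
    query: str                         # question answerable from `context`
    left: Optional["CSTNode"] = None
    right: Optional["CSTNode"] = None

def build_cst(
    sentences: List[str],
    llm_split: Callable[[List[str]], Tuple[str, int]],  # hypothetical LLM wrapper
    min_len: int = 1,                                   # plays the role of l_min
) -> CSTNode:
    """Recursively build a Context-Split-Tree over a list of sentences.

    `llm_split(sentences)` is assumed to return (query, k): a question answerable
    from the whole block and a split index with 0 < k < len(sentences).
    """
    query, k = llm_split(sentences)
    node = CSTNode(context=sentences, query=query)
    # Recurse only while the block exceeds the minimum length and the split is non-degenerate.
    if len(sentences) > min_len and 0 < k < len(sentences):
        node.left = build_cst(sentences[:k], llm_split, min_len)
        node.right = build_cst(sentences[k:], llm_split, min_len)
    return node
```

A real `llm_split` would wrap a chat-completion call with the CST prompt and parse the generated question and split point from the model’s response.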
2. Recursive Construction and Algorithmic Workflow
The CST algorithm proceeds recursively: each call generates a query for the current sub-context, asks the LLM for a split point, and recurses on the two resulting halves until the termination conditions of Section 1 are met. The process is interleaved with a filtering mechanism that enforces an upper bound $K$ on queries retained per context. If fewer than $K$ high-quality, diverse queries survive post-scoring, CST is re-invoked until the requirement is met. This approach guarantees coverage at multiple context scales.
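Continuing the sketch from Section 1, one plausible shape for this outer loop (not the paper's code) is shown below; `score_pair` and `select_fn` are hypothetical callables standing in for the contrastive scorer and the score-sort plus ROUGE-L deduplication step described in Section 4.

```python
from typing import Callable, List, Tuple

def collect_pairs(node) -> List[Tuple[List[str], str]]:
    """Depth-first traversal collecting every (context, query) pair in the tree."""
    if node is None:
        return []
    return [(node.context, node.query)] + collect_pairs(node.left) + collect_pairs(node.right)

def generate_sft_pairs(
    sentences: List[str],
    llm_split: Callable[[List[str]], Tuple[str, int]],
    score_pair: Callable[[List[str], str], float],   # hypothetical scorer wrapper
    select_fn: Callable[[list, int], list],          # filtering step (see the Section 4 sketch)
    k_max: int,
    max_rounds: int = 3,                             # illustrative re-invocation budget
) -> list:
    """Re-invoke CST until k_max high-quality, diverse queries survive filtering,
    or the round budget is exhausted."""
    kept: list = []
    candidates: list = []
    for _ in range(max_rounds):
        root = build_cst(sentences, llm_split)       # from the Section 1 sketch
        candidates += collect_pairs(root)            # grow the candidate pool each round
        scored = [(ctx, q, score_pair(ctx, q)) for ctx, q in candidates]
        kept = select_fn(scored, k_max)
        if len(kept) >= k_max:
            break
    return kept
```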
3. Complexity and Resource Considerations
Let $n$ be the sentence count of the initial context $C$. In the degenerate worst case (each split partitions off only one sentence), tree depth is $O(n)$, and node count remains $O(n)$. For balanced splits, depth is $O(\log n)$; node count is still linear, $2n - 1 = O(n)$. Each node entails one LLM inference for question and sub-context generation, incurring total cost:
- Time: $O(n)$ times the cost of a single LLM inference over a context of size at most $|C|$
- Space: $O(n)$ storage for (context, query) pairs
The parameter $\ell_{\min}$ governs granularity: lower values of $\ell_{\min}$ yield deeper trees and more fine-grained queries, so tuning $\ell_{\min}$ enables precise control over output diversity and data volume.
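As a rough sanity check on these bounds, the short self-contained snippet below counts nodes and depth of an idealized, perfectly balanced CST; `min_len` plays the role of $\ell_{\min}$, and the printed figures are purely illustrative, not measurements from the paper.

```python
def cst_stats(m: int, min_len: int = 1) -> tuple:
    """(node count, depth) of an idealized, perfectly balanced CST over m sentences."""
    if m <= min_len or m < 2:
        return 1, 1                        # leaf node
    left, right = m // 2, m - m // 2       # balanced split
    ln, ld = cst_stats(left, min_len)
    rn, rd = cst_stats(right, min_len)
    return 1 + ln + rn, 1 + max(ld, rd)

print(cst_stats(64, min_len=1))   # (127, 7): 2n - 1 nodes, depth ~ log2(n) + 1
print(cst_stats(64, min_len=8))   # (15, 4): a coarser tree with fewer, broader queries
```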
4. CST in Data Filtering and Contrastive Scoring
Once a candidate pool $Q$ of (context, query) pairs has been generated for a context, CST integrates with a lightweight contrastive-learning-based scorer $S$. This scorer is trained as follows:
- Positive samples $q^{+}$: queries generated by CST with the full few-shot prompt
- Negative samples $q^{-}$: queries generated from degraded prompts (weakened instructions or fewer shots)
Contrastive loss: $\mathcal{L} = -\log \sigma\big(S(q^{+}) - S(q^{-})\big)$, with $\sigma$ the sigmoid function.
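A minimal PyTorch sketch of this pairwise form of the loss follows; `score_pos` and `score_neg` are assumed to be batches of scalar scores $S(q^{+})$ and $S(q^{-})$, and the exact formulation in the original work may differ in detail.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    # L = -log sigma(S(q+) - S(q-)), averaged over the batch:
    # the scorer is pushed to rank full-prompt CST queries above degraded ones.
    return -F.logsigmoid(score_pos - score_neg).mean()

# Example with dummy scores for a batch of 8 (positive, negative) pairs
loss = pairwise_contrastive_loss(torch.randn(8), torch.randn(8))
```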
During inference, all candidates in $Q$ are scored and sorted. The top-ranked queries are selected greedily, skipping any whose ROUGE-L F1 overlap with a previously selected query exceeds a threshold $\tau$ (to ensure diversity), until $K$ slots are filled. If insufficient diverse queries remain, CST is rerun.
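The selection step can be sketched as a greedy loop over score-sorted candidates with a plain LCS-based ROUGE-L F1 check; the helper names and the default threshold `tau = 0.7` are illustrative assumptions rather than values from the paper.

```python
from typing import List, Tuple

def _lcs_len(a: List[str], b: List[str]) -> int:
    # rolling-row dynamic program for longest-common-subsequence length
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(hyp: str, ref: str) -> float:
    """ROUGE-L F1 between two strings under whitespace tokenization."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    lcs = _lcs_len(h, r)
    p, rec = lcs / len(h), lcs / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

def select_diverse_top_k(scored: List[Tuple[List[str], str, float]], k: int, tau: float = 0.7):
    """Walk candidates in descending score order; keep a (context, query) pair only
    if its query overlaps every already-kept query by at most tau ROUGE-L F1."""
    kept: List[Tuple[List[str], str]] = []
    for ctx, query, _score in sorted(scored, key=lambda t: t[2], reverse=True):
        if all(rouge_l_f1(query, q) <= tau for _, q in kept):
            kept.append((ctx, query))
        if len(kept) == k:
            break
    return kept
```

A function of this shape could serve as the `select_fn` passed to the workflow sketch in Section 2.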
5. Illustrative Example on a Toy Text
For the passage: "The profits of the contemporary global value chains (GVC) form a V-shape, also known as the ‘smile curve’. At one end are R & D and design; at the other end are services and marketing; processing sits in the middle. Profits at the ends are 20–25%; in the middle only 5%."
CST constructs the following (abbreviated) tree:
- Node 1 (whole passage): "Why do entrepreneurs worldwide strive to move up the value chain?"
  - C₁ (first half): "What are the key components of the contemporary global value chains?"
    - C₁₁: "What does the global value curve look like?" (leaf; no further split)
    - C₁₂: "What is the structure of the smile curve?"
      - C₁₂₁: "What lies in the middle of the smile curve?" (leaf)
  - C₂ (second half): Questions about low-profit-margin industries
In total, CST produces 8 binary-tree nodes and corresponding context–query pairs. The top 4 are retained post-filtering, covering granularities from macro-level down to specific details.
6. Empirical Evaluation and Metrics
CST yields substantial improvements in SFT data quality and downstream model performance. Key empirical findings:
- Human evaluation on the DailyM test set:
  - Query Realism: 4.37 (CST) vs 4.05 (best prior)
  - Query Diversity: 4.68 (CST) vs 4.13
- Automatic QA benchmarks (fine-tuning Llama3-70B):
| Benchmark | Acc (CST, AugCon) | Acc (Context-Instruct) |
|-------------------|-------------------|------------------------|
| SQuAD1.1 | 0.336 | 0.314 |
| TriviaQA | 0.849 | 0.825 |
| DROP | 0.350 | 0.334 |
| WebGLM-QA BS | 0.924 | 0.885 |
- Granularity distribution (CST vs Context-Instruct):
| Type | CST (%) | Context-Instruct (%) |
|-------------|---------|----------------------|
| Detail | 37.8 | 17.9 |
| Concept | 35.3 | 63.4 |
| Macro | 26.9 | 18.7 |
- Ablations (on TriviaQA):
  - Removing CST: accuracy drops from 0.849 to 0.793
  - Removing contrastive filter: 0.828
  - Removing fidelity module: 0.833
- Compute-matched comparison (80 A100 GPU hours):
  - AugCon wins 64.5% of GPT-4-refereed comparisons against ETRC and 60.3% against Context-Instruct
These results demonstrate that CST’s recursive, tree-structured coverage of context enables generation of highly diverse, realistic, and fidelity-aligned SFT data, with significant impact on the quality of LLM fine-tuning outcomes (Quan, 2024).