
QTree Construction for Hierarchical Querying

Updated 6 February 2026
  • QTree Construction is a structured 3-level, 3-ary tree that hierarchically decomposes a base query into subqueries.
  • It generates subqueries with strict inclusion or exclusion constraints to form connected, four-node outlines for comprehensive retrieval.
  • The construction pipeline integrates LLM-based generation, heuristic repair, and preference optimization to benchmark retrieval performance.

A QTree (Query Tree) denotes a specific hierarchical structure used for organizing subqueries in coverage-conditioned retrieval-augmented generation, with prominent recent instantiations represented in the QTree dataset for outline construction and evaluation in coverage-conditioned (C^2) querying. In this context, a QTree is a strictly regular 3-level k-ary tree (with k=3, depth D=3), designed to systematically unfold the subtopic space of a base user question, enabling the creation of principled, efficiently searchable outlines that satisfy complex inclusion/exclusion criteria. QTree construction is utilized for benchmarking and optimizing LLM-guided information search, particularly within large-scale retrieval systems and preference-tuned outline planning (Kim et al., 2024).

1. Formal Definition and Notation

A QTree T_{q_{base}} is defined on the basis of a “base” question q_{base} and is composed as follows:

  • Depth: D = 3
  • Branching factor: b = 3
  • Node set: V = \{q_{i_1 \dots i_d} \mid 1 \le d \le D,\ i_j \in \{1,\dots,b\}\}, with |V| = \sum_{d=1}^{D} b^d = 39 (excluding the root)
  • Each node q_{i_1 \dots i_d} represents a natural-language subquestion recursively refining q_{base}
  • Edge set: E = \{(q_{i_1 \dots i_{d-1}},\ q_{i_1 \dots i_d}) \mid 1 \le d \le D\}, forming an arborescence rooted at q_{base}

The tree is rooted, ordered, and strictly uniform: every node at depth d < D has exactly b children, and nodes at maximum depth are leaves. Subquery identifiers are canonicalized as q_1, q_{21}, q_{213}, etc., to denote hierarchical position.
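The indexing scheme above can be made concrete with a short enumeration. This is an illustrative sketch (function name and 1-based tuple encoding are my own, not from the paper):

```python
from itertools import product

def qtree_node_ids(depth: int = 3, branching: int = 3) -> list[tuple[int, ...]]:
    """Enumerate canonical QTree node identifiers, excluding the root:
    every index tuple (i_1, ..., i_d) with 1 <= d <= depth and i_j in 1..branching.
    The parent of a node (i_1, ..., i_d) is simply (i_1, ..., i_{d-1})."""
    ids: list[tuple[int, ...]] = []
    for d in range(1, depth + 1):
        ids.extend(product(range(1, branching + 1), repeat=d))
    return ids

ids = qtree_node_ids()
print(len(ids))  # 3 + 9 + 27 = 39, matching |V| above
```

The tuple encoding makes parent lookup a constant-time slice, which is convenient for the connectivity checks used later in outline validation.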

Coverage-conditioned queries, or C^2 queries, augment q_{base} with a “coverage query” q_{cov} specifying inclusion or exclusion of particular nodes/subtrees within T_{q_{base}}.

2. Construction Pipeline

The QTree construction for the QTree dataset follows a multi-stage process (Kim et al., 2024):

A. Base Query Collection:

  • Base questions are sourced from datasets targeting information-seeking and expert queries (ASQA, Longform, ExpertQA)
  • Data cleaning removes extraneous or malformed entries, yielding approximately 10,577 unique train/test base queries

B. Hierarchical Decomposition (QTree Generation):

  • For each q_{base}, an LLM (e.g., GPT-4) is prompted to generate a 3-level, strictly 3-ary hierarchy of subquestions
  • Requirements: depth and branching fixed at 3, all nodes in question format, no duplicates or semantic overlap
  • Outputs are filtered for shape validity; malformed generations are heuristically repaired or re-prompted
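A shape-validity filter of the kind described can be sketched as follows. This is a minimal illustration under the assumption that generations are parsed into nested dicts mapping each subquestion to its children; the actual repair and re-prompting logic is LLM-specific and not shown:

```python
def validate_qtree(tree: dict, depth: int = 3, branching: int = 3) -> list[str]:
    """Collect shape violations for an LLM-generated hierarchy.
    `tree` maps each subquestion string to a dict of its children (leaves map to {})."""
    errors: list[str] = []
    seen: set[str] = set()

    def walk(children: dict, d: int) -> None:
        # Every internal level must have exactly `branching` nodes; below
        # the maximum depth, the children dict must be empty.
        expected = branching if d <= depth else 0
        if len(children) != expected:
            errors.append(f"depth {d}: expected {expected} nodes, got {len(children)}")
        for q, sub in children.items():
            if not q.strip().endswith("?"):
                errors.append(f"not in question format: {q!r}")
            if q in seen:
                errors.append(f"duplicate subquestion: {q!r}")
            seen.add(q)
            walk(sub, d + 1)

    walk(tree, 1)
    return errors
```

A generation passing with an empty error list is accepted; otherwise it would be routed to heuristic repair or a re-prompt. Note that exact-string duplicate detection is a weak proxy for the paper's "no semantic overlap" requirement, which needs an LLM or embedding check.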

C. Coverage Query Synthesis:

  • For each tree T_{q_{base}}, a node s \in V is sampled as a “background query”
  • An intent operation (inclusion/exclusion) is randomly assigned
  • Coverage query q_{cov} is synthesized via LLM prompt templates such as “Please include/avoid topic X”
  • For each base query, a single C^2 query [q_{base}; q_{cov}] is retained after parsing candidate coverages
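The sampling steps above can be sketched as follows. The template strings and function name are hypothetical stand-ins; in the actual pipeline q_{cov} is synthesized by prompting an LLM rather than by filling a fixed template:

```python
import random

# Hypothetical templates standing in for the paper's LLM prompt-based synthesis.
TEMPLATES = {
    "inclusion": "Please include the topic: {topic}",
    "exclusion": "Please avoid the topic: {topic}",
}

def synthesize_coverage_query(nodes: list[str], rng: random.Random) -> tuple[str, str, str]:
    """Sample a background node s, randomly assign an intent, and render q_cov."""
    s = rng.choice(nodes)                           # background query s in V
    intent = rng.choice(["inclusion", "exclusion"])  # intent operation
    q_cov = TEMPLATES[intent].format(topic=s)
    return s, intent, q_cov

rng = random.Random(0)
s, intent, q_cov = synthesize_coverage_query(
    ["plot of the film", "production history", "critical reception"], rng)
print(intent, "->", q_cov)
```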

D. Outline Candidate Generation:

  • For each C^2 query, the construction algorithm generates 3 candidate outlines by sequentially prompting the LLM to extract a connected set O of exactly 4 nodes from T_{q_{base}} that honors the coverage constraint
  • Outlines are validated to be connected (as path/tree subgraphs), deduplicated, and coverage-compliant (“inclusion” implies O \cap subtree(s) \neq \emptyset; “exclusion” implies O \cap subtree(s) = \emptyset)
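The structural checks above can be sketched with the tuple node encoding (a minimal illustration, assuming outlines are drawn from V, i.e., the root is excluded; the function names are my own):

```python
Node = tuple[int, ...]  # e.g. (2, 1, 3) identifies q_{213}; () would be the root q_base

def is_connected(outline: set[Node]) -> bool:
    """In a tree, an induced node set is connected iff exactly one member's
    parent lies outside the set (that member is the outline's local root)."""
    local_roots = [n for n in outline if n[:-1] not in outline]
    return len(local_roots) == 1

def in_subtree(node: Node, s: Node) -> bool:
    # A node lies in subtree(s) iff its index tuple has s as a prefix.
    return node[: len(s)] == s

def is_valid_outline(outline: set[Node], s: Node, intent: str) -> bool:
    if len(outline) != 4 or not is_connected(outline):
        return False
    hits = any(in_subtree(n, s) for n in outline)
    return hits if intent == "inclusion" else not hits

# A parent with its three children forms a connected 4-node outline.
outline = {(1,), (1, 1), (1, 2), (1, 3)}
print(is_valid_outline(outline, s=(3,), intent="exclusion"))  # True: disjoint from subtree((3,))
```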

3. Coverage Constraints and Objective Metrics

Coverage-conditioned querying is formalized as follows:

  • For “Inclusion” intents, an outline O \subseteq V must intersect the subtree rooted at s
  • For “Exclusion” intents, O must be disjoint from subtree(s)
  • Outlines are scored by an LLM judge (GPT-4) on a 1–5 scale, representing the degree of compliance with C2C^2 constraints

These scores are used not only for dataset annotation but also as reward signals for subsequent preference-alignment in outline generation models.

4. QPlanner Model and Training Paradigm

The resulting dataset is leveraged to train "QPlanner," a 7B-parameter Llama-2-based model:

  • Model input: C^2 query, consisting of the base question and the coverage intent
  • Model output: a full QTree T plus a 4-item compliant outline
  • Training is conducted via:
    • Supervised fine-tuning (SFT) on (C^2, O) pairs
    • Direct Preference Optimization (DPO), which uses pairwise outline preferences derived from LLM-annotated scores, with a KL-regularization penalty imposed against a reference policy. Only the highest- and lowest-scored outlines per C^2 query are used as preference signals.

DPO objective:

L_{DPO} = -\mathbb{E}_{(C^2,\,O^+,\,O^-)}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(O^+ \mid C^2)}{\pi_{\mathrm{ref}}(O^+ \mid C^2)} - \beta\log\frac{\pi_\theta(O^- \mid C^2)}{\pi_{\mathrm{ref}}(O^- \mid C^2)}\right)\right]

where \sigma(\cdot) is the logistic sigmoid, \pi_\theta is QPlanner's policy, \pi_{\mathrm{ref}} is the frozen reference (SFT) policy, and \beta = 0.01 controls the strength of the implicit KL regularization.
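The per-pair objective can be written out in plain Python, assuming summed token log-probabilities for each outline are already available from the policy and the reference model (the function name and toy numbers are illustrative):

```python
import math

def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.01) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin), where
    an outline's implicit reward is its policy-vs-reference log-probability ratio."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already prefers O+ relative to the reference, the margin is
# positive and the loss falls below log 2 (the zero-margin value).
loss = dpo_loss(logp_pos=-10.0, logp_neg=-14.0,
                ref_logp_pos=-12.0, ref_logp_neg=-12.0, beta=0.01)
print(loss < math.log(2))
```

With the small \beta = 0.01 reported above, the margin is scaled down heavily, so large log-ratio differences are needed before the loss saturates.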

5. Illustrative Example: End-to-End Construction

For q_{base} = “Describe the film The Woman Hunt”, the generated QTree features major facets such as plot, production, and reception. Suppose q_{cov} is an exclusion of the “reception” subtree. The outline construction process yields an output such as:

  • “What is the plot of The Woman Hunt?”
  • “What are the main events in The Woman Hunt?”
  • “What initiates the conflict in The Woman Hunt?”
  • “What is the climax of The Woman Hunt?”

This outline forms a valid, connected subset, excludes nodes under the reception branch, and passes all structural and coverage constraints. An LLM judge rates the output, and the observed score contributes to preference-based training.

6. Experimental Results and Empirical Analysis

Empirical evaluation demonstrates:

  • Outlines generated by the preference-tuned QPlanner satisfy C^2 coverage criteria better than those from SFT-only and baseline models, per both LLM and human judges
  • QTree structures support fine-grained, interpretable filtering and composition of subqueries under controlled inclusion/exclusion
  • On large-scale data, the QTree+QPlanner pipeline efficiently produces aligned outlines for approximately 10^4 coverage-conditioned queries, supporting rapid evaluation in RAG contexts (Kim et al., 2024)

7. Scope, Limitations, and Practical Relevance

QTree construction, as instantiated in (Kim et al., 2024), is tailored to structured decomposition of information-seeking questions. It is agnostic to domain, provided the base queries are suitable for hierarchical topical expansion. The approach fundamentally depends on LLMs’ capacity for stable, high-coverage question generation and outline selection; malformed base queries or degenerate decompositions require manual or heuristic post-filtering. Outline extraction is strictly limited to connected 4-node subgraphs, which may not capture all semantically optimal combinations. A plausible implication is that QTree-based querying is most effective where topic structures are inherently hierarchical and easily segmented by breadth-first subtopic enumeration, and may need adaptation for deep or irregular hierarchies.

The QTree construction pipeline enables systematic benchmarking of retrieval-augmented generation with explicit outline constraints, and informs the development of preference-aligned planners that respect user-specified coverage in complex information spaces (Kim et al., 2024).
