
Hierarchical Graph of Thoughts (HGOT)

Updated 10 February 2026
  • HGOT is a hierarchical reasoning framework that organizes LLM thought processes into a multilayer directed acyclic graph with nodes representing sub-tasks and dependencies.
  • It generalizes prior approaches like Chain-, Tree-, and Graph-of-Thoughts to enhance retrieval-augmented in-context learning and factuality evaluation.
  • The framework’s theoretical convergence guarantees and empirical benchmarks demonstrate its practical impact in achieving consistent, self-correcting reasoning.

A Hierarchical Graph of Thoughts (HGOT) is a structured framework for modeling, organizing, and leveraging the reasoning process of LLMs, in which the generation and evaluation of “thoughts” (intermediate reasoning steps) are represented as a multilayered, directed acyclic graph. Each node in the graph encodes a sub-task, intention, or message, and the edges capture dependency relations. HGOT generalizes and systematizes prior approaches such as Chain-of-Thought (CoT), Tree-of-Thoughts (ToT), and Graph of Thoughts (GoT), providing a foundation for both the theoretical analysis of LLM generation and practical advances in retrieval-augmented in-context learning and factuality evaluation (Tutunov et al., 2023; Fang et al., 2024; Besta et al., 2023).

1. Theoretical Underpinnings of the Hierarchical Model

The two-level hierarchical graphical model formalizes LLM reasoning as a generative process involving latent contexts and intentions. The top level represents a global context $c$, which defines the overall “mode” of reasoning (e.g., arithmetic, commonsense inference). The lower level comprises latent intentions $\theta_0, \dots, \theta_M$ and observable messages $x_0, \dots, x_M$ corresponding to natural-language realizations of each sub-thought.

The generative model specifies:

  • Global context $c \sim q(c)$ over a finite set $\mathcal{C}$.
  • Initial intention: $\theta_0 \sim q(\theta_0 \mid c)$, then message $x_0 \sim q(x_0 \mid \theta_0)$.
  • Recursively, for $i \geq 1$: $\theta_i \sim q(\theta_i \mid c, \theta_{0:i-1}, x_{0:i-1})$ and $x_i \sim q(x_i \mid \theta_i)$.
  • A terminal latent state $\theta_{\mathrm{END}}$ with $x_{\mathrm{END}} = \langle\mathrm{END}\rangle$ yields variable-length reasoning chains.

The joint probability is given by

$$q(c, \Theta, X) = q(c)\, q(\theta_0 \mid c)\, q(x_0 \mid \theta_0) \prod_{i=1}^{M} q(\theta_i \mid c, \theta_{0:i-1}, x_{0:i-1})\, q(x_i \mid \theta_i),$$

with the marginal likelihood over messages, $q(X)$, as the object of interest.
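The two-level generative process can be sketched in code. The distributions below are toy stand-ins (a fixed stopping rule and canned string templates), not the model's actual conditionals; the structure of the sampling loop is what matters:

```python
import random

# Toy sketch of the two-level hierarchical generative model.
# The specific distributions here are illustrative stand-ins.
CONTEXTS = ["arithmetic", "commonsense"]
END = "<END>"

def sample_context():
    return random.choice(CONTEXTS)              # c ~ q(c) over finite C

def sample_intention(c, intentions, messages):
    # theta_i ~ q(theta_i | c, theta_{0:i-1}, x_{0:i-1});
    # toy rule: emit the terminal intention after 3 steps
    if len(intentions) >= 3:
        return "theta_END"
    return f"{c}-step-{len(intentions)}"

def sample_message(theta):
    # x_i ~ q(x_i | theta_i): each message depends only on its intention
    return END if theta == "theta_END" else f"text({theta})"

def sample_chain():
    c = sample_context()
    intentions, messages = [], []
    while True:
        theta = sample_intention(c, intentions, messages)
        x = sample_message(theta)
        intentions.append(theta)
        messages.append(x)
        if x == END:                            # terminal state => variable length
            return c, intentions, messages

c, thetas, xs = sample_chain()
print(c, xs)
```

Note how the context $c$ conditions every intention (global consistency) while each message depends only on its own intention (local coherence), mirroring the factorization of the joint probability.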

This probabilistic framework renders explicit the conditional dependencies that ensure both local coherence (via intentions) and global consistency (via context), forming what is termed the Hierarchical Graph of Thoughts (Tutunov et al., 2023).

2. Geometric Convergence Rate in Few-shot Inference

When LLMs are prompted with a set of example chains $Z_1, \dots, Z_N$ and a query $x_0$, the model $p_{\mathrm{LLM}}$ is assumed to approximate the target marginals $q((x_i)_{i=1 \dots m} \mid x_0, \dots)$. The geometric convergence theorem provides a formal guarantee on the approximation quality relative to the oracle, context-conditioned likelihood $q((x_r)_{r=1 \dots m} \mid x_0, c^*)$.

Let the sequence ambiguity be $\epsilon(W) := 1 - q(\text{the unique } (c, \Theta) \text{ that generated } W \mid W)$. Under a uniform-context prior, the following bound holds:

$$\left| p_{\mathrm{LLM}}(x_r \mid x_0, Z_1, \dots, Z_N) - q(x_r \mid x_0, c^*) \right| \leq \eta \prod_{k=1}^{N} \frac{\epsilon(Z_k)}{1 - \epsilon(Z_k)},$$

where $\eta = 2\,\frac{\epsilon(x_0)}{1 - \epsilon(x_0)}$. With $\epsilon(Z_k) \leq \delta < 1/2$, the bound decays exponentially in $N$:

$$\left| \cdots \right| \leq \eta\, \rho^N, \quad \text{for } \rho = \frac{\delta}{1 - \delta} < 1.$$

This result shows that increasing the number of disambiguating example chains $N$, or reducing their ambiguity $\epsilon$, sharply improves the probability of correct, context-appropriate reasoning generation. HGOT thus provides a theoretical justification for the success of few-shot CoT and its generalizations (Tutunov et al., 2023).
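A quick numerical illustration makes the exponential decay concrete. The ambiguity values below ($\epsilon(x_0) = 0.3$, $\delta = 0.2$) are illustrative choices, not figures from the paper:

```python
# Evaluate the geometric bound eta * rho**N for chosen ambiguity levels.
def bound(eps_query, delta, N):
    eta = 2 * eps_query / (1 - eps_query)   # eta = 2*eps(x0)/(1-eps(x0))
    rho = delta / (1 - delta)               # rho < 1 whenever delta < 1/2
    return eta * rho ** N

# With delta = 0.2 per example chain and query ambiguity 0.3,
# the bound shrinks by a factor rho = 0.25 per added example:
for N in (1, 4, 8):
    print(N, bound(0.3, 0.2, N))
```

Doubling the number of example chains squares the contraction factor, which is why even a handful of low-ambiguity demonstrations suffices to pin down the latent context.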

3. HGOT Framework for Retrieval-Augmented In-Context Learning

In retrieval-augmented factuality evaluation, HGOT is concretized as a multilayer DAG. Each node in layer $\ell$ is a sub-query $q$ with an associated retrieval context and preliminary answer. Directed edges encode dependencies such that the answer to $q_i$ is a prerequisite for $q_j$ (written $q_i \to q_j$).

The procedural implementation is as follows (Fang et al., 2024):

  • PROBE: Issue $q$ to retrieval + LLM to obtain $(a_q, \mathrm{CI}_q, \mathrm{CTX}_q)$.
  • PLAN: Decompose $q$ into sub-queries via LLM planning prompts and enumerate dependencies.
  • SEARCH: Traverse the DAG in topological order, rewriting sub-queries with predecessors’ answers and recursing via TRAVERSE.
  • INFER: Score all retrieved passages, then perform weighted self-consistency majority voting over candidate answers (see Section 4).

Emergent planning is induced through divide-and-conquer (PLAN) prompts, orchestrating the hierarchical breakdown and answer synthesis critical to the HGOT approach.
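The PLAN/SEARCH loop above can be sketched as a recursive traversal. In this sketch, `llm_plan` and `llm_answer` are hypothetical stand-ins for the actual planning and retrieval-augmented answering prompts, with canned outputs for a single example question:

```python
from graphlib import TopologicalSorter

def llm_plan(query):
    # PLAN stand-in: return (sub-query template, predecessor indices) pairs.
    if query == "Who directed the film that won Best Picture in 1995?":
        return [("Which film won Best Picture in 1995?", []),
                ("Who directed {0}?", [0])]
    return []                                   # no decomposition: leaf query

def llm_answer(query):
    # PROBE stand-in: retrieval + LLM would return an answer here.
    canned = {"Which film won Best Picture in 1995?": "Forrest Gump",
              "Who directed Forrest Gump?": "Robert Zemeckis"}
    return canned.get(query, "unknown")

def traverse(query):
    plan = llm_plan(query)
    if not plan:                                # leaf: answer directly
        return llm_answer(query)
    deps = {i: set(pred) for i, (_, pred) in enumerate(plan)}
    answers = {}
    for i in TopologicalSorter(deps).static_order():  # SEARCH: topological order
        sub, preds = plan[i]
        sub = sub.format(*[answers[p] for p in preds])  # rewrite with predecessors
        answers[i] = traverse(sub)                      # recurse via TRAVERSE
    return answers[len(plan) - 1]

print(traverse("Who directed the film that won Best Picture in 1995?"))
```

The key structural point is that a sub-query's text is only finalized once its predecessors have been answered, which is exactly what the topological ordering over the DAG guarantees.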

4. Thought-Quality Metrics and Voting Mechanisms

Evaluation and aggregation of LLM-generated “thoughts” are carried out using citation-aware metrics. For a thought $\tau$ and ground-truth citations $\mathrm{GT}$:

$$\mathrm{CR}(\tau) = \frac{|\{\text{citations in } \tau\} \cap \mathrm{GT}|}{|\mathrm{GT}|}, \qquad \mathrm{CP}(\tau) = \frac{|\{\text{citations in } \tau\} \cap \mathrm{GT}|}{|\{\text{citations in } \tau\}|}$$

Each sampled thought-answer pair $(\tau_i, a_i)$ receives a quality score $\rho_i = \alpha \cdot 1 + \beta\,\mathrm{CR}(\tau_i) + \gamma\,\mathrm{CP}(\tau_i)$. Weighted self-consistency majority voting selects the answer $\hat{a}^*$ maximizing the sum of $\rho_i$ over matching responses, with normalized confidence

$$\mathbf{CI} = \frac{\sum_{i=1}^{m} \rho_i\, \delta(a_i, \hat{a}^*)}{\sum_{i=1}^{m} \rho_i},$$

where $\delta$ is the Kronecker delta.

Retrieval passage scoring further incorporates weighted citation frequencies and iteratively updates passage scores:

$$\sigma(p, t+1) \leftarrow w_1\, \sigma(p, t) + w_2\, \bar{\nu}(p) + w_3\, \mathbf{CI}$$

This explicit linkage between thought quality, answer selection, and citation grounding is a defining feature of HGOT in factuality applications (Fang et al., 2024).
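The scoring and voting pipeline is straightforward to sketch. The weights $\alpha, \beta, \gamma$ and $w_1, w_2, w_3$ are tunable; the values below are illustrative defaults, not those used in the paper:

```python
from collections import defaultdict

def quality(citations, gt, alpha=1.0, beta=1.0, gamma=1.0):
    # rho = alpha*1 + beta*CR + gamma*CP, with CR/CP over citation sets
    cr = len(citations & gt) / len(gt)                               # recall
    cp = len(citations & gt) / len(citations) if citations else 0.0  # precision
    return alpha + beta * cr + gamma * cp

def weighted_vote(samples, gt):
    # samples: list of (answer, citation-set) pairs from m sampled thoughts
    rho = [quality(c, gt) for _, c in samples]
    totals = defaultdict(float)
    for (a, _), r in zip(samples, rho):
        totals[a] += r                       # sum rho_i over matching answers
    best = max(totals, key=totals.get)
    ci = totals[best] / sum(rho)             # normalized confidence CI
    return best, ci

samples = [("Paris", {"p1", "p2"}), ("Paris", {"p1"}), ("Lyon", {"p3"})]
best, ci = weighted_vote(samples, gt={"p1", "p2"})
print(best, round(ci, 3))

def update_passage(sigma, nu_bar, ci, w=(0.5, 0.3, 0.2)):
    # sigma(p, t+1) <- w1*sigma(p, t) + w2*nu_bar(p) + w3*CI
    return w[0] * sigma + w[1] * nu_bar + w[2] * ci
```

Because $\rho_i$ rewards citation grounding, a well-cited minority answer can outvote a poorly-cited majority, which is the intended behavior in factuality settings.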

5. Relation to and Generalization of Chain/Graph/Tree of Thoughts

HGOT subsumes prior paradigms:

  • Chain-of-Thought (CoT): Special case where the reasoning trace is a linear sequence, which corresponds to a depth-1 hierarchy in HGOT; the two-level latent variable model of (Tutunov et al., 2023) shows that few-shot, well-chosen CoT examples “pin down” the latent context, enabling effective reasoning.
  • Graph of Thoughts (GoT): Models the entire reasoning process as a directed graph $G = (V, E)$ of thoughts (vertices) with transformation operators for generation, aggregation, refinement, and user-defined abstractions or pruning. HGOT is extensible to arbitrary depth and graph topology, enabling sophisticated planning and reasoning workflows (Besta et al., 2023).

The relationship among these frameworks is summarized as follows:

Paradigm   Graph Structure   Latent Structure
CoT        Chain (linear)    2-layer, sequential
ToT        Tree              Hierarchical, branched
GoT        Arbitrary DAG     Multi-type, extensible
HGOT       Multilayer DAG    Explicit multi-level

GoT’s modular architecture—including distinct Prompter, Parser, Scoring, and Controller modules—enables HGOT variants by adding abstraction layers, cross-level edges, or new operators as required (Besta et al., 2023).
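A minimal sketch of that modular decomposition follows. The module names (Prompter, Parser, Scorer, Controller) come from the GoT paper, but the interfaces and stub bodies here are illustrative assumptions, not the actual framework API:

```python
class Prompter:
    def build(self, thought):                 # render a thought into a prompt
        return f"Refine: {thought}"

class Parser:
    def parse(self, llm_output):              # extract new thoughts from raw output
        return [line.strip() for line in llm_output.splitlines() if line.strip()]

class Scorer:
    def score(self, thought):                 # toy quality proxy: thought length
        return len(thought)

class Controller:
    """Orchestrates generate/score/prune steps over the thought graph."""
    def __init__(self, llm, prompter, parser, scorer):
        self.llm, self.prompter = llm, prompter
        self.parser, self.scorer = parser, scorer

    def step(self, frontier, keep=2):
        candidates = []
        for t in frontier:
            out = self.llm(self.prompter.build(t))      # generation operator
            candidates.extend(self.parser.parse(out))
        candidates.sort(key=self.scorer.score, reverse=True)
        return candidates[:keep]                        # pruning operator

# A fake LLM returning two candidate thoughts, one per line:
fake_llm = lambda prompt: "short\na much longer candidate thought"
ctl = Controller(fake_llm, Prompter(), Parser(), Scorer())
print(ctl.step(["seed thought"]))
```

Swapping in an HGOT variant then amounts to replacing single modules, e.g. a Controller that tracks layer membership, or a Scorer built on the citation metrics of Section 4, without touching the rest of the pipeline.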

6. Empirical Impact and Evaluation

HGOT yields state-of-the-art performance in retrieval-augmented factuality benchmarks. For example, on FEVER, Open-SQuAD, and HotPotQA, HGOT variants outperform or match the strongest published baselines, with up to a 7 percentage-point gain in exact match and substantial F1 improvements (e.g., FEVER: 58.35→61.50 EM; HotPotQA long: 45.26→53.98 EM compared to leading baselines) (Fang et al., 2024).

Weighted reasoning steps, citation precision/recall metrics, and passage re-ranking result in superior selection of factual and self-consistent answers. These results confirm the practical value of explicit, hierarchical reasoning and evaluation structures in contemporary LLM prompting and model design.

7. Future Prospects and Methodological Extensions

The theoretical framework underpinning HGOT suggests further advances in prompting strategies, example selection, and multi-hop retrieval architectures. In particular, the convergence theory implies that longer, more distinctive, and less ambiguous reasoning chains drive LLMs to mimic the target context with high accuracy; thus, system design should prioritize example chains and sub-thoughts minimizing posterior uncertainty over latent context and intent (Tutunov et al., 2023).

For future extensions—such as Tree- or Graph-of-Thoughts—designing multi-layered sub-thought sequences, weighting reasoning steps by their factual grounding, and integrating citation-calibrated confidence measures are projected to further enhance both model reliability and reasoning depth. The modular, transformation-based nature of HGOT within the GoT paradigm supports continued innovation in hierarchical, graph-based reasoning for LLMs (Besta et al., 2023).
