Hierarchical Hindsight Reflection (H²R)

Updated 21 January 2026
  • Hierarchical Hindsight Reflection (H²R) is a dual-memory architecture that decouples abstract planning from concrete execution to enhance task generalization.
  • It employs iterative reflection loops to distill subgoal sequences and execution insights from past interactions, optimizing reuse of learned skills.
  • Empirical results on AlfWorld and PDDLGame validate H²R’s structured memory hierarchy, demonstrating significant performance improvements over baseline methods.

Hierarchical Hindsight Reflection (H²R) is a methodology for LLM-based agents that addresses efficient knowledge transfer and memory utilization in multi-task settings. H²R introduces a hierarchical memory architecture that separates high-level planning knowledge from low-level execution strategies, enabling fine-grained, reusable knowledge to be distilled from past agent-environment interactions. This dual-memory design allows LLM agents to independently retrieve and recombine abstract and concrete experience, improving generalization and performance on novel tasks (Ye et al., 16 Sep 2025).

1. Hierarchical Memory Architecture

H²R decouples agent memory into two distinct but complementary levels:

  • High-Level Planning Memory ($\mathcal{M}_{\mathrm{high}}$):

Each memory unit $m^{\mathrm{high}}_i$ encapsulates a natural-language task description $\mathcal{X}^i$, an inferred subgoal sequence $\mathcal{G}^i = \{g^i_1, \dots, g^i_{k_i}\}$, and a set of distilled planning insights $\mathcal{I}^i_{\mathrm{high}}$. The planner leverages this memory for task decomposition and selection of intermediate objectives.

  • Low-Level Execution Memory ($\mathcal{M}_{\mathrm{low}}$):

Each unit $m^{\mathrm{low}}_j$ stores a single subgoal $g^j$, its associated successful sub-trajectory $\tau^j_+$, and a set of execution insights $\mathcal{I}^j_{\mathrm{low}}$. This memory grounds subgoals into concrete actions, supporting the execution module.

This division explicitly supports compositional learning by isolating abstract planning from concrete skill execution, and each memory is organized for targeted retrieval (Ye et al., 16 Sep 2025).
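The two memory levels can be written down as plain data structures. The following is a minimal Python sketch of this organization; the class and field names are illustrative choices, not identifiers from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class HighLevelUnit:
    """One unit of M_high: task description, subgoal sequence, planning insights."""
    task_description: str                                # X^i, natural-language task prompt
    subgoals: list = field(default_factory=list)         # G^i = [g^i_1, ..., g^i_k]
    insights: list = field(default_factory=list)         # I^i_high, distilled planning insights

@dataclass
class LowLevelUnit:
    """One unit of M_low: a single subgoal, its successful sub-trajectory, execution insights."""
    subgoal: str                                         # g^j
    sub_trajectory: list = field(default_factory=list)   # tau^j_+, successful action sequence
    insights: list = field(default_factory=list)         # I^j_low, distilled execution insights

# The two memories are kept separate so that planner and executor
# can retrieve from them independently at test time.
M_high: list = []
M_low: list = []
```

Keeping the two pools as separate collections, rather than nesting execution traces inside task records, is what makes a low-level unit like "put on table" retrievable from any future task.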

2. Hierarchical Hindsight Reflection Process

H²R constructs and refines its memory structures through a sequence of nested reflection loops operating over the agent's interaction history. Given training triplets $(\mathcal{X}^i, \tau^i_+, \tau^i_-)$ of task prompt, successful trajectory, and failed trajectory:

  • High-Level Reflection:

The subgoal inference function $\mathcal{F}_{\mathrm{subgoal}}$ extracts subgoal sequences from both successful and failed attempts. Contrastive reflection (via $\mathcal{F}_{\mathrm{high}}$) isolates planning insights by comparing these subgoal decompositions and their outcomes. A high-level skeleton $(\mathcal{X}^i, \mathcal{G}^i_+, \emptyset)$ is formed, and insights are attached after batch processing.

  • Low-Level Reflection:

The successful trajectory $\tau^i_+$ is partitioned into sub-trajectories $\{\tau^i_{+,1}, \dots, \tau^i_{+,k_i}\}$ aligned with the inferred subgoals. For each subgoal $g^i_j$, contrastive execution reflection ($\mathcal{F}_{\mathrm{low}}$) extracts insights by comparing performance on successful sub-trajectories versus failures. Corresponding low-level skeletons $(g^i_j, \tau^i_{+,j}, \emptyset)$ are instantiated.

  • Grounding of Insights:

After all episodes, a dedicated grounding function $\mathcal{F}_{\mathrm{ground}}$ attaches the batch-extracted insights to their respective memory units, both high-level and low-level (Ye et al., 16 Sep 2025).

This protocol enables systematic distillation and organization of knowledge across abstraction levels, without requiring gradient-based fine-tuning.

3. Formal Algorithmic Specification

The H²R process is encapsulated by the following pseudocode:

Algorithm H²R:
Input:
  Trajectories T = { (X^i, τ^i_+, τ^i_-) }
  Modules: F_subgoal, F_high, F_trajectory, F_low, F_ground
Output:
  M_high, M_low

Initialize:
  M_high ← ∅
  M_low  ← ∅
  I_high ← ∅
  I_low  ← ∅

For each (X^i, τ^i_+, τ^i_-) in T do
  G^i_+ ← F_subgoal(X^i, τ^i_+)
  G^i_- ← F_subgoal(X^i, τ^i_-)
  I_high ← F_high(X^i, τ^i_+, τ^i_-, G^i_+, G^i_-, I_high)
  Add (X^i, G^i_+, ∅) to M_high

  T_sub ← F_trajectory(X^i, τ^i_+, G^i_+)
  For j = 1…|G^i_+| do
    g_j ← G^i_{+,j}
    τ_{+,j} ← T_sub[j]
    I_low ← F_low(g_j, τ_{+,j}, τ^i_-, I_low)
    Add (g_j, τ_{+,j}, ∅) to M_low
  End For
End For

For each (X^i, G^i_+, _) in M_high do
  I^i_high ← F_ground((X^i, G^i_+), I_high)
  Replace third slot by I^i_high
End For

For each (g_j, τ_{+,j}, _) in M_low do
  I^j_low ← F_ground((g_j, τ_{+,j}), I_low)
  Replace third slot by I^j_low
End For

Return M_high, M_low

No trainable parameters are updated; memories are built and refined through iterative prompting (Ye et al., 16 Sep 2025).
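The pseudocode above translates directly into Python. In the sketch below the reflection modules ($F_{\mathrm{subgoal}}$, $F_{\mathrm{high}}$, $F_{\mathrm{trajectory}}$, $F_{\mathrm{low}}$, $F_{\mathrm{ground}}$) are passed in as callables, since in H²R they are realized as LLM prompts rather than trained functions; this is a structural sketch, not the authors' implementation:

```python
def h2r_build_memories(trajectories, F_subgoal, F_high, F_trajectory, F_low, F_ground):
    """Build hierarchical memories from (task, success, failure) triplets.

    trajectories: iterable of (X, tau_plus, tau_minus)
    F_*: callables standing in for the LLM-prompted reflection modules.
    Returns (M_high, M_low) as lists of 3-tuples (skeleton + grounded insights).
    """
    M_high, M_low = [], []
    I_high, I_low = [], []          # batch-level insight pools

    for X, tau_plus, tau_minus in trajectories:
        # High-level reflection: infer subgoals from both outcomes, then contrast them.
        G_plus = F_subgoal(X, tau_plus)
        G_minus = F_subgoal(X, tau_minus)
        I_high = F_high(X, tau_plus, tau_minus, G_plus, G_minus, I_high)
        M_high.append((X, G_plus, None))        # skeleton; insights attached later

        # Low-level reflection: partition the success into per-subgoal sub-trajectories.
        T_sub = F_trajectory(X, tau_plus, G_plus)
        for g, tau_sub in zip(G_plus, T_sub):
            I_low = F_low(g, tau_sub, tau_minus, I_low)
            M_low.append((g, tau_sub, None))

    # Grounding pass: attach the batch-extracted insights to each skeleton.
    M_high = [(X, G, F_ground((X, G), I_high)) for X, G, _ in M_high]
    M_low = [(g, t, F_ground((g, t), I_low)) for g, t, _ in M_low]
    return M_high, M_low
```

Note that the grounding pass runs only after all episodes have been reflected upon, mirroring the algorithm's two-phase structure: insight pools accumulate across the whole batch before any unit is finalized.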

4. Retrieval and Utilization at Test Time

H²R's test-time memory access involves independent, vector-based retrieval over each level of the memory hierarchy:

  • High-Level (Planner) Retrieval:

$\mathcal{M}_{\mathrm{high}}^{\mathrm{relevant}} = \mathrm{top}\mbox{-}k\left[\cos\left(e(\mathcal{X}),\,e(\mathcal{X}^i)\right)\right]_{m^i \in \mathcal{M}_{\mathrm{high}}}$

Retrieved units provide subgoal sequences $\mathcal{G}$ and corresponding planning insights $\mathcal{I}_{\mathrm{high}}$.

  • Low-Level (Executor) Retrieval:

$\mathcal{M}_{\mathrm{low}}^{\mathrm{relevant}} = \mathrm{top}\mbox{-}k\left[\cos\left(e(g),\,e(g^j)\right)\right]_{m^j \in \mathcal{M}_{\mathrm{low}}}$

Relevant memory supplies successful sub-trajectories and low-level insights for action grounding.

Integration proceeds as follows:

  • The planner conditions on $\{\mathcal{X}, \mathcal{K}_{\mathrm{high}}\}$ to output the next subgoal $g$.
  • The executor conditions on $\{g, \mathcal{K}_{\mathrm{low}}\}$ to produce the atomic action $a \in \mathcal{A}$.

No loss function $\mathcal{L}$ or gradient-based optimization operates on the memories. All memory retrievals use fixed, pre-trained sentence embeddings $e(\cdot)$ and cosine similarity $s(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$ (Ye et al., 16 Sep 2025).
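Top-$k$ retrieval over fixed embeddings can be sketched in a few lines. The following assumes `embed` is any pre-trained sentence encoder standing in for $e(\cdot)$ (the paper does not prescribe a specific model); nothing here is trained:

```python
import math

def cosine(u, v):
    """s(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_retrieve(query_text, memory, embed, k=3):
    """Return the k memory units whose key is embedding-closest to the query.

    memory: list of (key_text, payload) pairs. For M_high the key is the task
    description X^i; for M_low it is the subgoal g^j. `embed` maps text to a
    fixed vector and is never updated.
    """
    q = embed(query_text)
    ranked = sorted(memory, key=lambda unit: cosine(q, embed(unit[0])), reverse=True)
    return ranked[:k]
```

Because both memories are queried with the same mechanism but different keys (task text for the planner, subgoal text for the executor), the same retrieval function serves both levels.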

5. Empirical Evaluation and Results

H²R was evaluated on two benchmarks:

  • AlfWorld: Text-based household environment, comprising 6 task types and a maximum episode length of 30 steps.
  • PDDLGame: Strategic planning domain, with 3 task types and up to 40 steps per episode.

Baseline comparisons include ReAct (no memory) and ExpeL (single-level episodic memory). The metric is held-out task success rate, averaged over three random seeds.

Method   AlfWorld (%)   PDDLGame (%)
ReAct        46.3            66.7
ExpeL        72.4            72.2
H²R          75.9            80.5

An ablation on PDDLGame illustrates the impact of the memory hierarchy:

Variant                Success Rate (%)   Δ from full H²R
full H²R                     80.5               --
– high-level memory          52.8              –27.7
– low-level memory           61.1              –19.4

Key observations:

  • H²R surpasses ExpeL by 3.5 percentage points on AlfWorld and 8.3 points on PDDLGame.
  • Both high-level and low-level memories are necessary; removing either drastically reduces performance.
  • The greater improvement on PDDLGame suggests that hierarchical decomposition is particularly advantageous in complex planning environments (Ye et al., 16 Sep 2025).

6. Concrete Example: AlfWorld Task

A representative AlfWorld scenario further clarifies H²R's operation:

  • Training Memory:

Task: “pick up cup, fill with water, place on table”. Inferred subgoals: $\mathcal{G} = \{$“grasp cup”, “fill cup”, “put on table”$\}$. High-level insight: “Always re-orient the cup before pouring.” The low-level execution memory for “put on table” stores the sub-trajectory (“move east; release cup”) and the insight “verify table is clear.”

  • Generalization to New Task:

For the novel task “cool lettuce and place on table”:

1. The planner retrieves high-level units matching “place on table,” reuses the subgoal “put on table,” and drops the irrelevant “fill/clean” subgoals.
2. The planner composes {“grasp lettuce”, “cool lettuce”, “put on table”} as the subgoal sequence.
3. The executor retrieves the low-level memory for “put on table” to generate precise action sequences, ensuring successful execution.

This demonstrates targeted transfer and compositional recombination of hierarchical memories, enhancing efficiency and flexibility (Ye et al., 16 Sep 2025).
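The plan-then-ground loop in this walkthrough can be sketched as follows. The `retrieve`, `planner_llm`, and `executor_llm` interfaces are hypothetical placeholders for the prompted models and the retrieval described in Section 4; this illustrates the control flow, not the paper's prompts:

```python
def plan_and_execute(task, M_high, M_low, retrieve, planner_llm, executor_llm):
    """Test-time loop: the planner proposes subgoals conditioned on high-level
    memory; the executor grounds each subgoal using low-level memory.

    retrieve(query, memory) -> relevant memory units (e.g. top-k cosine).
    planner_llm / executor_llm: callables standing in for prompted LLMs.
    """
    K_high = retrieve(task, M_high)            # e.g. reuses stored "put on table" plans
    subgoals = planner_llm(task, K_high)       # compose a subgoal sequence for the new task
    actions = []
    for g in subgoals:
        K_low = retrieve(g, M_low)             # per-subgoal execution memory
        actions.extend(executor_llm(g, K_low)) # ground the subgoal into atomic actions
    return actions
```

The key property shown here is that each subgoal triggers its own low-level retrieval, so a skill learned once (e.g. “put on table”) is grounded identically regardless of which task composed it.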

7. Significance and Implications

H²R advances the design of LLM-based agent architectures by enabling memory structures that are both modular and hierarchically organized. Fine-grained decomposition allows agents to efficiently reuse and recombine plans and skills across tasks, addressing the limitations of monolithic memory approaches. Empirical results underscore the necessity of both abstraction layers for generalization, corroborating the utility of explicit hierarchical memory for compositional task transfer in LLM agents (Ye et al., 16 Sep 2025). A plausible implication is that further scaling of such hierarchies could improve performance on even more complex multitask and open-ended environments.

