Hierarchical Hindsight Reflection (H²R)

Updated 21 January 2026
  • Hierarchical Hindsight Reflection (H²R) is a dual-memory architecture that decouples abstract planning from concrete execution to enhance task generalization.
  • It employs iterative reflection loops to distill subgoal sequences and execution insights from past interactions, optimizing reuse of learned skills.
  • Empirical results on AlfWorld and PDDLGame validate H²R’s structured memory hierarchy, demonstrating significant performance improvements over baseline methods.

Hierarchical Hindsight Reflection (H²R) is a methodology for LLM-based agents that addresses efficient knowledge transfer and memory utilization in multi-task settings. H²R introduces a hierarchical memory architecture that separates high-level planning knowledge from low-level execution strategies, enabling fine-grained, reusable knowledge to be distilled from past agent-environment interactions. This dual-memory design allows LLM agents to independently retrieve and recombine abstract and concrete experience, improving generalization and performance on novel tasks (Ye et al., 16 Sep 2025).

1. Hierarchical Memory Architecture

H²R decouples agent memory into two distinct but complementary levels:

  • High-Level Planning Memory ($\mathcal{M}_{\mathrm{high}}$):

Each memory unit $m^{\mathrm{high}}_i$ encapsulates a natural-language task description $\mathcal{X}^i$, an inferred subgoal sequence $\mathcal{G}^i = \{g^i_1, \dots, g^i_{k_i}\}$, and a set of distilled planning insights $\mathcal{I}^i_{\mathrm{high}}$. The planner leverages this memory for task decomposition and selection of intermediate objectives.

  • Low-Level Execution Memory ($\mathcal{M}_{\mathrm{low}}$):

Each unit $m^{\mathrm{low}}_j$ stores a single subgoal $g^j$, its associated successful sub-trajectory $\tau^j_+$, and a set of execution insights $\mathcal{I}^j_{\mathrm{low}}$. This memory grounds subgoals into concrete actions, supporting the execution module.

This division explicitly supports compositional learning by isolating abstract planning from concrete skill execution, and each memory is organized for targeted retrieval (Ye et al., 16 Sep 2025).
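The two memory levels can be written down as plain data structures. The following is a minimal Python sketch of this organization; the class and field names are illustrative choices, not identifiers from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class HighLevelUnit:
    """One unit of M_high: task description, subgoal sequence, planning insights."""
    task_description: str                                # X^i, natural-language task prompt
    subgoals: list = field(default_factory=list)         # G^i = [g^i_1, ..., g^i_k]
    insights: list = field(default_factory=list)         # I^i_high, distilled planning insights

@dataclass
class LowLevelUnit:
    """One unit of M_low: a single subgoal, its successful sub-trajectory, execution insights."""
    subgoal: str                                         # g^j
    sub_trajectory: list = field(default_factory=list)   # tau^j_+, successful action sequence
    insights: list = field(default_factory=list)         # I^j_low, distilled execution insights

# The two memories are kept separate so that planner and executor
# can retrieve from them independently at test time.
M_high: list = []
M_low: list = []
```

Keeping the two pools as separate collections, rather than nesting execution traces inside task records, is what makes a low-level unit like "put on table" retrievable from any future task.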

2. Hierarchical Hindsight Reflection Process

H²R constructs and refines its memory structures through a sequence of nested reflection loops operating over the agent's interaction history. Given training triplets $(\mathcal{X}^i, \tau^i_+, \tau^i_-)$ of task prompt, successful trajectory, and failed trajectory:

  • High-Level Reflection:

The subgoal inference function $\mathcal{F}_{\mathrm{subgoal}}$ extracts subgoal sequences from both successful and failed attempts. Contrastive reflection (via $\mathcal{F}_{\mathrm{high}}$) isolates planning insights by comparing these subgoal decompositions and their outcomes. A high-level skeleton $(\mathcal{X}^i, \mathcal{G}^i_+, \emptyset)$ is formed, and insights are attached after batch processing.

  • Low-Level Reflection:

The successful trajectory $\tau^i_+$ is partitioned into sub-trajectories $\{\tau^i_{+,1}, \dots, \tau^i_{+,k_i}\}$ aligned with the inferred subgoals. For each subgoal $g^i_j$, contrastive execution reflection ($\mathcal{F}_{\mathrm{low}}$) extracts insights by comparing performance on successful sub-trajectories versus failures. Corresponding low-level skeletons $(g^i_j, \tau^i_{+,j}, \emptyset)$ are instantiated.

  • Grounding of Insights:

After all episodes, a dedicated grounding function $\mathcal{F}_{\mathrm{ground}}$ attaches the batch-extracted insights to their respective memory units, both high-level and low-level (Ye et al., 16 Sep 2025).

This protocol enables systematic distillation and organization of knowledge across abstraction levels, without requiring gradient-based fine-tuning.

3. Formal Algorithmic Specification

The H²R process is encapsulated by the following pseudocode:

Algorithm H²R:
Input:
  Trajectories T = { (X^i, τ^i_+, τ^i_-) }
  Modules: F_subgoal, F_high, F_trajectory, F_low, F_ground
Output:
  M_high, M_low

Initialize:
  M_high ← ∅
  M_low  ← ∅
  I_high ← ∅
  I_low  ← ∅

For each (X^i, τ^i_+, τ^i_-) in T do
  G^i_+ ← F_subgoal(X^i, τ^i_+)
  G^i_- ← F_subgoal(X^i, τ^i_-)
  I_high ← F_high(X^i, τ^i_+, τ^i_-, G^i_+, G^i_-, I_high)
  Add (X^i, G^i_+, ∅) to M_high

  T_sub ← F_trajectory(X^i, τ^i_+, G^i_+)
  For j = 1…|G^i_+| do
    g_j ← G^i_{+,j}
    τ_{+,j} ← T_sub[j]
    I_low ← F_low(g_j, τ_{+,j}, τ^i_-, I_low)
    Add (g_j, τ_{+,j}, ∅) to M_low
  End For
End For

For each (X^i, G^i_+, _) in M_high do
  I^i_high ← F_ground((X^i, G^i_+), I_high)
  Replace third slot by I^i_high
End For

For each (g_j, τ_{+,j}, _) in M_low do
  I^j_low ← F_ground((g_j, τ_{+,j}), I_low)
  Replace third slot by I^j_low
End For

Return M_high, M_low

No trainable parameters are updated; memories are built and refined through iterative prompting (Ye et al., 16 Sep 2025).
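The pseudocode above translates directly into Python. In the sketch below the reflection modules ($F_{\mathrm{subgoal}}$, $F_{\mathrm{high}}$, $F_{\mathrm{trajectory}}$, $F_{\mathrm{low}}$, $F_{\mathrm{ground}}$) are passed in as callables, since in H²R they are realized as LLM prompts rather than trained functions; this is a structural sketch, not the authors' implementation:

```python
def h2r_build_memories(trajectories, F_subgoal, F_high, F_trajectory, F_low, F_ground):
    """Build hierarchical memories from (task, success, failure) triplets.

    trajectories: iterable of (X, tau_plus, tau_minus)
    F_*: callables standing in for the LLM-prompted reflection modules.
    Returns (M_high, M_low) as lists of 3-tuples (skeleton + grounded insights).
    """
    M_high, M_low = [], []
    I_high, I_low = [], []          # batch-level insight pools

    for X, tau_plus, tau_minus in trajectories:
        # High-level reflection: infer subgoals from both outcomes, then contrast them.
        G_plus = F_subgoal(X, tau_plus)
        G_minus = F_subgoal(X, tau_minus)
        I_high = F_high(X, tau_plus, tau_minus, G_plus, G_minus, I_high)
        M_high.append((X, G_plus, None))        # skeleton; insights attached later

        # Low-level reflection: partition the success into per-subgoal sub-trajectories.
        T_sub = F_trajectory(X, tau_plus, G_plus)
        for g, tau_sub in zip(G_plus, T_sub):
            I_low = F_low(g, tau_sub, tau_minus, I_low)
            M_low.append((g, tau_sub, None))

    # Grounding pass: attach the batch-extracted insights to each skeleton.
    M_high = [(X, G, F_ground((X, G), I_high)) for X, G, _ in M_high]
    M_low = [(g, t, F_ground((g, t), I_low)) for g, t, _ in M_low]
    return M_high, M_low
```

Note that the grounding pass runs only after all episodes have been reflected upon, mirroring the algorithm's two-phase structure: insight pools accumulate across the whole batch before any unit is finalized.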

4. Retrieval and Utilization at Test Time

H²R's test-time memory access involves independent, vector-based retrieval over each level of the memory hierarchy:

  • High-Level (Planner) Retrieval:

$\mathcal{M}_{\mathrm{high}}^{\mathrm{relevant}} = \mathrm{top}\mbox{-}k\left[\cos\left(e(\mathcal{X}),\,e(\mathcal{X}^i)\right)\right]_{m^i \in \mathcal{M}_{\mathrm{high}}}$

Retrieved units provide subgoal sequences $\mathcal{G}$ and corresponding planning insights $\mathcal{I}_{\mathrm{high}}$.

  • Low-Level (Executor) Retrieval:

$\mathcal{M}_{\mathrm{low}}^{\mathrm{relevant}} = \mathrm{top}\mbox{-}k\left[\cos\left(e(g),\,e(g^j)\right)\right]_{m^j \in \mathcal{M}_{\mathrm{low}}}$

Relevant memory supplies successful sub-trajectories and low-level insights for action grounding.

Integration proceeds as follows:

  • The planner conditions on $\{\mathcal{X}, \mathcal{K}_{\mathrm{high}}\}$ to output the next subgoal $g$.
  • The executor conditions on $\{g, \mathcal{K}_{\mathrm{low}}\}$ to produce the atomic action $a \in \mathcal{A}$.

No loss function $\mathcal{L}$ or gradient-based optimization operates on the memories. All memory retrievals use fixed, pre-trained sentence embeddings $e(\cdot)$ and cosine similarity $s(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$ (Ye et al., 16 Sep 2025).
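Top-$k$ retrieval over fixed embeddings can be sketched in a few lines. The following assumes `embed` is any pre-trained sentence encoder standing in for $e(\cdot)$ (the paper does not prescribe a specific model); nothing here is trained:

```python
import math

def cosine(u, v):
    """s(u, v) = (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_retrieve(query_text, memory, embed, k=3):
    """Return the k memory units whose key is embedding-closest to the query.

    memory: list of (key_text, payload) pairs. For M_high the key is the task
    description X^i; for M_low it is the subgoal g^j. `embed` maps text to a
    fixed vector and is never updated.
    """
    q = embed(query_text)
    ranked = sorted(memory, key=lambda unit: cosine(q, embed(unit[0])), reverse=True)
    return ranked[:k]
```

Because both memories are queried with the same mechanism but different keys (task text for the planner, subgoal text for the executor), the same retrieval function serves both levels.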

5. Empirical Evaluation and Results

H²R was evaluated on two benchmarks:

  • AlfWorld: Text-based household environment, comprising 6 task types and a maximum episode length of 30 steps.
  • PDDLGame: Strategic planning domain, with 3 task types and up to 40 steps per episode.

Baseline comparisons include ReAct (no memory) and ExpeL (single-level episodic memory). The metric is held-out task success rate, averaged over three random seeds.

Method   AlfWorld (%)   PDDLGame (%)
ReAct        46.3            66.7
ExpeL        72.4            72.2
H²R          75.9            80.5

An ablation on PDDLGame illustrates the impact of the memory hierarchy:

Variant                Success Rate (%)   Δ from full H²R
full H²R                     80.5               --
– high-level memory          52.8              –27.7
– low-level memory           61.1              –19.4

Key observations:

  • H²R surpasses ExpeL by 3.5 percentage points on AlfWorld and 8.3 points on PDDLGame.
  • Both high-level and low-level memories are necessary; removing either drastically reduces performance.
  • The greater improvement on PDDLGame suggests that hierarchical decomposition is particularly advantageous in complex planning environments (Ye et al., 16 Sep 2025).

6. Concrete Example: AlfWorld Task

A representative AlfWorld scenario further clarifies H²R's operation:

  • Training Memory:

Task: “pick up cup, fill with water, place on table”. Inferred subgoals: $\mathcal{G} = \{$“grasp cup”, “fill cup”, “put on table”$\}$. High-level insight: “Always re-orient the cup before pouring.” The low-level execution memory for “put on table” stores the sub-trajectory (“move east; release cup”) and the insight “verify table is clear.”

  • Generalization to New Task:

For the novel task “cool lettuce and place on table”:

1. The planner retrieves high-level units matching “place on table,” reuses the subgoal “put on table,” and drops the irrelevant “fill/clean” subgoals.
2. The planner composes {“grasp lettuce”, “cool lettuce”, “put on table”} as the subgoal sequence.
3. The executor retrieves the low-level memory for “put on table” to generate precise action sequences, ensuring successful execution.

This demonstrates targeted transfer and compositional recombination of hierarchical memories, enhancing efficiency and flexibility (Ye et al., 16 Sep 2025).
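The plan-then-ground loop in this walkthrough can be sketched as follows. The `retrieve`, `planner_llm`, and `executor_llm` interfaces are hypothetical placeholders for the prompted models and the retrieval described in Section 4; this illustrates the control flow, not the paper's prompts:

```python
def plan_and_execute(task, M_high, M_low, retrieve, planner_llm, executor_llm):
    """Test-time loop: the planner proposes subgoals conditioned on high-level
    memory; the executor grounds each subgoal using low-level memory.

    retrieve(query, memory) -> relevant memory units (e.g. top-k cosine).
    planner_llm / executor_llm: callables standing in for prompted LLMs.
    """
    K_high = retrieve(task, M_high)            # e.g. reuses stored "put on table" plans
    subgoals = planner_llm(task, K_high)       # compose a subgoal sequence for the new task
    actions = []
    for g in subgoals:
        K_low = retrieve(g, M_low)             # per-subgoal execution memory
        actions.extend(executor_llm(g, K_low)) # ground the subgoal into atomic actions
    return actions
```

The key property shown here is that each subgoal triggers its own low-level retrieval, so a skill learned once (e.g. “put on table”) is grounded identically regardless of which task composed it.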

7. Significance and Implications

H²R advances the design of LLM-based agent architectures by enabling memory structures that are both modular and hierarchically organized. Fine-grained decomposition allows agents to efficiently reuse and recombine plans and skills across tasks, addressing the limitations of monolithic memory approaches. Empirical results underscore the necessity of both abstraction layers for generalization, corroborating the utility of explicit hierarchical memory for compositional task transfer in LLM agents (Ye et al., 16 Sep 2025). A plausible implication is that further scaling of such hierarchies could improve performance on even more complex multitask and open-ended environments.

