Hierarchical Reflection Architecture

Updated 4 July 2026

Hierarchical Reflection Architecture is a multi-level design where higher layers monitor and refine lower-level operations, enabling effective cross-level adaptation.
It is applied in image processing, LLM-based agents, and software recovery by utilizing structured decompositions like multi-scale, temporal, and mnemonic hierarchies.
Empirical studies show notable improvements such as increased PSNR/SSIM in image restoration and higher success rates in agent tasks, validating its practical impact.

Hierarchical Reflection Architecture denotes a class of architectures in which reflection is organized explicitly across multiple levels of abstraction, scope, or temporal scale, rather than being treated as a single undifferentiated self-monitoring mechanism. Across the literature, the term does not refer to one universally standardized formalism; instead, it appears in several technically distinct but structurally analogous settings, including single-image reflection removal, LLM and MLLM agents, software architecture recovery, runtime reflection in semantic towers, hierarchical execution in cognitive architectures, and formal logics of layered refinement. In these settings, the common pattern is a layered organization in which higher or broader levels monitor, reinterpret, constrain, or refine lower or narrower levels, often with explicit mechanisms for cross-level propagation, decomposition, and reconstruction (Cai et al., 5 Jun 2025, Li et al., 21 Jul 2025, Ye et al., 16 Sep 2025, Rideau, 1 Jun 2026).

1. Conceptual scope and defining properties

A hierarchical reflection architecture is characterized by the coexistence of at least two structurally distinct levels at which a system represents, evaluates, or transforms its own operation. The hierarchy may be spatial, temporal, semantic, architectural, or logical, depending on domain. In single-image reflection removal, the hierarchy is spatial and representational: encoder–decoder levels, hierarchical attention windows, and dual spatial–frequency branches jointly model reflections across scales and domains (Cai et al., 5 Jun 2025). In mobile GUI agents, the hierarchy is temporal: action-level, trajectory-level, and global task-level reflectors detect and correct errors at progressively broader horizons (Li et al., 21 Jul 2025). In multi-task LLM agents, the hierarchy is mnemonic: high-level planning memory is separated from low-level execution memory, and hindsight reflection populates both levels with reusable knowledge (Ye et al., 16 Sep 2025).

A second defining property is explicit cross-level linkage. In ROS 2 architecture recovery, AtomicRosNodeClassifier, ComposedRosNodeClassifier, and RosNodePart encode a hierarchical structural decomposition in which node instances are typed by either atomic or composed classifiers, and launch-file inclusion induces nested subsystem structure (Briechle et al., 19 May 2026). In k-layered transition systems, the compatibility predicate $D^n$ constrains which tuples $(w_0,\dots,w_k)$ constitute valid cross-layer configurations, and transition relations $R_k$ act over these tuples rather than over isolated states (Madeira et al., 2016). In runtime semantic towers, safe-point subsets $O \subseteq C.o$ and partial interpretation functors $\Phi : O \to A$ define when a more concrete computation can be observed at a more abstract level (Rideau, 1 Jun 2026).

A third recurring property is reflection-driven adaptation. In agentic systems, reflection modules generate corrective feedback or reusable experience. In architectural systems, reflection enables reconstruction or evolution of structure. In formal settings, reflection principles increase deductive strength or refine behavioral specifications. This suggests a general interpretation: a hierarchical reflection architecture is not merely layered representation, but layered self-reference with operational consequences.

2. Image restoration interpretation: hierarchical modeling of reflection phenomena

In single-image reflection removal, the term acquires a particularly concrete computational meaning through the F2T2-HiT architecture for SIRR (Cai et al., 5 Jun 2025). The problem is posed as decomposing an observed image $I$ into transmission and reflection layers,

$I = T + R \quad\text{with the goal that}\quad T \approx B,$

where $B$ is the ground-truth reflection-free image. The motivating claim is that real reflections are highly heterogeneous in intensity, shape and structure, light source, spatial scale, and image coverage, making reflection removal intrinsically multi-scale and context-dependent (Cai et al., 5 Jun 2025).

The architecture realizes hierarchy at three distinct levels. First, it uses a U-shaped encoder–decoder structure akin to UNet/NAFNet, yielding a spatial multi-scale hierarchy from shallow, high-resolution detail to low-resolution global context. Second, it incorporates Hierarchical Transformer (HiT) blocks with several window sizes, including $4\times4$ and $8\times8$, providing hierarchical attention within each feature level. Third, it includes FFT Transformer (F2T2) blocks that combine a spatial-domain branch and a frequency-domain branch, creating a domain hierarchy between local spatial cues and global frequency structure (Cai et al., 5 Jun 2025).

The frequency branch applies a 2D FFT to each channel and then a learnable modulation of the resulting spectrum, while the spatial branch extracts features through multi-kernel depthwise convolutions. The outputs of the two branches are then fused.

This arrangement gives the model image-wide receptive fields via FFT while preserving local texture and boundary structure through spatial processing (Cai et al., 5 Jun 2025).
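The division of labor between a global frequency branch and a local spatial branch can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the F2T2 block itself: the signal is 1D rather than a 2D feature map, the transform is a naive DFT, the "learnable" gains and kernel are fixed lists, and the branches are fused by simple addition. The function names are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a 1D sequence (O(n^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def frequency_branch(x, gains):
    """Global branch: every output sample depends on every input sample.

    `gains` stands in for a learnable per-frequency modulation (assumption).
    Only the real part is kept, which is exact when the gains are real and
    symmetric across positive and negative frequencies.
    """
    modulated = [g * s for g, s in zip(gains, dft(x))]
    return [v.real for v in idft(modulated)]

def spatial_branch(x, kernel):
    """Local branch: zero-padded sliding-window filter with an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out

def fuse(x, gains, kernel):
    """Additive fusion of the two branches (one possible fusion rule, assumed)."""
    return [f + s for f, s in zip(frequency_branch(x, gains), spatial_branch(x, kernel))]
```

With all gains set to 1 and an identity kernel `[0, 1, 0]`, both branches return the input and the fused output is twice the input. Keeping only the zero-frequency gain makes the frequency branch return the global mean at every position, which is the "image-wide receptive field" effect in its simplest form.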

The HiT block retains a Transformer-like outer structure but performs hierarchical window-based attention. For a windowed feature tensor, the standard query, key, and value projections are used conceptually, while spatial-scale correlation and channel-scale correlation are aggregated with linear complexity relative to window size. This allows large windows without quadratic self-attention cost (Cai et al., 5 Jun 2025).

Empirically, the full model “NAFNet+HiT+F2T2” attains an average PSNR of 25.57 dB and SSIM 0.894, compared with 23.97 dB / 0.882 for the NAFNet baseline, 24.78 dB / 0.888 for “NAFNet+Restormer,” and 25.00 dB / 0.891 for “NAFNet+HiT” (Cai et al., 5 Jun 2025). On SIR² it achieves 25.72 dB / 0.903 SSIM. These results support the interpretation that a hierarchical reflection architecture in image restoration is one that models reflection artifacts across spatial resolution, attention scope, and frequency structure simultaneously.
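For readers unfamiliar with the metric, PSNR is defined from the mean squared error as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The sketch below implements that standard definition on flat pixel lists and converts a PSNR gain into the corresponding MSE ratio. The function names are hypothetical, and the conversion applies to a single image; the averages quoted above are means over images, so the ratio is only indicative.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by `gain_db` on one image."""
    return 10.0 ** (-gain_db / 10.0)
```

As a rough guide, the 1.60 dB difference between 23.97 dB and 25.57 dB corresponds to `mse_ratio_for_gain(1.6)`, about 0.69, or roughly a 31% lower mean squared error if it were measured on a single image.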

This suggests a domain-specific definition: in restoration, a hierarchical reflection architecture is a system that assigns reflection versus transmission responsibility through layered decomposition across multiple representational domains.

3. Multi-agent and LLM systems: temporal and mnemonic hierarchies of self-correction

In LLM- and MLLM-based agents, hierarchical reflection refers primarily to reflection organized over temporal scope or memory granularity. The MobileUse GUI agent provides a canonical temporal formulation (Li et al., 21 Jul 2025). Reflection is partitioned into three levels that produce action-level, trajectory-level, and global (task-level) feedback. Within the control loop, the Operator generates each action and the Progressor updates the record of task progress. Action Reflection is invoked selectively: a confidence score is computed over the action-type tokens, and reflection is triggered only when that score satisfies a threshold condition (Li et al., 21 Jul 2025).

The three reflective levels serve distinct functions. The Action Reflector checks whether a single action caused the intended local screen change. The Trajectory Reflector detects loops, repeated screenshots, and accumulated local errors over recent steps. The Global Reflector verifies whether the entire instruction has actually been fulfilled when the agent proposes termination (Li et al., 21 Jul 2025). This produces a temporally stratified reflective control loop in which local corrections do not require immediate global replanning, while premature task completion can still be vetoed.
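The three-level control loop described above can be sketched as follows. This is a toy model under stated assumptions, not MobileUse's implementation: screens are plain strings instead of screenshots, each reflector is a simple rule rather than an MLLM call, and both the threshold value `0.8` and the choice to reflect when confidence falls below it are assumptions. All names (`Step`, `reflect`, and the reflector functions) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Step:
    action: str
    confidence: float   # hypothetical scalar confidence for the chosen action type
    screen_before: str  # stand-in for a screenshot
    screen_after: str

def action_reflector(step: Step) -> Optional[str]:
    """Local check: did this single action change the screen?"""
    if step.screen_after == step.screen_before:
        return f"'{step.action}' produced no screen change"
    return None

def trajectory_reflector(history: List[Step], window: int = 3) -> Optional[str]:
    """Mid-range check: do the last `window` steps all end on the same screen?"""
    recent = history[-window:]
    if len(recent) == window and len({s.screen_after for s in recent}) == 1:
        return "recent steps all end on the same screen (possible loop)"
    return None

def global_reflector(final_screen: str, goal_marker: str) -> Optional[str]:
    """Task-level check, run only when the agent proposes to terminate."""
    if goal_marker not in final_screen:
        return "termination proposed but the goal state is not visible"
    return None

def reflect(history: List[Step], proposes_done: bool, goal_marker: str,
            threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Collect (level, message) feedback from whichever reflectors fire."""
    feedback = []
    last = history[-1]
    # Assumed trigger direction: reflect on the action only when confidence is low.
    if last.confidence < threshold:
        msg = action_reflector(last)
        if msg:
            feedback.append(("action", msg))
    msg = trajectory_reflector(history)
    if msg:
        feedback.append(("trajectory", msg))
    if proposes_done:
        msg = global_reflector(last.screen_after, goal_marker)
        if msg:
            feedback.append(("global", msg))
    return feedback
```

The design point the sketch tries to capture is that each level has its own trigger: a confident action skips the local check entirely, a loop is caught without consulting the goal, and the global check acts only as a veto on termination.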

The quantitative effect is substantial. On AndroidWorld, the base system without reflection or exploration achieves an average success rate of 49.5; adding the Action Reflector raises it to 55.17, the Trajectory Reflector to 56.1, the Global Reflector to 58.6, and Reflection-on-Demand to 61.6, and adding Proactive Exploration then yields 62.9 (Li et al., 21 Jul 2025). On AndroidLab, MobileUse attains 44.20% SR and 50.01% Sub-SR (Li et al., 21 Jul 2025). Reflection corrects 18 previously failed tasks, corresponding to a 30.51% correction rate, while misjudging 7.02% of successful tasks (Li et al., 21 Jul 2025).

A related but more memory-centric formulation appears in H$^2$R for multi-task LLM agents (Ye et al., 16 Sep 2025). Here the hierarchy separates high-level planning memory from low-level execution memory. Hindsight reflection infers subgoals, extracts high-level planning insights and low-level execution insights, and grounds them back into separate memory stores (Ye et al., 16 Sep 2025). At test time, high-level retrieval is conditioned on the task description, whereas low-level retrieval is conditioned on the current subgoal.
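The separation of the two stores and their different retrieval keys can be made concrete with a small sketch. It is an illustration under assumptions rather than the paper's method: insights are plain strings, similarity is Jaccard word overlap standing in for whatever retriever the real system uses, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

def overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (stand-in for a real retriever)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

@dataclass
class HierarchicalMemory:
    high: List[Tuple[str, str]] = field(default_factory=list)  # (task description, planning insight)
    low: List[Tuple[str, str]] = field(default_factory=list)   # (subgoal, execution insight)

    def add_from_hindsight(self, task, planning_insight, subgoal_insights):
        """Store one reflected episode: one planning insight plus per-subgoal insights."""
        self.high.append((task, planning_insight))
        self.low.extend(subgoal_insights)

    def retrieve_plan(self, task, k=1):
        """High-level retrieval, keyed by the task description."""
        return self._top(self.high, task, k)

    def retrieve_execution(self, subgoal, k=1):
        """Low-level retrieval, keyed by the current subgoal."""
        return self._top(self.low, subgoal, k)

    @staticmethod
    def _top(store, query, k):
        ranked = sorted(store, key=lambda item: overlap(item[0], query), reverse=True)
        return [insight for _, insight in ranked[:k]]
```

A planner would call `retrieve_plan` once per task, while an executor would call `retrieve_execution` each time the active subgoal changes, so neither prompt is filled with knowledge from the other level.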

This decoupling yields measurable gains. On AlfWorld, H$^2$R improves success from 72.4% for ExpeL to 75.9%; on PDDLGame it improves from 72.2% to 80.5% (Ye et al., 16 Sep 2025). Ablation shows that removing high-level memories drops PDDLGame success to 52.8%, while removing low-level memories drops it to 61.1% (Ye et al., 16 Sep 2025). A plausible implication is that hierarchical reflection in LLM agents is most effective when reflective knowledge is stored and retrieved at the same granularity at which decisions are made.

MA-CoNav extends this logic into embodied vision-and-language navigation (VLN) with both organizational and reflective hierarchy (Luo et al., 3 Mar 2026). It adopts a “1 Master + 4 Sub-agents” structure (Task Planning Agent, Observation Agent, Control Execution Agent, and Memory Agent) plus a “Local-Global” dual-stage reflection mechanism (Luo et al., 3 Mar 2026). Local reflection evaluates candidate actions against immediate obstacles and risk patterns from memory, whereas global reflection segments full histories into episodes, diagnoses failures, and stores structured experience tuples.

On a real-world indoor dataset, MA-CoNav reaches SR = 25.6%, compared with 8.4% for RCTAMP and 2.8% for CoELA; removing the reflection mechanism reduces SR to 17.2% (Luo et al., 3 Mar 2026). Here reflection is hierarchical both by temporal scope and by control authority.

4. Software and runtime systems: architecture as reflected structure

In software architecture research, hierarchical reflection architecture often denotes explicit multi-level structural representation plus mechanisms for reconstruction, evolution, or runtime ascent.

The ROS 2 architecture recovery pipeline (Briechle et al., 19 May 2026) uses a UML-based modeling concept with AtomicRosNodeClassifier, ComposedRosNodeClassifier, and RosNodePart. The hierarchy distinguishes source-level node definitions from launch-level instantiation and subsystem composition. The architecture is reconstructed through a staged, agent-based pipeline comprising NodeAnalyzer, ComponentArchitectureTeam, LaunchFileAnalyzer, and SystemArchitectureTeam. Two intermediate representations are central: the List of Atomic ROS Nodes (JSON) and the Launch File Dependency Description (JSON) (Briechle et al., 19 May 2026). These constrain reconstruction across multiple abstraction levels and yield both AtomicClassifierDiagram (ACD) and ComposedClassifierDiagram (CCD) models.
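The typing relation between parts and classifiers can be sketched as a small data model. Only the three class names come from the source; the fields (`name`, `parts`, `instance_name`, `namespace`), the `flatten` helper, and the example node names are assumptions introduced for illustration and do not reproduce the paper's metamodel.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class AtomicRosNodeClassifier:
    name: str  # type of a node defined at source level

@dataclass
class RosNodePart:
    instance_name: str
    classifier: Union[AtomicRosNodeClassifier, "ComposedRosNodeClassifier"]
    namespace: str = ""  # assumed: scope applied when the part is instantiated

@dataclass
class ComposedRosNodeClassifier:
    name: str  # assumed: e.g. derived from a launch file
    parts: List[RosNodePart] = field(default_factory=list)

def flatten(composed: ComposedRosNodeClassifier, scope: str = "") -> List[Tuple[str, str]]:
    """List (qualified instance path, atomic type) pairs for a nested composition."""
    result = []
    for part in composed.parts:
        path = "/".join(p for p in (scope, part.namespace, part.instance_name) if p)
        if isinstance(part.classifier, ComposedRosNodeClassifier):
            result.extend(flatten(part.classifier, path))  # descend into the subsystem
        else:
            result.append((path, part.classifier.name))
    return result
```

In this reading, a launch file that includes another launch file becomes a composed classifier whose part is typed by another composed classifier, which is how nesting arises without any new modeling element.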

In evaluation on the BrickByBrick case study, ACD recovery achieves Precision = 1.0, Recall = 1.0, F1 = 1.0, while CCD recovery achieves Precision = 1.0, Recall = 0.95, F1 = 0.98 (Briechle et al., 19 May 2026). The architecture is reflective in the sense that it can be re-derived from distributed implementation artifacts and linked back to them via IDs, launch-file dependencies, and namespace scopes. This suggests an architectural definition of hierarchical reflection: a system in which implementation evidence can regenerate nested architectural views with traceable links across levels.
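The F1 values quoted above follow the standard harmonic-mean definition, which the short sketch below makes explicit. The function name is hypothetical; the formula itself is the usual one.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the rounded CCD figures, `f1_score(1.0, 0.95)` is about 0.974. A reported F1 of 0.98 is consistent with a recall slightly above 0.95 before rounding (for example, 0.954 gives about 0.976).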

A more explicit runtime formulation appears in “Climbing Up the Semantic Tower -- at Runtime” (Rideau, 1 Jun 2026). Software is modeled as a semantic tower of implementations. A concrete language or machine $C$ implements a more abstract one $A$ via a subset $O \subseteq C.o$ of observable safe points and an interpretation functor $\Phi : O \to A$.

Reflection is not limited to descending into lower-level runtime states; rather, safe points allow a running concrete computation to be observed at a higher abstraction level. The paper formalizes observability with an observe function whose computational content ensures that any interrupted concrete execution fragment can be extended to a safe point and mapped to an abstract transition (Rideau, 1 Jun 2026). Implementations become first-class values, enabling migration between implementations while execution continues.
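The role of safe points can be shown with a toy two-level tower. The machines below are invented for illustration and are not taken from the paper: the abstract machine is a counter that increments by one, and the concrete machine implements each increment as two micro-steps, so only every other concrete state is a safe point. All function names are hypothetical.

```python
from typing import Tuple

Concrete = Tuple[int, int]  # (value, phase); phase 1 means an update is in progress

def concrete_step(state: Concrete) -> Concrete:
    """One micro-step of the concrete machine."""
    value, phase = state
    if phase == 0:
        return (value, 1)      # begin the update
    return (value + 1, 0)      # commit the update

def is_safe_point(state: Concrete) -> bool:
    """Safe points are the concrete states that have an abstract meaning."""
    return state[1] == 0

def interpret(state: Concrete) -> int:
    """Partial interpretation: defined only on safe points."""
    if not is_safe_point(state):
        raise ValueError("state is not observable at the abstract level")
    return state[0]

def abstract_step(n: int) -> int:
    """One transition of the abstract machine."""
    return n + 1

def observe(state: Concrete, max_steps: int = 10) -> Tuple[int, Concrete]:
    """Run an interrupted execution forward to a safe point, then interpret it."""
    steps = 0
    while not is_safe_point(state):
        if steps == max_steps:
            raise RuntimeError("no safe point reached")
        state = concrete_step(state)
        steps += 1
    return interpret(state), state
```

Interrupting the concrete machine mid-update at `(3, 1)` and calling `observe` yields the abstract state `4`, which is exactly `abstract_step(3)`: the interrupted fragment has been completed and mapped to one abstract transition.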

This runtime tower is hierarchical because each level implements the one above and can itself be reflected upon. It is reflective because the implementation relation is exposed as a usable protocol rather than hidden beneath compilation boundaries.

The ArchWare ADL work provides a complementary structural view (Morrison et al., 2010). Active software architectures are defined as dynamic, updatable, decomposable, and reflective. Key mechanisms include compose, decompose, hyper-code, and maps between the domain of entities and the domain of representations (Morrison et al., 2010). A running system can be decomposed into components, reified into hyper-code, transformed, reflected back, and recomposed. Because a composed behavior is itself a behavior, composition induces an implicit hierarchy of subsystems. Reflection therefore operates structurally over nested compositions.

The earlier thesis on reflection and hyper-programming in persistent systems further clarifies this layering (Kirby, 2010). It describes a stack comprising the persistent store and runtime, linguistic reflection mechanisms, hyper-programming, and reflective tools over hyper-program representations. The key abstract decomposition of evaluation into compile, eval', drop, and raise provides a phase hierarchy—composition time, compile time, link time, run time—while hyper-programs add a representation layer in which source contains direct links into the persistent store (Kirby, 2010). This work does not use the phrase “hierarchical reflection architecture,” but it provides a precise instance of one.

5. Formal and logical formulations

A formal semantics of hierarchical reflection appears in the logic of n-dimensional hierarchical refinement (Madeira et al., 2016). An $n$-layered model comprises a set of local states for each level $k$, the compatibility predicate $D^n$, level-specific transition relations $R_k$, and valuations for propositions and nominals that depend on the level (Madeira et al., 2016). A hierarchical model additionally satisfies a projection condition: projecting a level-$k$ transition to the lower layers recovers the lower-layer transition structure exactly (Madeira et al., 2016).

This framework supports formulas at multiple levels, including modal operators and satisfaction operators, and provides a standard translation into many-sorted first-order logic. Reflection here is not operational self-critique but layered self-description: higher levels encode abstractions of lower levels, and refinement adds new layers while preserving positive properties under simulation (Madeira et al., 2016). Bisimulation and simulation are defined over compatible tuples rather than isolated states, which makes the hierarchy semantically first-class.

A distinct but philosophically related use of reflection appears in proof theory (Nogina, 2014). Writing $\Box P$ for “$P$ is provable” and $u{:}P$ for “$u$ is a proof of $P$”, reflection principles over PA include local reflection

$\Box P \to P$

and mixed explicit–implicit reflection

$\Box^k u{:}P \to P$

The paper proves that every such reflection principle is equivalent either to $\Box P \to P$ or to $\Box^k u{:}P \to P$ for some $k \ge 0$, and that they form a strict hierarchy

$u{:}P \to P \;\prec\; \Box u{:}P \to P \;\prec\; \Box^2 u{:}P \to P \;\prec\; \cdots \;\prec\; \Box P \to P$

(Nogina, 2014). This is a logical hierarchy of self-trust or self-recognition rather than an implemented architecture, but it shows that hierarchical reflection can be formalized as non-collapsing levels of increasingly strong self-referential principles.

This suggests a broader interpretation: formal hierarchical reflection requires not merely multiple levels, but nontrivial cross-level semantics that preserve distinctions in strength or expressivity.

6. Cross-domain synthesis, design patterns, and recurrent tensions

Across these disparate domains, several recurring design patterns emerge.

One is coarse-to-fine organization. F2T2-HiT processes reflections from shallow high-resolution detail to low-resolution global context, while HiT further spans multiple attention window sizes (Cai et al., 5 Jun 2025). MobileUse escalates from action-level to trajectory-level to global-level reflection (Li et al., 21 Jul 2025). TimeSearch for long-video understanding, although framed around Spotlight and Reflection rather than the exact phrase “hierarchical reflection architecture,” applies reflection-guided best-first search over temporal segments, combining global sparse context, recursive event subdivision, and final dense spotlighting (Pan et al., 2 Apr 2025). MA-CoNav similarly separates step-level correction from episode-level reflective knowledge construction (Luo et al., 3 Mar 2026).

A second pattern is separation of representation and control levels. In software systems, higher architectural levels reconstruct or manipulate lower-level structure rather than directly executing their logic (Briechle et al., 19 May 2026, Morrison et al., 2010, Rideau, 1 Jun 2026). In H$^2$R, planning memory and execution memory are not merely two retrieval pools; they correspond to distinct decision strata (Ye et al., 16 Sep 2025). In hierarchical transition systems, higher-level states are refined into local transition systems while preserving projection consistency (Madeira et al., 2016).

A third pattern is explicit traceability across levels. ROS 2 recovery uses IDs, typing relations, and launch inclusions (Briechle et al., 19 May 2026). Semantic towers use safe points and interpretation functors (Rideau, 1 Jun 2026). Hyper-programming stores source with links into the persistent store (Kirby, 2010). Without such traceability, hierarchy becomes descriptive rather than reflective.

A fourth pattern is reflection as selective intervention rather than universal recomputation. MobileUse invokes Action Reflection only when its confidence-based trigger condition is met, showing that selective reflection can improve both efficiency and performance (Li et al., 21 Jul 2025). TimeSearch uses reflection confidence to prioritize temporal search and stop early when confidence exceeds a threshold (Pan et al., 2 Apr 2025). Dynamic Hierarchical Justification in Soar-like hierarchical execution uses subtask support sets to retract entire subtasks when higher-level assumptions change, rather than maintaining fine-grained justifications for each assumption (Laird et al., 2011). This suggests that hierarchy often serves not only representational richness but computational tractability.
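The subtask-level retraction idea can be sketched in a few lines. This is a simplified illustration of the bookkeeping described above, not the Soar or DHJ implementation: assumptions are plain strings, each subtask carries a single support set, and the names `Subtask` and `retract_on_change` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

@dataclass
class Subtask:
    name: str
    support: FrozenSet[str]  # higher-level assumptions this subtask depends on
    local_facts: List[str] = field(default_factory=list)  # discarded together with the subtask

def retract_on_change(subtasks: List[Subtask],
                      changed_assumption: str) -> Tuple[List[Subtask], List[Subtask]]:
    """Split subtasks into (kept, retracted) after one higher-level assumption changes."""
    kept, retracted = [], []
    for subtask in subtasks:
        if changed_assumption in subtask.support:
            retracted.append(subtask)  # the whole subtask goes, local facts included
        else:
            kept.append(subtask)
    return kept, retracted
```

The bookkeeping is one set per subtask instead of one justification per local fact, which is the tractability gain. The cost is that local facts that were still valid are discarded and may have to be regenerated.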

The main recurring tension is between granularity and cost. Finer-grained reflection can improve correction or reuse, but may increase complexity, latency, or instability. MobileUse shows that reflecting at every step is not optimal; confidence-based triggering performs better (Li et al., 21 Jul 2025). H$^2$R shows that flat memory is too coarse, but hierarchical decomposition requires extra subgoal inference and segmentation machinery (Ye et al., 16 Sep 2025). In DHJ, subtask-level retraction is simpler than assumption-level retraction but can cause unnecessary regeneration (Laird et al., 2011). In runtime semantic towers, observability requires semantically defined safe points, which may impose implementation constraints (Rideau, 1 Jun 2026).

Another tension concerns static versus dynamic hierarchy. F2T2-HiT uses fixed window sizes and fixed encoder–decoder scales (Cai et al., 5 Jun 2025). MA-CoNav’s local–global reflection is architecturally fixed (Luo et al., 3 Mar 2026). ROS 2 recovery currently relies on static launch-file evidence and struggles with dynamic integration semantics (Briechle et al., 19 May 2026). Several works therefore imply future directions toward content-adaptive scale selection, dynamic windowing, richer runtime evidence, or online evolution of reflective structure.

7. Limitations, misconceptions, and prospective directions

A common misconception is that hierarchical reflection architecture always means a meta-controller supervising lower-level modules. The literature shows a broader range. In image restoration, it may instead mean multi-scale and multi-domain decomposition of a phenomenon called “reflection” (Cai et al., 5 Jun 2025). In proof theory, it refers to stratified self-referential principles of provability (Nogina, 2014). In formal system design, it may denote tower navigation across abstraction levels (Rideau, 1 Jun 2026) or hierarchical refinement semantics (Madeira et al., 2016). The unifying feature is layered self-relation, not any one implementation pattern.

Another misconception is that hierarchy automatically implies improved robustness. The evidence is more specific. Gains appear when hierarchy aligns with task structure: reflection size diversity in SIRR (Cai et al., 5 Jun 2025), temporal scale in GUI control (Li et al., 21 Jul 2025), subgoal decomposition in LLM agents (Ye et al., 16 Sep 2025), or launch-induced structure in ROS 2 systems (Briechle et al., 19 May 2026). Misaligned hierarchies may simply add complexity.

Several limitations recur across the surveyed work. Static reflective structures often struggle with dynamic conditions, as noted for ROS 2 launch semantics (Briechle et al., 19 May 2026) and fixed windowing or scale choices in SIRR (Cai et al., 5 Jun 2025). Reflection quality often depends on the quality of the underlying model, as seen in LLM-based self-reflection systems where critiques may hallucinate or misjudge states (Li et al., 21 Jul 2025, Ye et al., 16 Sep 2025). Post-hoc reflective knowledge may improve reuse but not guarantee online adaptation, since many systems update memory offline or episodically rather than continuously (Ye et al., 16 Sep 2025, Luo et al., 3 Mar 2026).

Prominent future directions are already visible in the cited work. In image restoration, plausible extensions include dynamic scale selection, content-adaptive windowing, more structured frequency priors, explicit prediction of both $T$ and $R$, and temporal extensions for video reflection removal (Cai et al., 5 Jun 2025). In agent systems, likely continuations include richer multi-scale reflection triggers, tighter integration of memory and planning, and extension to more complex embodied or web environments (Li et al., 21 Jul 2025, Ye et al., 16 Sep 2025, Luo et al., 3 Mar 2026). In software systems, major open directions include dynamic semantic integration, formalized metamodels, and first-class runtime reflection protocols deployed in real systems rather than primarily as formal blueprints (Briechle et al., 19 May 2026, Rideau, 1 Jun 2026). In formal logic and semantics, a plausible implication is that richer hierarchies of reflection could be studied by combining layered transition semantics, explicit proof formalisms, and operational reflection mechanisms (Madeira et al., 2016, Nogina, 2014).

Taken together, the literature supports a domain-general characterization: a hierarchical reflection architecture is a structured arrangement in which reflection is distributed across explicitly related levels, with each level operating on representations appropriate to its scope and feeding corrections, abstractions, or reconstructions across levels. What varies from field to field is the substrate—pixels, trajectories, memories, launch files, transition systems, proofs, or runtime states—but the architectural motif remains strikingly consistent.