
Local-to-Global Inter-View Hierarchy

Updated 10 December 2025
  • Local-to-global inter-view hierarchy is a hierarchical attention mechanism that organizes token interactions from individual view-level computations to global aggregation.
  • It divides attention computation into frame-wise, group-wise, and global stages, ensuring efficient scaling and preservation of local and global context.
  • Integrated with an intra-view pyramidal (fine-to-coarse) architecture, the approach achieves near-linear computational complexity in the number of views with superior empirical performance on 3D reconstruction benchmarks.

A local-to-global inter-view hierarchy is an architectural mechanism within multi-view transformer models that structures attention and information flow across multiple input views, such as images, for tasks like large-scale 3D scene reconstruction. This hierarchy arranges inter-view reasoning in a staged manner: computations are first performed within individual views (local), then among small groups of neighboring views (group-wise), and finally across all views globally. Such a hierarchy allows scalable, efficient, and consistent modeling of large multi-view datasets, notably in conjunction with a complementary intra-view strategy (e.g., a fine-to-coarse hierarchy) (Kang et al., 8 Dec 2025).

1. Motivation and Definition

The local-to-global inter-view hierarchy arises from the challenge of reconstructing large 3D scenes from many posed images using transformer-based models. A naïve application of global self-attention to all image patches across all views incurs prohibitive quadratic complexity in both computation and memory relative to sequence length, which scales rapidly with the number and resolution of input images. The local-to-global inter-view hierarchy addresses this by initially restricting attention to tokens within a single view, then attending within small, localized groups of views, and ultimately across the full set of views, thereby enabling near-linear scaling with the number of views and preserving both local geometric fidelity and global consistency (Kang et al., 8 Dec 2025).

2. Hierarchical Workflow: Stages of Inter-View Reasoning

The local-to-global inter-view hierarchy in the Multi-view Pyramid Transformer (MVP) is realized in three key attention stages:

  1. Frame-wise (Local) Stage: Each view's tokens attend exclusively to other tokens within the same image. This confines initial reasoning to the local spatial context of each frame.
  2. Group-wise (Intermediate) Stage: Views are partitioned into groups of size $M$, typically determined by frame-index locality, forming $G = N/M$ groups (where $N$ is the number of views). In each group, tokens first undergo frame-wise self-attention, then group-wise attention across the concatenated tokens from all views in the group.
  3. Global Stage: The group size $M$ is set to $N$ so that all tokens from all views participate in global self-attention, enabling full-scene aggregation.

This staged logic is captured in the transformation of tokens $T \in \mathbb{R}^{N \cdot h_s w_s \times d_s}$ through localized, grouped, and then global self-attention operations, where $h_s w_s$ is the number of spatial tokens per view at hierarchy stage $s$, and $d_s$ is the token embedding dimensionality (Kang et al., 8 Dec 2025).
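
The following minimal sketch (PyTorch-style, with simplified single-head attention and hypothetical names such as `inter_view_attention`) illustrates how all three stages can be expressed as the same operation with different group sizes; it is an illustration of the staged logic above, not the authors' implementation.

```python
import torch

def self_attn(x: torch.Tensor) -> torch.Tensor:
    # Simplified single-head scaled dot-product self-attention.
    # x: (batch, tokens, dim) -> (batch, tokens, dim)
    d = x.shape[-1]
    scores = x @ x.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ x

def inter_view_attention(tokens: torch.Tensor, group_size: int) -> torch.Tensor:
    # tokens: (N, P, d) -- N views, P spatial tokens per view.
    # group_size = 1     -> frame-wise (local) stage
    # 1 < group_size < N -> group-wise (intermediate) stage
    # group_size = N     -> global stage
    N, P, d = tokens.shape
    assert N % group_size == 0, "views must split evenly into groups"
    # Merge group_size neighbouring views into one attention context (frame-index locality).
    grouped = tokens.reshape(N // group_size, group_size * P, d)
    grouped = self_attn(grouped)
    return grouped.reshape(N, P, d)

N, P, d = 8, 16, 32
x = torch.randn(N, P, d)
x = inter_view_attention(x, group_size=1)   # Stage 1: frame-wise
x = inter_view_attention(x, group_size=4)   # Stage 2: group-wise (M = 4)
x = inter_view_attention(x, group_size=N)   # Stage 3: global
```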

Table: Stages of Local-to-Global Inter-View Hierarchy

| Stage | Attention Scope | Operation |
|---|---|---|
| Frame-wise | Within-view (local) | $\text{SelfAttn}(T_i)$ for each view's tokens $T_i$ |
| Group-wise | Within group of $M$ views | Frame-wise, then group-wise $\text{SelfAttn}$ over the group's concatenated tokens |
| Global | All views (global) | $\text{SelfAttn}$ over all view tokens |

3. Mathematical Formulation

For each block in the hierarchy, multi-head self-attention is applied as:

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V)$$

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W^O$$
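
As a concrete reference, the equations above can be transcribed almost line by line into code. The sketch below uses plain tensor operations; the parameter matrices `W_q`, `W_k`, `W_v`, `W_o` are illustrative placeholders, and a practical implementation would wrap them in learned linear layers.

```python
import math
import torch

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (T, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)
    T, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(x):                                 # (T, d_model) -> (heads, T, d_k)
        return x.reshape(T, num_heads, d_k).transpose(0, 1)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # softmax(QK^T / sqrt(d_k))
    heads = torch.softmax(scores, dim=-1) @ V           # head_1, ..., head_H
    concat = heads.transpose(0, 1).reshape(T, d_model)  # Concat(head_1, ..., head_H)
    return concat @ W_o                                 # ... W^O
```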

For group-wise partitioning:

  • $T$ is grouped: $G \leftarrow \text{group}(T)$
  • Frame-wise and group-wise attention are applied within groups, and global attention aggregates information across all tokens when $M = N$ (Kang et al., 8 Dec 2025).

4. Integration with Intra-View and Pyramidal Architectures

The local-to-global inter-view hierarchy is interleaved with a fine-to-coarse intra-view hierarchy, which aggregates per-view tokens at progressively coarser spatial resolutions. Between inter-view attention stages, spatial downsampling (e.g., a $3 \times 3$ stride-2 convolution) halves the spatial resolution and doubles the feature dimensionality: $h_{s+1} = h_s/2$, $w_{s+1} = w_s/2$, $d_{s+1} = 2 d_s$. The hierarchy culminates in pyramidal feature aggregation, which fuses multi-scale features top-down for decoding into 3D Gaussian parameters for scene reconstruction (Kang et al., 8 Dec 2025).
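
A minimal sketch of the inter-stage downsampling described above, assuming a $3 \times 3$, stride-2 convolution with padding 1 that doubles the channel width (the exact normalization and layer placement in MVP may differ):

```python
import torch
import torch.nn as nn

class StageDownsample(nn.Module):
    # Halves spatial resolution and doubles feature width between stages:
    # h_{s+1} = h_s / 2, w_{s+1} = w_s / 2, d_{s+1} = 2 * d_s
    def __init__(self, d_s: int):
        super().__init__()
        self.conv = nn.Conv2d(d_s, 2 * d_s, kernel_size=3, stride=2, padding=1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (N, h*w, d_s) per-view token grids
        N, _, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(N, d, h, w)  # restore the spatial grid
        x = self.conv(x)                                # (N, 2*d, h//2, w//2)
        return x.flatten(2).transpose(1, 2)             # (N, (h//2)*(w//2), 2*d)
```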

5. Computational Efficiency and Empirical Performance

The local-to-global hierarchy dramatically reduces both computational and memory requirements in large-scale multi-view transformers. A naïve approach scales as $O((N h_s w_s)^2)$, whereas the staged approach in MVP divides complexity as:

  • Stage 1 (Frame-wise): $N \cdot O((h_1 w_1)^2)$
  • Stage 2 (Group-wise): $(N/M) \cdot O((M h_2 w_2)^2) + N \cdot O((h_2 w_2)^2)$
  • Stage 3 (Global): $O((N h_3 w_3)^2)$

Since $h_2 w_2 \approx (h_1 w_1)/4$, $h_3 w_3 \approx (h_1 w_1)/16$, and $M$ is kept small (e.g., 4), the total cost scales nearly linearly with $N$. For $N = 256$ views at $960 \times 540$ resolution, MVP achieves $1.8$ s per frame compared to $21$ s for iLRM, with global-attention baselines encountering memory exhaustion (Kang et al., 8 Dec 2025).
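
A back-of-the-envelope check of these scaling relations (illustrative only: the per-view token count `P1` and the group size are assumed, and constant factors and per-stage block counts are ignored):

```python
# Compare attention cost (number of token pairs) for naive global attention
# at full resolution versus the staged local-to-global schedule.
N = 256                       # number of views
P1 = 2048                     # assumed spatial tokens per view at stage 1
P2, P3 = P1 // 4, P1 // 16    # tokens per view after each 2x spatial downsampling
M = 4                         # group size at stage 2

naive  = (N * P1) ** 2
stage1 = N * P1 ** 2                                  # frame-wise
stage2 = (N // M) * (M * P2) ** 2 + N * P2 ** 2       # group-wise (+ frame-wise)
stage3 = (N * P3) ** 2                                # global at coarse resolution
staged = stage1 + stage2 + stage3

print(f"naive / staged ~ {naive / staged:.0f}x")      # roughly two orders of magnitude
```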

6. Quantitative Results and Ablation Insights

Quantitative benchmarks demonstrate the significance of the local-to-global inter-view hierarchy:

  • On DL3DV (32 views): MVP attains PSNR 25.96 dB, SSIM 0.847, LPIPS 0.187, and 0.17 s per frame, outperforming iLRM and 3D-GS in both accuracy and speed.
  • Ablation shows that removing the group-wise (local-to-global) or pyramidal components reduces reconstruction quality (e.g., PSNR drops from 22.79 dB to 22.53 or 21.58 dB), and dropping the hierarchy altogether increases latency by 6× at $N = 256$ or causes out-of-memory failures (Kang et al., 8 Dec 2025).

7. Broader Significance and Implications

The local-to-global inter-view hierarchy is central to achieving efficient, scalable, and high-fidelity 3D reconstruction from hundreds of images in a single forward pass. This approach obviates the prohibitive quadratic resource demands associated with global-attention-only methods and enables robust generalization to unseen numbers of input views. A plausible implication is that such hierarchies can inform broader classes of multi-view and multi-modal transformer architectures where cross-instance reasoning must balance locality, scalability, and global context (Kang et al., 8 Dec 2025).
