
Fine-to-Coarse Intra-View Hierarchy

Updated 10 December 2025
  • Fine-to-coarse intra-view hierarchy is a neural network principle that progressively aggregates detailed spatial information into compact, abstract representations.
  • It employs patch embedding and convolution-based token reduction to balance high-resolution local details with broader global context for 3D scene reconstruction.
  • Empirical results on benchmarks show enhanced scalability and performance, with MVP achieving state-of-the-art metrics across a range of view counts.

A fine-to-coarse intra-view hierarchy is a neural network architectural principle in which detailed spatial information within a single input view is progressively aggregated into more compact and abstract representations through successive encoding stages. This approach underpins the Multi-view Pyramid Transformer (MVP) architecture, facilitating efficient and scalable processing for reconstructing large 3D scenes from many images in a single forward pass. The intra-view hierarchy operates orthogonally to the inter-view (local-to-global) hierarchy, jointly delivering both computational efficiency and representational richness by combining detailed local features with broad, global context (Kang et al., 8 Dec 2025).

1. Formal Definition and Data Flow

Given a set of $N$ posed input images $I_i \in \mathbb{R}^{H\times W\times 3}$, each augmented with its camera ray information $P_i \in \mathbb{R}^{H\times W\times 9}$ (via Plücker encoding), the input to the intra-view hierarchy is the combined image $\widetilde I_i = [I_i; P_i] \in \mathbb{R}^{H\times W\times 12}$. A patch-embedding layer partitions each $\widetilde I_i$ into non-overlapping patches of size $p \times p$ and linearly projects them into tokens of dimension $d_0$. The result is an initial token map $T_i^{(0)} \in \mathbb{R}^{h_0 \times w_0 \times d_0}$, with $h_0 = H/p$ and $w_0 = W/p$.
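As a concrete illustration, the following PyTorch sketch (module and parameter names are illustrative, not taken from the paper) implements a patch embedding over the 12-channel image-plus-ray input:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative patch embedding: splits the 12-channel image+ray tensor
    into non-overlapping p x p patches and projects each to a d0-dim token."""
    def __init__(self, in_chans: int = 12, patch_size: int = 8, d0: int = 256):
        super().__init__()
        # A convolution with kernel = stride = p is equivalent to a linear
        # projection of each flattened, non-overlapping p x p patch.
        self.proj = nn.Conv2d(in_chans, d0, kernel_size=patch_size, stride=patch_size)

    def forward(self, x_tilde: torch.Tensor) -> torch.Tensor:
        # x_tilde: (N, 12, H, W), the concatenation of RGB and Plücker ray maps
        tokens = self.proj(x_tilde)          # (N, d0, H/p, W/p)
        return tokens.permute(0, 2, 3, 1)    # (N, h0, w0, d0) token map

# Example: N = 4 views at 256 x 256 resolution, patch size p = 8
images, rays = torch.rand(4, 3, 256, 256), torch.rand(4, 9, 256, 256)
print(PatchEmbed()(torch.cat([images, rays], dim=1)).shape)  # (4, 32, 32, 256)
```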

The tokens undergo three "stages." Each stage includes:

  • An inter-view attention block (orthogonal to intra-view);
  • An intra-view token-reduction convolution that halves the spatial resolution per view ($h_\ell, w_\ell \rightarrow h_{\ell+1}, w_{\ell+1}$) and doubles the per-token channel count ($d_\ell \rightarrow d_{\ell+1}$).

Mathematically, the token reduction at level \ell is:

$$T^{(\ell+1)} = \operatorname{Conv2D}\!\left(\mathrm{reshape}\big(T^{(\ell)\prime}\big)\right) \in \mathbb{R}^{\frac{h_\ell}{2} \times \frac{w_\ell}{2} \times 2d_\ell},$$

where $T^{(\ell)\prime}$ denotes the stage-$\ell$ token map after its inter-view attention block.

This progression systematically decreases sequence length while increasing representation capacity per token.
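A minimal sketch of this token-reduction convolution, assuming a stride-2 kernel (the exact kernel size is an assumption for illustration):

```python
import torch
import torch.nn as nn

class TokenReduction(nn.Module):
    """Illustrative intra-view token reduction: halves the per-view spatial
    token grid and doubles the channel width, mirroring T^(l) -> T^(l+1)."""
    def __init__(self, d_l: int):
        super().__init__()
        # Kernel/stride 2 merges each 2x2 block of tokens into a single token
        # with 2*d_l channels; the kernel size is an assumption in this sketch.
        self.conv = nn.Conv2d(d_l, 2 * d_l, kernel_size=2, stride=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, h_l, w_l, d_l) token map after the stage's attention block
        x = tokens.permute(0, 3, 1, 2)       # reshape to (N, d_l, h_l, w_l)
        x = self.conv(x)                     # (N, 2*d_l, h_l/2, w_l/2)
        return x.permute(0, 2, 3, 1)         # (N, h_l/2, w_l/2, 2*d_l)

# Example: stage-1 tokens (32 x 32, d0 = 256) -> stage-2 tokens (16 x 16, 512)
t0 = torch.rand(4, 32, 32, 256)
print(TokenReduction(256)(t0).shape)  # torch.Size([4, 16, 16, 512])
```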

2. Architectural Motivation and Principles

The fine-to-coarse intra-view hierarchy addresses the challenge of capturing both fine geometric structure and global contextual information within each input image. Early network stages maintain high spatial resolution, performing frame-wise attention at a granularity fine enough to resolve geometric detail. Subsequent stages incrementally reduce the spatial token count through pooling and convolution while raising the per-token channel dimensionality, so that later-stage tokens summarize broader image regions with higher-level features.

This allows downstream modules, such as group-wise and global inter-view attention, to operate over a substantially reduced number of tokens per view, yielding significant gains in memory and computational efficiency. The design is critical to preventing "attention dilution," as compact token representations maintain the information content required for high-quality 3D reconstruction even as the transformer scales to hundreds of views.
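Putting the pieces together, the following rough end-to-end sketch shows how attention and token reduction interleave across the three stages. The attention here is a simple frame-wise stand-in; in MVP the inter-view attention scope widens from frame-wise to group-wise to global, which this simplification does not reproduce:

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One illustrative stage: self-attention over each view's tokens at the
    current resolution, then an optional stride-2 reduction that halves h, w
    and doubles d. A simplified stand-in for the blocks described above."""
    def __init__(self, d: int, heads: int = 8, reduce: bool = True):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.reduce = nn.Conv2d(d, 2 * d, kernel_size=2, stride=2) if reduce else None

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        n, h, w, d = tokens.shape
        seq = tokens.reshape(n, h * w, d)              # per-view token sequence
        seq = seq + self.attn(seq, seq, seq)[0]        # residual attention
        x = seq.reshape(n, h, w, d).permute(0, 3, 1, 2)
        if self.reduce is not None:
            x = self.reduce(x)                         # (n, 2d, h/2, w/2)
        return x.permute(0, 2, 3, 1)

# Fine-to-coarse progression: 32x32x256 -> 16x16x512 -> 8x8x1024 tokens per view
tokens = torch.rand(4, 32, 32, 256)
for d, reduce in [(256, True), (512, True), (1024, False)]:
    tokens = Stage(d, reduce=reduce)(tokens)
    print(tokens.shape)
```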

3. Mathematical Characterization and Scaling Behavior

The token count and channel dimension at each stage, for an input image of size $H \times W$ and initial patch size $p$, are:

  • Stage 1: $h_0 = H/p$, $w_0 = W/p$, $d_0 = 256$
  • Stage 2: $h_1 = h_0/2$, $w_1 = w_0/2$, $d_1 = 512$
  • Stage 3: $h_2 = h_1/2$, $w_2 = w_1/2$, $d_2 = 1024$

The patch-size cascade is set to $(8, 16, 32)$ for typical use, with variants explored for trade-off analysis.

Computational complexity per stage scales as follows:

  • Stage 1 (frame-wise attention): $O\!\left(N (h_0 w_0)^2\right)$
  • Stage 2 (group-wise attention, $M \ll N$): $O\!\left(N M (h_1 w_1)^2\right)$
  • Stage 3 (global attention): $O\!\left((N h_2 w_2)^2\right)$

With $h_1 w_1 \approx \tfrac{1}{4} h_0 w_0$ and $h_2 w_2 \approx \tfrac{1}{16} h_0 w_0$, later stages benefit from dramatically reduced sequence lengths. This structure enables MVP to achieve nearly linear-to-subquadratic scaling in the view count $N$, with stage-3 attention applied only to a small number of coarsened tokens.
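For concreteness, the short script below tabulates per-view token counts and evaluates the cost expressions above for a hypothetical 256×256 input with patch size 8; the values of $N$ and $M$ are assumptions chosen only for illustration:

```python
# Per-stage token counts and rough attention costs; H, W, p, N, M are example values.
H = W = 256      # input resolution (assumed)
p = 8            # initial patch size
N = 128          # number of input views (assumed)
M = 8            # stage-2 group-wise attention parameter, M << N (assumed)

h0, w0 = H // p, W // p          # 32 x 32 = 1024 tokens per view (stage 1)
h1, w1 = h0 // 2, w0 // 2        # 16 x 16 =  256 tokens per view (stage 2)
h2, w2 = h1 // 2, w1 // 2        #  8 x 8  =   64 tokens per view (stage 3)

costs = {
    "stage 1 (frame-wise)": N * (h0 * w0) ** 2,
    "stage 2 (group-wise)": N * M * (h1 * w1) ** 2,
    "stage 3 (global)":     (N * h2 * w2) ** 2,
}
for name, c in costs.items():
    print(f"{name}: ~{c:.3e} pairwise attention scores")
```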

4. Functional Role in 3D Scene Reconstruction

The intra-view hierarchy enables MVP to deliver accurate single-pass 3D reconstructions from tens to hundreds of images by:

  • Preserving spatial granularity required for detailed frame-wise processing in early layers.
  • Facilitating pyramidal fusion of features across multiple scales through the Pyramidal Feature Aggregation (PFA) module, which upsamples and fuses features from all three intra-view stages.
  • Enabling final tokens, enriched by both local geometric and global context, to be decoded directly into 3D Gaussian primitives for differentiable rendering with 3D Gaussian Splatting.

Ablation studies confirm that removal of the intra-view hierarchy (e.g., using only a fixed patch size throughout) severely impairs scalability, leading to out-of-memory errors beyond 64 views or an increase in runtime by 50–80× at high $N$. Reversing the hierarchy direction (coarse-to-fine) collapses photometric accuracy, underscoring the necessity of fine-to-coarse progression (Kang et al., 8 Dec 2025).
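The Pyramidal Feature Aggregation step referenced above can be pictured roughly as follows. This is a hedged sketch that assumes 1×1 channel projections, bilinear upsampling to the finest grid, and sum-fusion; none of these specifics are taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidalFeatureAggregation(nn.Module):
    """Illustrative PFA: projects the three intra-view stage outputs to a
    common width, upsamples the coarser maps to the finest resolution, and
    fuses them by summation. The exact fusion operator is an assumption."""
    def __init__(self, dims=(256, 512, 1024), out_dim: int = 256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(d, out_dim, kernel_size=1) for d in dims])

    def forward(self, feats):
        # feats: per-view maps (N, d_l, h_l, w_l) from stages 1-3, fine to coarse
        target = feats[0].shape[-2:]                    # finest spatial grid (h0, w0)
        out = 0
        for proj, f in zip(self.proj, feats):
            f = proj(f)                                 # unify channel width
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            out = out + f                               # sum-fusion (assumed)
        return out                                      # (N, out_dim, h0, w0)

# Example with stage grids 32/16/8 for a 256 x 256 input and p = 8
feats = [torch.rand(4, 256, 32, 32), torch.rand(4, 512, 16, 16), torch.rand(4, 1024, 8, 8)]
print(PyramidalFeatureAggregation()(feats).shape)  # torch.Size([4, 256, 32, 32])
```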

5. Empirical Outcomes and Comparative Evaluation

On the DL3DV benchmark, MVP demonstrates state-of-the-art performance, with steady improvements as the view count increases (reported as PSNR, SSIM, LPIPS, and runtime):

  • 16 views: PSNR 23.76, SSIM 0.798, LPIPS 0.239, 0.09s
  • 32 views: PSNR 25.96, SSIM 0.847, LPIPS 0.187, 0.17s
  • 64 views: PSNR 27.73, SSIM 0.881, LPIPS 0.154, 0.36s
  • 128 views: PSNR 29.02, SSIM 0.903, LPIPS 0.134, 0.77s
  • 256 views: PSNR 29.67, SSIM 0.915, LPIPS 0.128, 1.84s

These results surpass iLRM and Long-LRM by more than 1 dB in PSNR, with runtime and memory consumption scaling gracefully in $N$. Competing baselines either become infeasible (running out of memory) or suffer greatly increased latency, demonstrating the critical impact of the intra-view hierarchy on practical scalability (Kang et al., 8 Dec 2025).

6. Relevance of Hierarchy in Broader Transformer Architectures

The fine-to-coarse intra-view hierarchy's success in MVP demonstrates the effectiveness of staged spatial reduction coupled with increasing embedding dimensionality for large-scale structured data processing. By aggregating spatial information at successively greater abstraction while preserving local details in early layers, such hierarchies may generalize to other transformer-based architectures where maintaining both efficiency and detail is paramount. The empirical finding that the local-to-global (inter-view) and fine-to-coarse (intra-view) hierarchies are orthogonal but complementary suggests a general design principle for efficient multi-scale, multi-view vision systems (Kang et al., 8 Dec 2025).
