Multi-view Pyramid Transformer (MVP)
- The paper introduces a scalable dual-hierarchy transformer architecture that efficiently aggregates local and global features for direct 3D scene reconstruction.
- MVP leverages alternating inter-view grouping and intra-view pooling with pyramidal feature aggregation to process high-resolution images with reduced computational cost.
- Empirical evaluations demonstrate state-of-the-art performance and speedup, achieving robust reconstruction accuracy even with hundreds of input views.
The Multi-view Pyramid Transformer (MVP) is a scalable multi-view transformer architecture designed for direct 3D scene reconstruction from tens to hundreds of posed images in a single feed-forward pass. MVP introduces a dual hierarchical mechanism—local-to-global inter-view grouping and fine-to-coarse intra-view spatial pooling—to reconcile computational efficiency with high-resolution, global 3D reasoning. By integrating these hierarchies, MVP achieves state-of-the-art results in generalizable scene reconstruction, particularly when combined with modern differentiable 3D representations such as Gaussian splatting, and offers efficient scalability to large numbers of views and high-resolution tokens (Kang et al., 8 Dec 2025).
1. Dual Hierarchies: Local-to-Global Grouping and Fine-to-Coarse Pooling
MVP concurrently leverages two pyramidal hierarchies:
- Inter-view hierarchy (local→global): Attention is structured to operate first within individual frames, then within small consecutive groups of views, and finally across all views. This phased expansion, from local to global, manages computational complexity while propagating global scene context.
- Intra-view hierarchy (fine→coarse): Within each image, tokens representing spatial patches are progressively merged. Each stage halves the spatial dimensions and doubles the channel capacity, constructing a feature pyramid where early layers capture fine local detail and later layers encode information-dense, wide-receptive-field representations.
This alternation between inter- and intra-view hierarchies ensures that attention operations remain computationally feasible (avoiding the quadratic $\mathcal{O}\big((VN)^2\big)$ cost of classic global attention, where $V$ is the number of views and $N$ the number of tokens per view) while effectively aggregating multi-scale context and detailed geometry (Kang et al., 8 Dec 2025).
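As a concrete illustration of the inter-view grouping, the following minimal PyTorch sketch (tensor shapes and function names are illustrative, not taken from the paper's implementation) shows how per-view token sequences can be merged into consecutive groups so that self-attention alternates between a per-frame scope and a group-wide scope:

```python
import torch

def group_tokens(x: torch.Tensor, group_size: int) -> torch.Tensor:
    """Merge consecutive views so self-attention can span a whole group.

    x: (V, N, C) tokens for V views, N tokens per view, C channels.
    Returns: (V // group_size, group_size * N, C).
    """
    V, N, C = x.shape
    assert V % group_size == 0, "number of views must be divisible by the group size"
    return x.reshape(V // group_size, group_size * N, C)

def ungroup_tokens(x: torch.Tensor, group_size: int) -> torch.Tensor:
    """Inverse of group_tokens: (V // G, G * N, C) -> (V, N, C)."""
    groups, gn, C = x.shape
    return x.reshape(groups * group_size, gn // group_size, C)

# Example: 16 views, 256 tokens per view, groups of 4 consecutive views.
tokens = torch.randn(16, 256, 64)
grouped = group_tokens(tokens, group_size=4)      # (4, 1024, 64): attention spans 4 views
restored = ungroup_tokens(grouped, group_size=4)  # back to per-frame scope
assert torch.equal(restored, tokens)
```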
2. Architectural Design and Mathematical Framework
The MVP architecture can ingest up to 256 high-resolution input images, each tokenized into spatial patches and register tokens, and processes them in three main transformer stages. The combined token matrix for stage $s$ is represented as $X_s \in \mathbb{R}^{V \times H_s W_s \times C_s}$, where $V$ is the number of views, $H_s \times W_s$ the spatial resolution, and $C_s$ the feature dimensionality.
Stages proceed as follows:
- Stage 1 (Frame-wise): Self-attention is applied independently within each view at patch-level granularity.
- Stage 2 (Group-wise): Views are partitioned into consecutive groups of size $G$ (default $G = 4$). Grouped frames are processed through alternating per-frame and group-wide self-attention, reducing context fragmentation while maintaining tractability.
- Stage 3 (Global): All views form a single group ($G = V$), where the attention module alternates between per-frame and full global self-attention ("Alternating Attention" block).
Between stages, a convolutional reduction operation halves the spatial dimensions ($H_{s+1} = H_s / 2$, $W_{s+1} = W_s / 2$) and doubles the channel dimension ($C_{s+1} = 2 C_s$), forming the intra-view feature pyramid.
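A minimal sketch of the between-stage reduction follows, assuming a strided 2×2 convolution as the reduction operator; the paper's exact operator may differ, but the shape bookkeeping (halved spatial resolution, doubled channels) matches the description above:

```python
import torch
import torch.nn as nn

class StageReduction(nn.Module):
    """Halve spatial resolution and double channels between transformer stages.

    A strided 2x2 convolution is assumed here as the reduction operator;
    the paper's exact choice may differ, but the transition
    (H, W, C) -> (H/2, W/2, 2C) follows the text above.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, 2 * channels, kernel_size=2, stride=2)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (V, H*W, C) -> image-like map (V, C, H, W) -> reduce -> tokens again
        v, n, c = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(v, c, h, w)
        fmap = self.reduce(fmap)                # (V, 2C, H/2, W/2)
        return fmap.flatten(2).transpose(1, 2)  # (V, (H/2)*(W/2), 2C)

x = torch.randn(8, 32 * 32, 128)        # 8 views, 32x32 patch grid, 128 channels
y = StageReduction(128)(x, h=32, w=32)  # (8, 256, 256): 16x16 tokens, 256 channels
```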
3. Computational Complexity and Efficiency
At stage $s$, with the $V$ views partitioned into groups of $G$ views and $N_s = H_s W_s$ tokens per view, the self-attention computation per block scales as
$$\mathcal{O}\!\left(\tfrac{V}{G} \, (G N_s)^2 C_s\right) = \mathcal{O}\!\left(V G N_s^2 C_s\right),$$
since the majority of attention is localized within groups, unlike global attention, which would incur $\mathcal{O}\!\left((V N_s)^2 C_s\right)$ cost. This approach enables scaling to hundreds of views and high-resolution tokens without incurring out-of-memory (OOM) errors. Notably, per-stage compute is dominated by the middle transformer block, owing to the interplay between the pyramidal contraction of the spatial token count and the increase in channel capacity.
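The asymptotic argument can be checked with a back-of-the-envelope count of query-key pairs, the dominant term in attention cost. The following sketch (all counts are illustrative, not measurements from the paper) shows that grouping views reduces the number of pairs by a factor of $V/G$ relative to full global attention:

```python
def attention_pairs(num_views: int, tokens_per_view: int, group_size: int) -> tuple[int, int]:
    """Count query-key pairs, the dominant term in self-attention cost.

    Grouped attention: each of the V/G groups attends over G*N tokens.
    Global attention: all V*N tokens attend over one another.
    """
    grouped = (num_views // group_size) * (group_size * tokens_per_view) ** 2
    global_ = (num_views * tokens_per_view) ** 2
    return grouped, global_

grouped, global_ = attention_pairs(num_views=256, tokens_per_view=1024, group_size=4)
print(f"grouped: {grouped:.2e}  global: {global_:.2e}  ratio: {global_ / grouped:.0f}x")
# The ratio equals num_views / group_size (here 64x fewer pairs for grouped attention).
```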
Empirically, inference latency on an NVIDIA H100 using FlashAttention3 is approximately 0.09 seconds for 16 views, 0.36 seconds for 64 views, scaling to 1.84 seconds for 256 views (Kang et al., 8 Dec 2025).
4. Integration with Differentiable 3D Gaussian Splatting
Post-transformer, MVP aggregates multi-scale output tokens using Pyramidal Feature Aggregation (PFA), which fuses the feature maps $F_s$ produced by each stage $s$ into a single per-patch representation. The resulting per-patch features are decoded to 3D Gaussian parameters—centers, scales, rotations, opacities, and spherical-harmonic colors—which are rendered in a differentiable pipeline (e.g., gsplat), enabling end-to-end optimization of 3D scene appearance.
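A hedged sketch of PFA and the Gaussian decoding head is given below; the upsample-and-concatenate fusion, the 1×1 convolutional head, and the parameter layout are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAggregationHead(nn.Module):
    """Fuse stage-wise feature maps and decode per-patch 3D Gaussian parameters.

    Upsample-and-concatenate fusion and a 1x1 convolutional head are
    illustrative assumptions; the paper's PFA and decoder may differ.
    """
    def __init__(self, stage_channels=(128, 256, 512),
                 gaussian_dim=3 + 3 + 4 + 1 + 3):
        super().__init__()
        # gaussian_dim: center (3) + scale (3) + rotation quaternion (4)
        #             + opacity (1) + DC color (3); higher-order SH omitted here.
        self.head = nn.Conv2d(sum(stage_channels), gaussian_dim, kernel_size=1)

    def forward(self, feats):
        # feats: list of (V, C_s, H_s, W_s) stage feature maps, finest first.
        target = feats[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
             for f in feats],
            dim=1,
        )
        return self.head(fused)  # (V, gaussian_dim, H_1, W_1) raw Gaussian parameters

feats = [torch.randn(2, 128, 64, 64), torch.randn(2, 256, 32, 32), torch.randn(2, 512, 16, 16)]
params = PyramidAggregationHead()(feats)  # (2, 14, 64, 64)
```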
The core optimization objective combines a photometric loss across held-out target views with a view-dependent opacity regularizer:
$$\mathcal{L} = \sum_{t} \mathcal{L}_{\text{photo}}\!\left(I_t, \hat{I}_t\right) + \lambda \, \mathcal{L}_{\text{opacity}},$$
with $I_t$ and $\hat{I}_t$ the ground-truth and rendered images for target view $t$; the opacity term is evaluated at a view direction sampled per Gaussian.
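A minimal sketch of this objective follows, assuming an L2 photometric term and a mean-opacity penalty; the paper's exact photometric loss (e.g., any perceptual component) and the precise form of the view-dependent regularizer are not reproduced here:

```python
import torch

def reconstruction_loss(rendered: torch.Tensor, target: torch.Tensor,
                        opacities: torch.Tensor, reg_weight: float = 0.01) -> torch.Tensor:
    """Photometric loss over held-out target views plus an opacity regularizer.

    rendered, target: (T, 3, H, W) rendered and ground-truth target images.
    opacities: per-Gaussian opacity evaluated at a sampled view direction.
    The L2 photometric term, mean-opacity penalty, and reg_weight value
    are illustrative choices, not the paper's exact formulation.
    """
    photometric = torch.mean((rendered - target) ** 2)
    opacity_reg = opacities.mean()
    return photometric + reg_weight * opacity_reg
```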
5. Training Regimen and Implementation Details
MVP is trained using a three-stage progressive curriculum on the DL3DV dataset, with image resolution and learning rate adjusted per stage:
- Stage 1: 32 input views and 12 target views, 100K iterations.
- Stage 2: 32 input views and 6 target views, 50K iterations.
- Stage 3: 16–256 input views, 30K iterations; only the global attention block is fine-tuned at this stage.
Technical enhancements include the AdamW optimizer, cosine learning-rate decay, mixed-precision bfloat16 computation, FlashAttention2/3 kernel support, gradient checkpointing, and an exponential moving average (EMA) of the weights. Camera pose information is encoded using 9D Plücker ray maps augmented with PRoPE. Each transformer block comprises LayerNorm, QK-Norm (RMSNorm), multi-head attention (head dimension 64), and a GELU-activated MLP (Kang et al., 8 Dec 2025).
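For reference, a minimal PyTorch sketch of such a block is shown below (pose conditioning via Plücker/PRoPE, dropout, and other details are omitted; `nn.RMSNorm` requires a recent PyTorch release):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBlock(nn.Module):
    """Pre-norm transformer block: LayerNorm -> QK-normalized attention -> GELU MLP.

    Head dimension defaults to 64 as described; pose conditioning, dropout,
    and other details of the paper's block are omitted in this sketch.
    """
    def __init__(self, dim: int, head_dim: int = 64, mlp_ratio: int = 4):
        super().__init__()
        assert dim % head_dim == 0
        self.num_heads = dim // head_dim
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.RMSNorm(head_dim)   # QK-Norm; requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(head_dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        qkv = self.qkv(self.norm1(x)).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, heads, N, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)           # stabilize attention logits
        attn = F.scaled_dot_product_attention(q, k, v)  # dispatches to fused/flash kernels
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, c))
        return x + self.mlp(self.norm2(x))
```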
6. Empirical Evaluation and Ablation Studies
MVP achieves state-of-the-art or near state-of-the-art reconstruction accuracy (measured by PSNR and LPIPS) with significant speedup over prior art.
| Baseline/Variant | MVP PSNR gain (DL3DV) | MVP speedup (DL3DV) | MVP PSNR gain (RE10K) |
|---|---|---|---|
| Long-LRM, iLRM | +1–3 dB | 2–250× | >2 dB |
| 3D-GS optimization | ~matched | >250× | – |
| CLiFT, iLRM (RE10K) | – | – | >2 dB |
Ablation studies indicate that:
- Removing PFA reduces PSNR by 1.2 dB (LPIPS increases by 0.10).
- Removing group attention, or replacing it with global-only attention, yields a 0.4–1.2 dB drop and a 6× slowdown, with OOM beyond 128 views.
- Disabling the intra-view hierarchy results in a 50–80× slowdown, with OOM at 256 views.
- Reversing the order of hierarchies results in catastrophic degradation (−4 dB).
Moreover, MVP's reconstruction quality is resilient to increasing numbers of input views, with PSNR continuing to improve as views are added, in contrast to the saturation observed in baseline architectures (Kang et al., 8 Dec 2025).
7. Strengths, Limitations, and Prospects
MVP’s principal strengths are its scalability to very large, high-resolution multi-view configurations within a single forward pass, its balanced combination of high-fidelity local and global information, and its orders-of-magnitude acceleration compared to optimization-based baselines.
However, MVP currently presumes accurately known camera poses and is specialized for static scenes. It relies solely on photometric supervision, so recovered 3D geometry may benefit from auxiliary depth or multi-view stereo (MVS) losses.
Potential directions for advancing the MVP architecture include:
- Self-supervised pose and geometry estimation to enable pose-free operation.
- Incorporation of temporal reasoning for dynamic or 4D scene modeling, potentially by extending the hierarchical grouping paradigm.
- Supervision with explicit geometry (depth, normals) for improved structure recovery.
- Application of the dual-hierarchy approach to broader multi-view tasks (semantic segmentation, detection, relighting).
- Exploration of adaptive group sizes for trade-offs between computational cost and reconstruction quality (Kang et al., 8 Dec 2025).
MVP’s principled, dual-pyramid mechanism establishes a template for efficient, high-fidelity, transformer-based multi-view 3D reasoning applicable to both current and future directions in scene understanding.