Local View Transformer (LVT)
- Local View Transformer is an architectural paradigm that enhances 3D scene reconstruction by applying spatially-local attention and relative pose conditioning.
- It restricts self-attention to local neighborhoods, enabling linear scaling and efficient processing of high-resolution, multi-view data.
- LVT supports practical applications in VR/AR, robotics, and neural rendering by delivering robust, high-fidelity novel view synthesis and scene representation.
The Local View Transformer (LVT) is an architectural paradigm for transformer models that targets scalable, efficient, and high-fidelity scene reconstruction and novel view synthesis by incorporating spatial locality into the attention operation. Unlike standard transformers, which compute dense global self-attention and thus scale quadratically with the number of input tokens (such as image patches or camera views), LVT restricts attention to spatially local neighborhoods, introduces relative geometric conditioning between views, and decodes into an expressive 3D scene representation. This approach is motivated by the insight that, for large-scale 3D reconstruction, spatially nearby views carry a stronger signal about local scene composition than distant views. The result is linear scaling in the number of input views, making LVT suitable for arbitrarily large, high-resolution scenes in a single forward pass (Imtiaz et al., 29 Sep 2025).
1. Architectural Principles and Local Attention
At its core, LVT modifies the standard transformer stack by replacing global, all-to-all self-attention with locally restricted attention. Input images are first patchified, typically by applying convolutional operations that stack per-view RGB information with per-patch local ray direction maps for each view; these patches are then embedded to produce the input tokens for each view.
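A minimal sketch of this tokenization step is shown below, assuming per-pixel ray direction maps concatenated channel-wise with RGB and a strided convolution for patch embedding; the channel layout, patch size, and embedding width are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ViewTokenizer(nn.Module):
    """Illustrative patch embedding: RGB concatenated with per-pixel ray
    directions, patchified by a strided convolution. Hyperparameters and
    channel layout are assumptions, not the paper's exact configuration."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        # 3 RGB channels + 3 ray-direction channels per pixel (assumed layout).
        self.proj = nn.Conv2d(6, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, rgb: torch.Tensor, ray_dirs: torch.Tensor) -> torch.Tensor:
        # rgb, ray_dirs: (B, 3, H, W) -> tokens: (B, num_patches, embed_dim)
        x = torch.cat([rgb, ray_dirs], dim=1)
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
```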
Attention within each transformer block is not performed across all tokens from all views; instead, each query token in a given view attends only to tokens from a local neighborhood of spatially proximate or temporally adjacent camera views. This local attention window can be defined by distance in physical or pose space, ensuring that attention considers only the neighboring views most likely to contribute relevant information about local scene content.
The local attention mechanism is augmented by conditioning on relative geometric transformation between the query view and its neighbors, as opposed to using absolute positional information or global 3D coordinate encodings. This conditional attention retains permutation invariance with respect to input orderings and supports the aggregation of local evidence in a scalable way.
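The following sketch illustrates one way such view-local attention could be realized, assuming neighborhoods are chosen by k-nearest camera centers and ignoring multi-head structure and pose conditioning for brevity; the neighborhood criterion and tensor shapes are assumptions for illustration.

```python
import torch

def nearest_view_neighbors(cam_positions: torch.Tensor, k: int) -> torch.Tensor:
    """For each view, return the indices of its k nearest views (including
    itself) by camera-center distance. A pose-space or temporal criterion
    could be substituted; this metric is an assumption."""
    dists = torch.cdist(cam_positions, cam_positions)   # (V, V)
    return dists.topk(k, largest=False).indices         # (V, K)

def local_view_attention(q, k, v, neighbors):
    """Each query view attends only to tokens of its neighboring views.
    q, k, v: (V, P, D) per-view token tensors; neighbors: (V, K) view indices.
    Minimal single-head sketch without pose conditioning."""
    V, P, D = q.shape
    K = neighbors.shape[1]
    k_local = k[neighbors].reshape(V, K * P, D)          # gather neighbor keys
    v_local = v[neighbors].reshape(V, K * P, D)          # gather neighbor values
    attn = torch.softmax(q @ k_local.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ v_local                                # (V, P, D)
```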
2. Relative Pose-Based Positional Encoding
A distinctive feature of LVT is its use of geometric relational encoding between views, moving beyond global or purely sequential positional encodings. For every attention operation between a query token and a source token, an explicit conditioning feature is generated from the relative transformation between the poses of the two views. This relative transformation is parameterized by a quaternion rotation and a translation, passed through a positional encoding (such as a rotary embedding), and mapped by a multilayer perceptron to the conditioning feature. The key tokens used in the attention mechanism are modified by this feature, and the value passed for aggregation in the attention operation is similarly augmented. This approach encodes the spatial relationship between the views directly into the attention weights, allowing the transformer to learn how relative pose affects information transfer and fusion, while avoiding explicit global coordinate dependencies.
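A hedged sketch of this relative-pose conditioning is given below. It assumes camera-to-world pose matrices, folds the positional-encoding step directly into the MLP (omitting the rotary embedding), and fuses the conditioning feature additively into keys and values; each of these choices is an illustrative assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

def relative_pose(pose_q: torch.Tensor, pose_s: torch.Tensor) -> torch.Tensor:
    """Relative transform from source to query view: pose_s^{-1} @ pose_q.
    pose_q, pose_s: (4, 4) camera-to-world matrices (assumed convention)."""
    return torch.linalg.inv(pose_s) @ pose_q

def quat_trans_features(rel: torch.Tensor) -> torch.Tensor:
    """Parameterize a relative transform as (quaternion, translation).
    Uses the simple trace-based quaternion formula; a numerically robust
    conversion from a transforms library would be preferable in practice."""
    R, t = rel[:3, :3], rel[:3, 3]
    w = torch.sqrt(torch.clamp(1.0 + R[0, 0] + R[1, 1] + R[2, 2], min=1e-8)) / 2.0
    x = (R[2, 1] - R[1, 2]) / (4.0 * w)
    y = (R[0, 2] - R[2, 0]) / (4.0 * w)
    z = (R[1, 0] - R[0, 1]) / (4.0 * w)
    return torch.cat([torch.stack([w, x, y, z]), t])     # (7,)

class PoseConditioner(nn.Module):
    """Maps the (quaternion, translation) relative-pose features to a
    per-pair conditioning vector added to source keys and values. The
    additive fusion and MLP width are assumptions for illustration."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(7, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, k_src, v_src, pose_q, pose_s):
        c = self.mlp(quat_trans_features(relative_pose(pose_q, pose_s)))
        return k_src + c, v_src + c          # conditioned keys and values
```

Because only relative transforms enter the network, the conditioning is independent of any global coordinate frame, which is what permits the permutation invariance and scene-scale invariance described above.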
3. Scene Representation via 3D Gaussian Splatting
LVT decodes learned tokens into a continuous 3D scene by producing a Gaussian Splat representation. Each pixel (and its corresponding feature) is mapped to parameters specifying the scale, orientation (as a quaternion), and depth of a 3D Gaussian:
- 3-channel scale per Gaussian
- 4-channel rotation (unit quaternion)
- Scalar depth along the corresponding local ray
To enable view-dependent effects (e.g., specularity, thin structures), each Gaussian is further augmented with color and opacity modeled via spherical harmonics (SH), as sketched in the example after this list:
- Per-Gaussian color represented by a set of SH coefficients
- Per-Gaussian opacity represented by a set of SH coefficients
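The sketch below illustrates a plausible per-pixel decoding head for these parameters; the SH degree, activation functions, and depth range are assumptions, and the mapping from depth to a 3D Gaussian center along the local ray is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Per-pixel decoder into 3D Gaussian parameters. SH degree and the
    activation choices (softplus scale, sigmoid depth in [near, far]) are
    illustrative assumptions, not the paper's exact parameterization."""

    def __init__(self, feat_dim: int = 768, sh_degree: int = 2,
                 near: float = 0.1, far: float = 100.0):
        super().__init__()
        self.near, self.far = near, far
        n_sh = (sh_degree + 1) ** 2                      # SH basis functions
        self.n_color_sh, self.n_opacity_sh = 3 * n_sh, n_sh
        out_dim = 3 + 4 + 1 + self.n_color_sh + self.n_opacity_sh
        self.head = nn.Linear(feat_dim, out_dim)

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (N_pixels, feat_dim)
        p = self.head(feats)
        scale = F.softplus(p[:, 0:3])                    # positive 3D scale
        rotation = F.normalize(p[:, 3:7], dim=-1)        # unit quaternion
        depth = self.near + torch.sigmoid(p[:, 7:8]) * (self.far - self.near)
        color_sh = p[:, 8:8 + self.n_color_sh]
        opacity_sh = p[:, 8 + self.n_color_sh:]
        # Gaussian center = camera origin + depth * local ray direction (not shown).
        return dict(scale=scale, rotation=rotation, depth=depth,
                    color_sh=color_sh, opacity_sh=opacity_sh)
```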
The full set of splats is rendered using standard volumetric splatting, and a regularization term on opacity is included in the training objective to suppress spurious artifacts. The loss function for training combines a mean squared error (MSE) term, a perceptual (LPIPS) term, and this opacity regularizer.
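A minimal sketch of such a combined objective, with assumed loss weights and an assumed L1 form for the opacity regularizer, might look as follows; `lpips_fn` stands in for any perceptual metric, e.g. the `lpips` package.

```python
import torch
import torch.nn.functional as F

def lvt_loss(pred, target, opacity, lpips_fn, w_lpips=0.5, w_reg=0.01):
    """Training objective combining MSE, LPIPS, and an opacity regularizer.
    The weights and the L1 penalty form are illustrative assumptions."""
    mse = F.mse_loss(pred, target)
    perceptual = lpips_fn(pred, target).mean()
    reg = opacity.abs().mean()          # penalize spurious/unused splats
    return mse + w_lpips * perceptual + w_reg * reg
```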
4. Computational Efficiency and Scaling
LVT removes the quadratic bottleneck inherent in global self-attention. By attending only to a fixed-size or adaptively chosen set of neighbors (typically defined by a spatial or pose-based radius), the attention operation scales linearly with the number of input views. This allows LVT to process very long video sequences or very large image sets for scene reconstruction without prohibitive compute or memory costs.
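A back-of-the-envelope comparison illustrates the scaling difference; the view, patch, and neighborhood counts below are purely illustrative.

```python
def attention_cost(num_views, patches_per_view, neighbors=None):
    """Rough count of query-key interactions. Global attention costs
    (V*P)^2; a fixed K-view neighborhood costs V*P * K*P, i.e. linear
    in the number of views V."""
    tokens = num_views * patches_per_view
    if neighbors is None:                          # global self-attention
        return tokens ** 2
    return tokens * neighbors * patches_per_view   # view-local attention

# Example: 512 views, 256 patches each, 8-view neighborhood.
print(attention_cost(512, 256))        # ~1.7e10 interactions (quadratic in views)
print(attention_cost(512, 256, 8))     # ~2.7e8 interactions (linear in views)
```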
Locality also enables the stacking of multiple transformer blocks, allowing effective receptive fields to grow across network depth—paralleling the multi-layer aggregation strategies seen in CNNs, but with the additional representational power of attention mechanisms and pose-conditioning.
5. Performance and Generalization
Empirical evaluation shows that LVT achieves state-of-the-art performance on large-scale multi-view datasets:
- On DL3DV, LVT improves PSNR over Long-LRM and the per-scene optimized variant of 3DGS by +3.5 dB and +2.1 dB, respectively.
- In zero-shot generalization, LVT performs robustly on novel datasets (Tanks and Temples, Mip-NeRF360) with varying sequence lengths and camera trajectories, demonstrating invariance to global scene scale and layout.
Scalability in input size and scene scale is confirmed by linear memory and compute cost growth with the number of input views, in contrast to quadratic scaling in global attention transformers.
6. Applications Across 3D Computer Vision
LVT's architecture is suitable for:
- Real-time novel view synthesis for VR/AR and interactive visualization, enabling dynamic rendering from arbitrary camera poses.
- Large-scale 3D mapping and scene capture for robotics, autonomous navigation, and environment digitization, removing practical limits on scene extent.
- High-resolution neural rendering for cultural heritage, film, gaming, and virtual tours, where scene detail and photorealism are paramount.
Since the model does not require per-scene optimization or scene-specific training, it is practical for generalized, out-of-the-box 3D scene processing pipelines.
7. Related Paradigms and Impact
The Local View paradigm in LVT extends trends in vision transformers that prioritize locality (e.g., windowed attention, local-to-global designs) (Li et al., 2021, Zhang et al., 2022, Pan et al., 2023), but specializes the notion of "locality" to the view-space—a natural choice when input signals are images from spatially or temporally adjacent cameras. The conditioning on geometric relationships between views (using pose encodings) is a further departure from the purely content-based or absolute-position encoding strategies seen elsewhere, and enables strong generalization across scenes of arbitrary shape and layout.
A plausible implication is that approaches based on LVT could influence future multi-view and neural rendering models in domains such as autonomous robotics, digital twins, and simulation for synthetic data generation, due to their efficiency and scalability properties.