- The paper introduces iLRM, an iterative refinement model that decouples scene representation from input images to reduce computational costs in 3D reconstruction.
- It employs a two-stage attention mechanism with token uplifting and mini-batch strategies to integrate high-resolution cues effectively.
- Empirical results demonstrate improvements of 3–4 dB in PSNR and reduced computation time compared to state-of-the-art methods.
Iterative Large 3D Reconstruction Model (iLRM): Architecture, Scalability, and Empirical Analysis
Introduction and Motivation
The iLRM framework addresses critical limitations in feed-forward 3D Gaussian Splatting (3D-GS) models, particularly their scalability and representational efficiency. Existing transformer-based approaches for generalizable 3D reconstruction suffer from quadratic computational complexity with respect to the number of input views and image resolution, primarily due to full attention across all image tokens. Moreover, pixel-aligned Gaussian generation leads to excessive redundancy, as the number of Gaussians is tightly coupled to image resolution, often far exceeding the minimal set required for accurate scene representation.
iLRM introduces an iterative refinement mechanism, decoupling the scene representation from input images and decomposing multi-view interactions into a two-stage attention scheme. This design enables efficient utilization of high-resolution image cues while maintaining a compact set of 3D Gaussians, facilitating both high-fidelity reconstruction and superior scalability.
Figure 1: The overall architecture and qualitative results of the proposed iLRM.
Architectural Design
Decoupling and Tokenization
iLRM initializes scene representation using viewpoint-centric embeddings derived from Plücker ray coordinates, which encode spatial and directional information for each camera pose. These embeddings are partitioned into non-overlapping patches and linearly projected to form viewpoint tokens. Multi-view images are similarly tokenized by concatenating RGB patches with corresponding Plücker ray patches, followed by linear projection.
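As a rough illustration, the sketch below shows one way this tokenization could be written in PyTorch. The patch sizes, embedding dimension, and module names are assumptions for illustration, not the authors' implementation; the only grounded details are the Plücker ray channels (direction plus moment) and the patchify-then-project structure.

```python
import torch
import torch.nn as nn


def plucker_rays(origins, dirs):
    """Per-pixel Plücker coordinates (direction, moment = o x d): 6 channels.
    origins, dirs: (N, H, W, 3) camera-ray origins and unit directions."""
    return torch.cat([dirs, torch.cross(origins, dirs, dim=-1)], dim=-1)


def patchify(x, p):
    """(N, H, W, C) -> (N, (H/p)*(W/p), C*p*p), non-overlapping p x p patches."""
    N, H, W, C = x.shape
    x = x.view(N, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(N, (H // p) * (W // p), p * p * C)


class Tokenizer(nn.Module):
    """Builds low-resolution viewpoint tokens (Plücker-only patches) and
    higher-resolution image tokens (RGB concatenated with Plücker patches).
    Hypothetical module; patch sizes and dim are assumptions."""
    def __init__(self, patch_img=8, patch_view=16, dim=768):
        super().__init__()
        self.p_img, self.p_view = patch_img, patch_view
        self.view_proj = nn.Linear(6 * patch_view ** 2, dim)  # rays only
        self.img_proj = nn.Linear(9 * patch_img ** 2, dim)    # RGB + rays

    def forward(self, rgb, origins, dirs):
        rays = plucker_rays(origins, dirs)                     # (N, H, W, 6)
        view_tokens = self.view_proj(patchify(rays, self.p_view))
        img_tokens = self.img_proj(patchify(torch.cat([rgb, rays], -1), self.p_img))
        return view_tokens, img_tokens
```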
Scalable Multi-View Context Modeling
The core architectural innovation lies in the two-stage attention mechanism, sketched in code after Figure 2:
- Per-View Cross-Attention: Each viewpoint token interacts only with its corresponding image tokens, drastically reducing the computational cost compared to full multi-view attention.
- Global Self-Attention: Viewpoint tokens are refined via self-attention, enabling global information exchange across views. Operating in a low-resolution embedding space ensures tractability even with many input views.
Figure 2: The proposed scalable architectural designs by decoupling viewpoint and image tokens, and modeling the global interactions via cross- and self-attentions (N: number of views, h=H/p, w=W/p).
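A minimal PyTorch sketch of the two-stage scheme referenced above is given below. Folding the view axis into the batch makes the cross-attention cost scale linearly with the number of views, while self-attention operates only on the much smaller set of viewpoint tokens. The head count, normalization placement, and residual connections are assumptions used for illustration.

```python
import torch
import torch.nn as nn


class TwoStageAttention(nn.Module):
    """Sketch of per-view cross-attention followed by global self-attention."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, view_tok, img_tok):
        # view_tok: (B, N, T_v, D) viewpoint tokens; img_tok: (B, N, T_i, D)
        B, N, Tv, D = view_tok.shape
        # 1) Per-view cross-attention: fold the view axis into the batch so
        #    each viewpoint token only sees image tokens from its own view.
        q = self.norm1(view_tok).reshape(B * N, Tv, D)
        kv = img_tok.reshape(B * N, -1, D)
        view_tok = view_tok + self.cross(q, kv, kv)[0].reshape(B, N, Tv, D)
        # 2) Global self-attention: flatten all views' viewpoint tokens so
        #    information is exchanged across views in the low-res space.
        x = self.norm2(view_tok).reshape(B, N * Tv, D)
        view_tok = view_tok + self.self_attn(x, x, x)[0].reshape(B, N, Tv, D)
        return view_tok
```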
Token Uplifting and Mini-Batch Cross-Attention
To bridge the resolution gap between low-resolution viewpoint tokens and high-resolution image tokens, iLRM employs a token uplifting strategy: each viewpoint token is expanded into multiple finer-grained queries before cross-attention, then compressed back post-attention. This mechanism preserves spatial correspondence and enables effective integration of high-resolution cues.
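The sketch below illustrates one way such an uplift-attend-compress step could look. The uplift factor k and the use of simple linear layers for expansion and fusion are assumptions; only the expand-before, compress-after pattern is taken from the description above.

```python
import torch
import torch.nn as nn


class TokenUplift(nn.Module):
    """Sketch of token uplifting: expand each viewpoint token into k finer
    queries for cross-attention, then fuse them back (hypothetical module)."""
    def __init__(self, dim=768, k=4, heads=12):
        super().__init__()
        self.k = k
        self.up = nn.Linear(dim, k * dim)      # 1 token -> k finer queries
        self.down = nn.Linear(k * dim, dim)    # k attended queries -> 1 token
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, view_tok, img_tok):
        # view_tok: (B, T_v, D) low-res viewpoint tokens for one view
        # img_tok:  (B, T_i, D) high-res image tokens for the same view
        B, Tv, D = view_tok.shape
        q = self.up(view_tok).reshape(B, Tv * self.k, D)         # uplift
        attended, _ = self.cross(q, img_tok, img_tok)            # high-res cues
        fused = self.down(attended.reshape(B, Tv, self.k * D))   # compress back
        return view_tok + fused                                  # residual update
```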
Mini-batch cross-attention schemes further enhance scalability by sampling subsets of tokens during cross-attention, analogous to mini-batch gradient descent. Structured sampling strategies are adopted for efficient implementation without significant performance degradation.
Figure 3: Various mini-batch cross-attention schemes.
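As an illustration, the sketch below implements one possible structured variant: a strided subset of image tokens that rotates across update steps. The specific sampling pattern here is an assumption; the paper compares several schemes (Figure 3). `cross_attn` is assumed to be a batch-first `nn.MultiheadAttention` module.

```python
import torch


def minibatch_cross_attention(cross_attn, view_tok, img_tok, num_chunks=4, step=0):
    """Cross-attend against a 1/num_chunks strided subset of image tokens.
    view_tok: (B, T_v, D) queries; img_tok: (B, T_i, D) keys/values."""
    idx = torch.arange(step % num_chunks, img_tok.shape[1], num_chunks,
                       device=img_tok.device)
    kv = img_tok[:, idx]                       # sampled key/value subset
    out, _ = cross_attn(view_tok, kv, kv)      # roughly num_chunks x cheaper
    return view_tok + out
```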
Iterative Refinement
The network stacks multiple update blocks, each comprising a cross-attention layer followed by a self-attention layer. Within each block, the viewpoint tokens are refined, mimicking the feedback-driven optimization of per-scene methods. The final viewpoint tokens are decoded into Gaussian parameters via a linear layer.
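Putting the pieces together, a minimal sketch of the refinement loop might look as follows, reusing the TwoStageAttention sketch above. The number of blocks and the 14-channel Gaussian parameterization (position, scale, rotation, opacity, color) are assumptions for illustration.

```python
import torch.nn as nn


class iLRMSketch(nn.Module):
    """Sketch of iterative refinement: L stacked update blocks, then a linear
    head decoding refined viewpoint tokens into Gaussian parameters."""
    def __init__(self, dim=768, layers=12, gaussian_dim=14):
        super().__init__()
        self.blocks = nn.ModuleList(
            [TwoStageAttention(dim) for _ in range(layers)])  # sketch above
        self.head = nn.Linear(dim, gaussian_dim)              # Gaussian decoder

    def forward(self, view_tok, img_tok):
        for blk in self.blocks:                 # iterative refinement
            view_tok = blk(view_tok, img_tok)   # feedback-like update
        return self.head(view_tok)              # per-token Gaussian parameters
```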
Training Objectives and Implementation
iLRM is trained using a combination of MSE and perceptual loss (VGG-based), supervising rendered images against ground-truth targets. The architecture leverages FlashAttention-2 for efficient attention computation, gradient checkpointing for memory savings, and mixed-precision training (BFloat16). Camera pose normalization ensures consistent scene alignment.
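A hedged sketch of such a training objective is shown below. The perceptual-loss weight and the choice of VGG-19 feature layer are assumptions, and input normalization is omitted for brevity.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19


class RenderLoss(nn.Module):
    """Sketch of the objective: MSE plus a VGG-based perceptual term."""
    def __init__(self, perc_weight=0.5):
        super().__init__()
        # Frozen VGG-19 features (up to relu3_3) as a perceptual extractor.
        self.vgg = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.perc_weight = perc_weight

    def forward(self, rendered, target):
        # rendered, target: (B, 3, H, W) in [0, 1]
        mse = F.mse_loss(rendered, target)
        perc = F.l1_loss(self.vgg(rendered), self.vgg(target))
        return mse + self.perc_weight * perc
```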
Empirical Results
On the RE10K dataset, iLRM achieves a PSNR improvement of approximately 3 dB over state-of-the-art baselines (GS-LRM, DepthSplat) when leveraging more views (8 vs. 2) at less than half the computation time (0.028s vs. 0.065s). On DL3DV, iLRM improves PSNR by ~4 dB under comparable computational budgets, efficiently utilizing up to four times more views.
Qualitative Analysis
iLRM consistently produces sharper novel view synthesis with fewer artifacts compared to baselines, as demonstrated on both RE10K and DL3DV datasets.
Figure 4: Qualitative comparison of novel view synthesis on the RE10K dataset.
Figure 5: Qualitative comparison of novel view synthesis on the DL3DV dataset.
Figure 6: Qualitative comparison of novel view synthesis on the high-resolution (512×960) DL3DV dataset.
Scalability and Efficiency
The decoupling of viewpoint and image tokens allows iLRM to generate compact Gaussian sets while leveraging high-resolution images, outperforming baselines at identical output density while requiring fewer computational resources. Mini-batch cross-attention reduces training FLOPs by up to 4.7× with only marginal drops in reconstruction quality.
Ablation Studies
Increasing the number of update layers yields consistent performance gains, confirming the efficacy of iterative refinement. Token uplifting and self-attention are critical for capturing fine-grained spatial correspondences and global dependencies, respectively. Group-attention variants incur higher computational cost without performance benefits, validating the two-stage attention design.
Limitations
While iLRM significantly reduces computational overhead, self-attention across many views remains a bottleneck as input size grows. The requirement for known camera poses limits applicability to raw video datasets; extending iLRM to pose-free settings is a promising direction. Hierarchical or sparse attention mechanisms may further enhance scalability.
Implications and Future Directions
iLRM demonstrates that iterative refinement and decoupled representations are essential for scalable, high-quality feed-forward 3D reconstruction. The architecture is well-suited for large-scale, real-time applications and generalizes robustly across diverse scenes and viewpoint configurations. Future work should explore pose-free reconstruction, efficient attention alternatives, and integration with foundation models for further generalization.
Conclusion
iLRM establishes a new paradigm for feed-forward 3D reconstruction by combining iterative refinement, decoupled scene representation, and scalable attention mechanisms. The empirical results substantiate its superiority in both reconstruction quality and computational efficiency. The framework provides a solid foundation for future research in scalable, generalizable 3D scene modeling.