iLRM: Iterative Large 3D Reconstruction Model

Updated 1 August 2025
  • iLRM is a feed-forward neural framework that iteratively refines compact viewpoint tokens to achieve scalable, high-fidelity 3D scene reconstruction.
  • Its novel two-stage attention mechanism decouples per-view features from the scene representation, reducing memory usage while capturing high-resolution details.
  • Empirical evaluations show iLRM delivers higher PSNR and faster inference than state-of-the-art methods, making it well suited to AR/VR, robotics, and photogrammetry.

An Iterative Large 3D Reconstruction Model (iLRM) is a feed-forward neural framework designed for scalable, high-fidelity 3D scene reconstruction from multi-view images, with an explicit focus on computational efficiency and quality improvements as the number of input views and the resolution increase. iLRM introduces an iterative refinement mechanism, a novel two-stage attention architecture, and a decoupled compact scene representation, enabling effective integration of high-resolution image information and scalability to large numbers of views while maintaining rapid inference and superior reconstruction accuracy (Kang et al., 31 Jul 2025).

1. Guiding Principles and Architecture

iLRM advances feed-forward 3D modeling by addressing the memory and computational bottlenecks inherent in transformer-based methods that perform global attention across all image tokens. Its architecture is anchored in three principles:

  1. Decoupling Scene Representation from Input Images: Rather than directly regressing a dense set of per-pixel Gaussians from every input image, iLRM creates a compact set of “viewpoint embeddings” (low-dimensional tokens associated with each view) which are progressively refined into 3D Gaussian parameters. This approach reduces the memory requirements and ensures that the number of scene tokens remains manageable regardless of the input image resolution.
  2. Two-Stage Attention Mechanism: Each update layer is structured to first apply per-view cross-attention between viewpoint tokens and high-resolution image tokens (encoding both RGB and Plücker camera pose descriptors), followed by a global self-attention across viewpoint tokens. This contrasts with prior methods that use single-stage, all-to-all attention, which scales poorly as view count or image resolution increases.
  3. High-Resolution Guidance via Token Uplifting: At every update layer, a “token uplifting” mechanism expands the low-resolution viewpoint token set to finer query tokens before cross-attention, allowing high-frequency details encoded in image patches to gradually permeate the 3D scene representation across iterations—even without directly attending to all high-resolution tokens.

The iterative update step at layer $l$ is expressed as:

$$
\begin{aligned}
\tilde{V}_i^{(l-1)} &= \mathrm{cross\text{-}attn}^{(l)}\!\left(V_i^{(l-1)},\, S_i\right), \\
\left\{ V_i^{(l)} \right\}_{i=1}^{N} &= \mathrm{self\text{-}attn}^{(l)}\!\left(\left\{ \tilde{V}_i^{(l-1)} \right\}_{i=1}^{N}\right),
\end{aligned}
$$

where $V_i^{(l)}$ denotes the viewpoint tokens for the $i$-th input view and $S_i$ the (high-resolution) per-image tokens.
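The following PyTorch sketch illustrates how one such two-stage update layer could be organized. It is a minimal illustration under assumed names and dimensions (e.g., `TwoStageUpdateLayer`, `dim`, the use of `nn.MultiheadAttention`), not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoStageUpdateLayer(nn.Module):
    """One iLRM-style update: per-view cross-attention, then global self-attention.

    Illustrative sketch only; module choices and dimensions are assumptions.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, view_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens:  (N, T, dim)  compact viewpoint tokens, one set per view
        # image_tokens: (N, HW, dim) high-resolution per-image tokens (RGB + Pluecker features)
        N, T, dim = view_tokens.shape

        # Stage 1: per-view cross-attention. Each view's tokens attend only to
        # that view's own image tokens, so the cost is linear in the number of views.
        q = self.norm_q(view_tokens)
        delta, _ = self.cross_attn(q, image_tokens, image_tokens)
        view_tokens = view_tokens + delta

        # Stage 2: global self-attention over the concatenated compact tokens of
        # all views, exchanging information across viewpoints over only N*T tokens.
        flat = view_tokens.reshape(1, N * T, dim)
        x = self.norm_s(flat)
        delta, _ = self.self_attn(x, x, x)
        return (flat + delta).reshape(N, T, dim)
```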

2. Iterative Refinement as Feed-Forward Optimization

Unlike one-shot approaches, iLRM employs a stack of update layers, each performing a refinement analogous to a single iteration of gradient-based optimization. The initialization derives viewpoint embeddings from camera pose encodings (e.g., Plücker coordinates), which are then successively updated based on image evidence and inter-view consistency.

Each update is designed to aggregate and propagate both shared (global structure across views) and unique (specific view-dependent detail) information. The cumulative effect across update layers is a compact but expressive 3D Gaussian set suitable for explicit scene rendering and downstream 3D processing. This approach mimics iterative optimization steps but is executed in a fully feed-forward, differentiable fashion, supporting rapid inference.
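A minimal sketch of the overall feed-forward loop, reusing the `TwoStageUpdateLayer` sketch above, could look as follows. The Plücker-ray embedding, the layer count, and the Gaussian decoding head (14 parameters per Gaussian, e.g., position, scale, rotation, opacity, color) are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ILRMSketch(nn.Module):
    """Hypothetical outline of the iterative refinement stack (not the official code)."""
    def __init__(self, dim: int = 256, num_layers: int = 8, gaussian_params: int = 14):
        super().__init__()
        # Initialize viewpoint tokens from camera pose encodings (6-D Pluecker rays).
        self.init_embed = nn.Linear(6, dim)
        self.layers = nn.ModuleList([TwoStageUpdateLayer(dim) for _ in range(num_layers)])
        # Decode each refined token into explicit 3D Gaussian parameters.
        self.to_gaussians = nn.Linear(dim, gaussian_params)

    def forward(self, pluecker_rays: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # pluecker_rays: (N, T, 6)    one camera-ray encoding per viewpoint token
        # image_tokens:  (N, HW, dim) patchified RGB + pose features per view
        view_tokens = self.init_embed(pluecker_rays)   # initialization from camera poses
        for layer in self.layers:                      # feed-forward "optimization" steps
            view_tokens = layer(view_tokens, image_tokens)
        return self.to_gaussians(view_tokens)          # (N, T, gaussian_params)
```

Each pass through the stack plays the role of one refinement iteration, but all parameters are learned and the whole loop runs in a single forward pass.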

3. Comparison with Other State-of-the-Art Reconstruction Models

iLRM demonstrates empirically superior performance relative to methods such as GS-LRM and DepthSplat. Quantitatively, it improves PSNR by about 3 dB on the RealEstate10K dataset and 4 dB on DL3DV when matched for inference time, and achieves faster inference (e.g., 0.028 s vs. 0.065 s on RealEstate10K). Qualitative comparisons in the original work show sharper reconstructions with fewer artifacts.

The primary driver of these improvements is the ability to handle larger numbers of input views (enabled by attention decoupling) with finer high-resolution detail (enabled by token uplifting), while keeping the computational cost, in both memory and floating-point operations, at a manageable level as scene complexity and image resolution increase.

| Method | Inference Time (s) | PSNR (RE10K) | Scalability |
|---|---|---|---|
| GS-LRM | >0.2 | lower | Limited by token count |
| DepthSplat | ~0.065 | lower | Cost grows with views |
| iLRM | 0.028 | higher | Handles more views efficiently |

4. Scalability and Computational Strategies

iLRM’s architecture is explicitly optimized for scalability:

  • View Decoupling: The scene representation is kept as a fixed or slowly growing set of viewpoint tokens regardless of the number or size of input images, so memory consumption does not scale linearly with the view count or image resolution.
  • Two-Stage Attention: Local (per-view) cross-attention is linear in the number of views and avoids the $O(N^2)$ scaling of global attention. The subsequent global self-attention is performed only on the compact set of viewpoint tokens.
  • Token Uplifting and Structured Sampling: Each low-res viewpoint token can be expanded (uplifted) before cross-attention, infusing high-frequency detail selectively. Structured mini-batch sampling for cross-attention further reduces the computational burden per update layer.

This design allows iLRM to aggregate information from a greater number of images and higher resolutions than models with all-to-all attention, without quadratic growth in compute or memory.
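To make this concrete, a rough back-of-the-envelope comparison of attention pair counts is sketched below; all token counts are purely illustrative choices, not figures from the paper.

```python
# Illustrative attention-cost comparison; all token counts are made-up examples.
N = 32      # input views
HW = 1024   # high-resolution image tokens per view (e.g., 32 x 32 patches)
T = 64      # compact viewpoint tokens per view

# All-to-all attention over every image token of every view: quadratic in N * HW.
all_to_all_pairs = (N * HW) ** 2

# Two-stage attention: per-view cross-attention (N * T * HW pairs) plus
# global self-attention over the compact viewpoint tokens only ((N * T)^2 pairs).
two_stage_pairs = N * T * HW + (N * T) ** 2

print(f"all-to-all pairs: {all_to_all_pairs:,}")  # 1,073,741,824
print(f"two-stage pairs:  {two_stage_pairs:,}")   # 6,291,456
```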

5. Experimental Evaluation and Practical Applications

iLRM is evaluated on benchmark datasets such as RealEstate10K (which includes both indoor and outdoor video-derived scenes) and DL3DV, as well as cross-dataset tests on ACID and others. Evaluations cover not only standard metrics (PSNR, SSIM, LPIPS) but also qualitative aspects (visual sharpness, artifact suppression, and spatial consistency).

The reported speed and accuracy make iLRM suitable for applications in:

  • Real-time scene reconstruction and novel view synthesis for AR/VR and visual effects pipelines.
  • Robotics, where rapid and accurate 3D scene understanding is essential.
  • Photogrammetry and mesh/point cloud generation for large, complex environments.

Its rapid feed-forward inference, ability to scale with view count, and intrinsic high-resolution capability distinguish it within the explicit 3D Gaussian reconstruction paradigm.

6. Significance and Outlook

iLRM represents a shift in large-scale 3D reconstruction: from brute-force transformer designs with global attention, where scalability is limited by quadratic cost, to an architecture that preserves or improves fidelity while making it computationally practical to use many high-resolution images as input.

The separation between scene representation and per-image features, the two-stage attention framework, and the iterative token refinement are the central mechanisms supporting this progress. This makes iLRM a key reference for future research oriented toward efficient, explicit, and high-quality feed-forward 3D scene modeling (Kang et al., 31 Jul 2025).
