ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation (2510.08551v1)

Published 9 Oct 2025 in cs.CV

Abstract: On-the-fly 3D reconstruction from monocular image sequences is a long-standing challenge in computer vision, critical for applications such as real-to-sim, AR/VR, and robotics. Existing methods face a major tradeoff: per-scene optimization yields high fidelity but is computationally expensive, whereas feed-forward foundation models enable real-time inference but struggle with accuracy and robustness. In this work, we propose ARTDECO, a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. ARTDECO uses 3D foundation models for pose estimation and point prediction, coupled with a Gaussian decoder that transforms multi-scale features into structured 3D Gaussians. To sustain both fidelity and efficiency at scale, we design a hierarchical Gaussian representation with a LoD-aware rendering strategy, which improves rendering fidelity while reducing redundancy. Experiments on eight diverse indoor and outdoor benchmarks show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization, providing a practical path toward on-the-fly digitization of real-world environments with both accurate geometry and high visual fidelity. Explore more demos on our project page: https://city-super.github.io/artdeco/.

Summary

  • The paper introduces a unified framework that combines feed-forward 3D models with SLAM for efficient monocular 3D reconstruction.
  • It employs a hierarchical Gaussian scene representation with LoD control to optimize rendering fidelity and computational efficiency.
  • Experimental evaluations demonstrate superior pose accuracy and reconstruction quality across diverse indoor and outdoor benchmarks.

ARTDECO: Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation

ARTDECO introduces a unified framework for real-time, high-fidelity 3D reconstruction from monocular image sequences, integrating feed-forward 3D foundation models with SLAM-based optimization and a hierarchical Gaussian scene representation. The system is designed to address the trade-offs between accuracy, robustness, and efficiency that have historically limited the practicality of monocular 3D reconstruction for applications in AR/VR, robotics, and real-to-sim content creation.

Figure 1: ARTDECO delivers high-fidelity, interactive 3D reconstruction from monocular images, combining efficiency with robustness across indoor and outdoor scenes.


System Architecture

ARTDECO's pipeline consists of three tightly integrated modules: frontend, backend, and mapping. The frontend leverages feed-forward models (MASt3R for matching and π³ for loop closure) to estimate relative poses and categorize frames as keyframes, mapper frames, or common frames. The backend refines keyframe poses via loop closure detection and global bundle adjustment, ensuring multi-view consistency and reducing drift. The mapping module incrementally reconstructs the scene using a hierarchical semi-implicit Gaussian representation, with level-of-detail (LoD) control for scalable rendering.

Figure 2: Frontend and backend modules. The frontend aligns incoming frames and classifies them; the backend performs loop detection and global pose optimization.
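The paper does not spell out the frontend's triage criteria, but the keyframe/mapper/common split can be illustrated with a minimal, hypothetical motion-based rule: large inter-frame motion promotes a frame to keyframe, moderate motion to mapper frame, and small motion leaves it a common frame used only for refinement. All thresholds and the function name below are assumptions for illustration, not the paper's actual logic.

```python
def classify_frame(rel_translation, rel_rotation_deg,
                   kf_trans=0.3, kf_rot=15.0,
                   mapper_trans=0.1, mapper_rot=5.0):
    """Hypothetical frame triage by motion relative to the last keyframe.

    Thresholds are illustrative: a frame with a large baseline or rotation
    becomes a keyframe, moderate motion yields a mapper frame, and small
    motion yields a common frame.
    """
    if rel_translation > kf_trans or rel_rotation_deg > kf_rot:
        return "keyframe"
    if rel_translation > mapper_trans or rel_rotation_deg > mapper_rot:
        return "mapper"
    return "common"

print(classify_frame(0.5, 2.0))   # large baseline -> keyframe
print(classify_frame(0.15, 1.0))  # moderate motion -> mapper
print(classify_frame(0.02, 0.5))  # tiny motion -> common
```

In a real frontend the decision would also account for covisibility and match quality; this sketch only conveys the three-way routing that the rest of the pipeline depends on.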


Hierarchical Gaussian Scene Representation

ARTDECO employs a hierarchical Gaussian representation, where each primitive encodes position, spherical harmonics, scale, opacity, and local features. The mapping module uses multi-resolution Laplacian of Gaussian (LoG) analysis to identify regions requiring refinement, inserting new Gaussians only where necessary. LoD-aware rendering is achieved by assigning each Gaussian a distance-dependent parameter d_max, enabling efficient, scale-consistent rendering and suppressing flickering artifacts.

Figure 3: Mapping process. Keyframes and mapper frames add new Gaussians; LoG analysis guides refinement; common frames contribute via gradient-based updates.
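The effect of the per-Gaussian d_max parameter can be sketched as a simple visibility cull: fine-detail Gaussians carry small d_max values and drop out at range, while coarse ones remain visible, so distant views render only the coarse level. The function name and the concrete values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lod_visible_mask(gaussian_centers, d_max, camera_pos):
    """Select Gaussians whose LoD budget covers the current viewing distance.

    Each Gaussian carries a hypothetical d_max: the farthest camera distance
    at which it should still be rendered. Returning a boolean mask lets a
    renderer skip fine-detail primitives when the camera is far away.
    """
    dist = np.linalg.norm(gaussian_centers - camera_pos, axis=1)
    return dist <= d_max

# Toy scene: two coarse Gaussians (large d_max) and two fine ones (small d_max).
centers = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0]])
d_max = np.array([100.0, 100.0, 2.0, 2.0])

near_cam = np.array([0.0, 0.0, 1.0])
far_cam = np.array([0.0, 0.0, 50.0])

print(lod_visible_mask(centers, d_max, near_cam).sum())  # all 4 rendered up close
print(lod_visible_mask(centers, d_max, far_cam).sum())   # only the 2 coarse ones
```

Because the same coarse Gaussians stay visible as the camera recedes, transitions between levels are gradual rather than popping, which is consistent with the paper's claim of suppressed flickering.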


Integration of Feed-Forward Models and SLAM

ARTDECO's key innovation is the modular integration of feed-forward 3D foundation models for pose estimation, loop closure, and dense point prediction. MASt3R provides robust two-view correspondences and metric pointmaps, while π³ enhances loop closure detection with multi-frame geometric priors. Pose estimation is performed via weighted Gauss–Newton optimization, with joint refinement of focal length when intrinsics are unknown. The backend's factor graph connects keyframes based on ASMK similarity and geometric consistency, followed by global bundle adjustment in the Sim(3) group.
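The weighted Gauss–Newton step can be illustrated on a toy problem: estimating a 2D translation from weighted point correspondences, where low weights stand in for low-confidence matches. This is a minimal sketch of the general scheme, not the paper's full pose solver (which operates on SE(3) poses and jointly refines focal length).

```python
import numpy as np

def weighted_gauss_newton(src, dst, weights, iters=5):
    """Toy weighted Gauss-Newton: find the 2D translation t minimizing
    sum_i w_i * ||src_i + t - dst_i||^2.

    Each iteration solves the normal equations (J^T W J) dt = -J^T W r.
    For pure translation the per-point Jacobian is the identity, so the
    problem is linear, but the loop mirrors the general iterative scheme.
    """
    t = np.zeros(2)
    W = np.diag(np.repeat(weights, 2))      # per-residual weights
    J = np.tile(np.eye(2), (len(src), 1))   # stacked d r_i / d t
    for _ in range(iters):
        r = (src + t - dst).ravel()         # stacked residuals
        dt = np.linalg.solve(J.T @ W @ J, -(J.T @ W @ r))
        t = t + dt
    return t

# Three correspondences; the third is a gross outlier that a tiny weight
# suppresses, standing in for a low-confidence match.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = src + np.array([2.0, -1.0])
dst[2] += 10.0
weights = np.array([1.0, 1.0, 1e-6])

t = weighted_gauss_newton(src, dst, weights)
print(np.round(t, 3))
```

The recovered translation stays close to the true (2, -1) despite the outlier, which is the role weighting plays when correspondences come from a learned matcher with varying confidence.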


Training and Optimization Strategy

The system operates in a streaming fashion, initializing new Gaussians and optimizing the scene for K iterations upon arrival of keyframes or mapper frames. Common frames trigger only K/2 iterations, focusing on refinement. Training frames are sampled with probability 0.2 from the current frame and 0.8 from past frames to mitigate local overfitting. After streaming, a global optimization is performed, with higher sampling probabilities for less-updated frames. Camera poses and Gaussian parameters are jointly optimized, propagating gradients to both positions and rotations.
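The 0.2/0.8 sampling schedule can be sketched in a few lines. The uniform choice over past frames here is an assumption; per the paper, the later global phase instead biases sampling toward less-updated frames.

```python
import random

def sample_training_frame(current_frame, past_frames, p_current=0.2, rng=random):
    """Pick the frame to optimize against, following the streaming schedule:
    the newest frame with probability p_current (0.2 in the paper), otherwise
    a past frame, which keeps the map from overfitting to the latest view.
    """
    if past_frames and rng.random() >= p_current:
        return rng.choice(past_frames)
    return current_frame

# Sanity check of the split over many draws.
rng = random.Random(42)
counts = {"current": 0, "past": 0}
for _ in range(1000):
    f = sample_training_frame("frame_t", ["frame_0", "frame_1"], rng=rng)
    counts["current" if f == "frame_t" else "past"] += 1
print(counts)  # roughly a 200 / 800 split
```

When no past frames exist yet (the very first frame), the sketch falls back to the current frame, matching the only sensible streaming behavior.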


Experimental Results

ARTDECO is evaluated on eight diverse indoor and outdoor benchmarks, including TUM, ScanNet, ScanNet++, VR-NeRF, KITTI, Waymo, Fast-LIVO2, and MatrixCity. Reconstruction quality is measured by PSNR, SSIM, and LPIPS; pose accuracy by ATE RMSE; and efficiency by FPS. ARTDECO consistently achieves state-of-the-art results, outperforming both SLAM-based and feed-forward baselines in rendering fidelity, localization accuracy, and runtime.

Figure 4: Qualitative comparisons against popular on-the-fly reconstruction baselines. ARTDECO preserves high-quality rendering details in complex environments.

Figure 5: More qualitative reconstruction results.

Figure 6: Qualitative comparison of trajectories across different methods on the KITTI dataset.

Figure 7: Qualitative comparison of trajectories across different methods on the ScanNet++ dataset.

Figure 8: Qualitative comparison of trajectories across different methods on the TUM dataset.

Numerical results highlight ARTDECO's superiority:

  • On ScanNet++, ARTDECO achieves PSNR 29.12, SSIM 0.918, LPIPS 0.167, and ATE RMSE 0.018, outperforming all baselines.
  • On KITTI, ARTDECO achieves PSNR 23.17, SSIM 0.765, LPIPS 0.299, and ATE RMSE 1.36, with robust tracking and rendering.
  • Runtime is competitive: ARTDECO runs faster than all baselines except OnTheFly-NVS, a gap offset by its substantially higher pose accuracy.

Ablation studies confirm the importance of hierarchical LoD, structural Gaussians, and the hybrid frontend/backend design. Disabling loop closure or using dense keyframes degrades localization, while omitting LoD or mapper frames reduces reconstruction fidelity.


Limitations and Future Directions

ARTDECO's reliance on feed-forward 3D foundation models introduces sensitivity to noise, blur, and out-of-distribution inputs. The system assumes consistent illumination and sufficient parallax; violations can cause drift or artifacts. Future work should focus on uncertainty estimation, adaptive model selection, and stronger priors to enhance generalization and reliability in unconstrained real-world scenarios.


Conclusion

ARTDECO demonstrates that combining feed-forward priors with structured Gaussian representations and SLAM-based optimization enables efficient, robust, and high-fidelity on-the-fly 3D reconstruction from monocular image sequences. The framework achieves near per-scene optimization quality with SLAM-level efficiency and feed-forward robustness, providing a practical solution for large-scale real-to-sim pipelines. ARTDECO's modular design and hierarchical scene representation offer a promising foundation for future research in scalable, interactive 3D digitization for AR/VR, robotics, and digital twins.
