S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Published 11 Mar 2026 in cs.CV | (2603.10893v1)

Abstract: Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.


Authors (8)

Summary

The paper introduces a hybrid pipeline that combines VFMs for robust point cloud estimation with a diffusion-based artifact fixer to enhance 3D reconstruction quality.
It leverages random sample drop and weighted gradient techniques during 3DGS optimization to maintain both structural and textural integrity under sparse input conditions.
Empirical results show significant improvements in PSNR and SSIM over baseline methods across multiple datasets, demonstrating the method’s robustness and practical impact.

Sparse to Dense Lifting (S2D): High-Quality 3DGS Reconstruction with Minimal Inputs

Introduction and Motivation

The S2D framework addresses persistent limitations in explicit 3D reconstruction, particularly with 3D Gaussian Splatting (3DGS), which suffers severe degradation under sparse input conditions. Standard 3DGS pipelines require many images to avoid visual artifacts during view interpolation and extrapolation, which is impractical in many real-world scenarios due to the increased capture and computational burdens. Previous approaches, including feed-forward models and generative novel-view synthesis (NVS) strategies, have failed to yield artifact-free, photorealistic, and 3D-consistent results in the sparse input regime. S2D proposes a hybrid explicit-generative pipeline that leverages both visual geometry foundation models (VFMs) for robust 3D point cloud estimation, and a novel one-step diffusion-based artifact fixer, to achieve high-fidelity 3DGS reconstruction from minimal views.

S2D Pipeline: Architecture and Methodology

The S2D framework consists of two key components: (i) sparse-to-dense lifting via a diffusion-based artifact fixer, and (ii) a tailored 3DGS scene optimization strategy designed for sparse and dense hybrid supervision. Figure 1 details the high-level architecture and artifact fixer model design.

Figure 1: S2D reconstruction pipeline and model architecture of artifact fixer.

Point Cloud Estimation and Novel View Rendering

Given any number of input images and their associated camera poses, S2D employs a state-of-the-art visual geometry foundation model, such as $\pi^3$ or MapAnything, to reconstruct a point cloud for the scene. These dense point clouds are naturally robust to sparse input because they are view-independent and structurally consistent, although they lack photorealism due to aliasing and quantization errors. Rendering the point cloud from a target viewpoint then provides the structural guidance for novel view synthesis.
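As a rough illustration of how a point cloud becomes a novel-view image, the sketch below projects colored points through a pinhole camera and keeps the nearest point per pixel. It assumes the points are already expressed in the target camera frame with +z pointing forward; the name `render_points` and all of its parameters are hypothetical and are not taken from the paper or from any VFM API.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def render_points(
    points: List[Point],
    colors: List[Color],
    focal: float,
    width: int,
    height: int,
) -> List[List[Optional[Color]]]:
    """Splat colored points into an image with a simple z-buffer.

    Assumption: points are in camera coordinates (+z forward) and the
    principal point sits at the image center. Uncovered pixels stay None.
    """
    cx, cy = width / 2.0, height / 2.0
    image: List[List[Optional[Color]]] = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), color in zip(points, colors):
        if z <= 0.0:
            continue  # point is behind the camera
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        # Keep only the closest point that lands on each pixel.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = color
    return image


red, green, blue = (255, 0, 0), (0, 255, 0), (0, 0, 255)
demo = render_points(
    points=[(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)],
    colors=[red, green, blue],
    focal=2.0,
    width=4,
    height=4,
)
```

Pixels left as `None` are holes that no point covers, which is one reason raw point cloud renderings look non-photorealistic and need a later fixing stage.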

One-Step Diffusion-Based Artifact Fixer

S2D introduces a one-step latent diffusion model as an artifact fixer that processes rendered novel views. The fixer is guided by both the point cloud rendering from the target viewpoint (structural guidance) and a nearby reference view image (textural guidance). The mixing module shown in Figure 1 combines DINO features extracted from both guides, using cross-attention to produce a mixed latent input for efficient single-step denoising. Training combines several loss terms (DINO, SSIM, LPIPS, L2, and adversarial) to balance perceptual and structural fidelity.
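To make the cross-attention mixing step more concrete, here is a minimal single-head sketch in plain Python. It is only an illustration of the general technique: it assumes that tokens from the point cloud rendering act as queries and tokens from the reference view act as keys and values, and it omits the learned projection matrices. The summary does not specify the actual roles, dimensions, or layers used in S2D, so every name here is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
) -> List[Vector]:
    """Scaled dot-product cross-attention without learned projections.

    Each query token receives a convex combination of the value tokens,
    weighted by its similarity to the matching key tokens.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    mixed = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return mixed


# Hypothetical toy features: one structural token, two reference tokens.
structure_tokens = [[10.0, 0.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
mixed_tokens = cross_attention(structure_tokens, reference_tokens, reference_tokens)
```

In this toy setup the structural token aligns with the first reference token, so the mixed output is dominated by that token's features.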

Reconstruction Strategy: Random Sample Drop and Weighted Gradient

S2D augments the 3DGS optimization loop to avoid overfitting and inconsistency in regions unconstrained by input images. Novel views produced by the artifact fixer are added to the supervision set alongside the original inputs. Because novel views far outnumber the inputs, a random sample drop scheme controlled by the parameter $\alpha$ (the reference-to-novel sample ratio) keeps the two sources balanced and the optimization stable (Figure 2). Pixelwise gradient weighting, modulated by a confidence map derived from point cloud visibility and bounded below by a minimum novel-view weight $\beta$, reduces the impact of regions likely to contain unresolved artifacts.
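One way to picture random sample drop is as a per-round subsampling of the novel views. The sketch below keeps every reference view and randomly retains just enough novel views for the reference-to-novel ratio to be roughly $\alpha$. This reading of $\alpha$, the rounding rule, and the function name `sample_supervision` are assumptions made for illustration; the paper may schedule the drop differently.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_supervision(
    references: Sequence[T],
    novel: Sequence[T],
    alpha: float,
    rng: random.Random,
) -> List[T]:
    """Build one round of supervision views.

    Assumption: alpha = (#reference views) / (#novel views kept), so a
    smaller alpha keeps more novel views. All references are always kept.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    keep = min(len(novel), max(1, round(len(references) / alpha)))
    kept_novel = rng.sample(list(novel), keep)  # drop the rest for this round
    batch = list(references) + kept_novel
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
refs = ["ref_0", "ref_1"]
novel_views = [f"novel_{i}" for i in range(20)]
round_views = sample_supervision(refs, novel_views, alpha=0.5, rng=rng)
```

With 2 references and `alpha=0.5`, each round keeps 4 of the 20 novel views, so the sparse inputs are not drowned out by generated views.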

Figure 2: Evaluation on parameter $\alpha$ (references-to-novel sample ratio) and $\beta$ (minimum weight for novel views) for stable optimization.
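The weighted gradient idea can be sketched as a per-pixel weight on the photometric loss, since scaling a pixel's loss term scales its gradient by the same factor. The mapping from confidence to weight used below, a linear ramp from $\beta$ to 1, is an assumption chosen for simplicity; the only property taken from the text is that $\beta$ is the minimum weight for novel views. All function names are hypothetical.

```python
from typing import List, Tuple


def pixel_weights(confidence: List[float], beta: float) -> List[float]:
    """Map confidences in [0, 1] to weights in [beta, 1] (assumed linear ramp)."""
    return [beta + (1.0 - beta) * min(1.0, max(0.0, c)) for c in confidence]


def weighted_l2(
    pred: List[float],
    target: List[float],
    weights: List[float],
) -> Tuple[float, List[float]]:
    """Weighted mean squared error and its gradient with respect to pred."""
    n = len(pred)
    loss = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights)) / n
    grad = [2.0 * w * (p - t) / n for p, t, w in zip(pred, target, weights)]
    return loss, grad


# Two pixels with the same error: one visible in the point cloud, one not.
weights = pixel_weights(confidence=[1.0, 0.0], beta=0.2)
loss, grad = weighted_l2(pred=[1.0, 1.0], target=[0.0, 0.0], weights=weights)
```

Both pixels have the same error here, but the low-confidence pixel contributes a gradient five times smaller, so an unresolved artifact in that region pulls less on the Gaussians.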

This reconstruction process produces a 3DGS scene that supports extended, artifact-free camera trajectories from extremely sparse inputs.

Empirical Results

S2D’s performance is systematically evaluated across varied datasets (3DOVS, MIP360, DL3DV-960, RE10K, and Waymo), spanning a range of scene complexity and input sparsity.

Sparse Reconstruction Quality

In extremely sparse-view settings (as few as one or two images), S2D exhibits substantial gains in both photometric and perceptual metrics compared to baseline 3DGS and recent feed-forward or generative NVS methods. For instance, on 3DOVS with a single input image, S2D achieves a PSNR of 21.41 and SSIM of 0.77, compared to 10.12 and 0.30, respectively, for baseline 3DGS. The improvement is consistent across datasets and input densities (Figure 3, Figure 4).
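For readers less familiar with PSNR, the short sketch below computes it from mean squared error using the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ and converts a PSNR gap in dB into the corresponding MSE ratio. The helper names are illustrative, and pixel values are assumed to lie in $[0, 1]$.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio(psnr_gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (psnr_gain_db / 10.0)


# PSNR gap reported for single-image 3DOVS: 21.41 dB versus 10.12 dB.
gap_db = 21.41 - 10.12
ratio = mse_ratio(gap_db)
```

Because PSNR is logarithmic, the reported gap of about 11.3 dB corresponds to a mean squared error roughly 13 times lower than that of the baseline.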

Figure 3: Qualitative results in different situations—including in-the-wild and demanding sparse input scenarios—demonstrate the stability and photorealism of S2D reconstructions, even with severe input reduction.

Figure 4: Reconstruction quality versus input density, showing S2D's consistent improvement relative to conventional 3DGS, particularly in the low-input regime.

Ablation and Robustness

The ablation studies attribute S2D’s performance to its dual-guidance fixing and hybrid supervision. Artifact removal is strongest when both point cloud and reference texture are integrated through the mixing module (Figure 5), and reconstruction is most consistent when optimization uses random sample drop together with weighted gradients (Figure 6).

Figure 5: Ablation on artifact removal; dual-guidance mixing with DINO features yields maximal artifact suppression and structural restoration.

Figure 6: Ablations on reconstruction strategy—weighted gradient (WG) and random sample drop (RSD) are key to minimizing inconsistency and preserving high-frequency textures.

Training the artifact fixer on synthetic corruptions of varied intensity, and on diverse perturbations of 3DGS renderings, improves its resilience to real-world degradation (Figure 7).
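A toy version of intensity-controlled corruption is sketched below. It is purely illustrative: the summary does not describe the actual corruption types, so this example assumes two simple ones (randomly zeroed pixels to mimic holes, plus Gaussian noise) and a single `intensity` knob in $[0, 1]$. The function name, the noise scale of 0.1, and the pairing of corrupted and clean images are all hypothetical.

```python
import random
from typing import List

Image = List[List[float]]


def corrupt(image: Image, intensity: float, rng: random.Random) -> Image:
    """Return a corrupted copy of a grayscale image with values in [0, 1].

    Assumed corruptions: each pixel becomes a hole with probability
    `intensity`; surviving pixels get Gaussian noise scaled by `intensity`.
    """
    corrupted = []
    for row in image:
        new_row = []
        for value in row:
            if rng.random() < intensity:
                new_row.append(0.0)  # simulated hole
            else:
                noisy = value + rng.gauss(0.0, 0.1 * intensity)
                new_row.append(min(1.0, max(0.0, noisy)))
        corrupted.append(new_row)
    return corrupted


rng = random.Random(0)
clean = [[0.2, 0.4], [0.6, 0.8]]
# Training pairs at several intensities: (corrupted input, clean target).
pairs = [(corrupt(clean, level, rng), clean) for level in (0.0, 0.5, 1.0)]
```

Sweeping the intensity gives the fixer examples ranging from nearly clean renderings to heavily damaged ones, which is the general idea behind training at varied artifact levels.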

Figure 7: Training data with different artifact intensities and perturbation strategies for improving artifact fixer's robustness.

Domain Applications

On driving datasets (Waymo, nuScenes), S2D outperforms DIFIX, StreetCrafter, and the video-diffusion-based SEVA in both structural quality and FID, and shows strong scene-level consistency when extrapolating views over long trajectories (Figure 8).

Figure 8: DiffusionGS results and driving scene comparisons—S2D provides perceptually superior reconstructions in complex autonomous driving scenarios.

Failure Modes and Limitations

When input images are both extremely sparse and low in discriminative texture, the underlying VFM may yield fragmentary point clouds, which degrades guidance quality (Figure 9). S2D is modular, however, so the VFM can be swapped out as better models become available.

Figure 9: Example of a point cloud under extremely sparse, low-texture conditions, illustrating a guidance failure case.

Implications and Outlook

S2D demonstrates that explicit 3DGS representations can benefit substantially from generative priors, particularly via strong geometric and textural guidance provided by state-of-the-art diffusion models and visual geometry transformers. The methodology sets a new benchmark for minimal-input 3D reconstruction, lowering data requirements and increasing the practicality of explicit scene representations for simulation, robotics, and VLS systems. The framework is modular enough to integrate improved backbone architectures, so it stands to benefit from future advances in geometric foundation models and denoising networks. Future work may address residual failure cases by integrating learned spatial priors for more robust point cloud estimation and by extending the cross-attention mechanisms for stronger structure–texture disentanglement.

Conclusion

S2D establishes a new standard for minimal-input, high-fidelity 3DGS reconstruction by bridging foundation model-derived point clouds and diffusion-based artifact correction, underpinned by a robust hybrid supervision strategy. The approach is empirically validated to surpass prior SOTA in extreme sparse-view settings, generalizes across scene types, and maintains competitive inference efficiency. The practical implication is the substantial broadening of 3DGS use cases under minimal data constraints, with potential to impact autonomous systems, digital twins, and immersive environment construction (2603.10893).
