3D Reconstruction from Inconsistent Views
- The paper presents a novel non-rigid, frame-to-model ICP method that aligns disparate pointclouds using MLP-generated SE(3) twists to correct pose and geometric drift.
- Key techniques include a global optimization stage and inverse deformation fields for non-rigid-aware differentiable rendering, ensuring both photometric and geometric consistency.
- The approach integrates generative priors with classical multi-view methods to reconstruct metrically accurate 3D models, enabling applications in VR, robotics, and scene understanding.
World reconstruction from inconsistent views addresses the challenge of generating metrically and visually coherent 3D models from a collection of images or video frames that are inconsistent due to factors such as stochastic generative drift, sparse or unposed camera observations, variable illumination, and lack of geometric overlap. Inconsistent views arise in outputs of generative video models, sparse view capture without reliable calibration, or even in classical scenarios where scene occlusion, dynamic content, or illumination changes invalidate traditional photometric and geometric consistency assumptions. Solutions require robust mechanisms to align, regularize, and fuse these disparate observations into unified, explorable 3D worlds suitable for downstream applications in virtual reality, robotics, and scene understanding (Höllein et al., 17 Mar 2026).
1. The Problem of Inconsistent Views in 3D World Reconstruction
The central difficulty in world reconstruction from inconsistent views is the violation of classical multi-view constraints. In conventional structure-from-motion (SfM) and multi-view stereo (MVS), dense spatial overlap, accurate camera calibration, and photometric consistency are prerequisites for high-fidelity 3D reconstruction. In generative and real-world capture settings, however, input views may display:
- Inconsistent geometry: Generative models (e.g., video diffusion models) produce temporally and spatially smooth but not strictly 3D-consistent frames. Surfaces may drift, objects may appear/disappear, or geometry may be misaligned across views (Höllein et al., 17 Mar 2026).
- Unknown or noisy camera poses: Many recent pipelines operate without ground-truth extrinsics or intrinsics, increasing ambiguity in correspondence and regularization (Wu et al., 2023, Zhu et al., 6 Aug 2025, Jin et al., 2021, Huang et al., 2023).
- Sparse, non-overlapping observations: Limited baseline, occlusions, or insufficient coverage undermine surface triangulation and matching (Vora et al., 2023, Jin et al., 2021).
This challenges all stages of the classical pipeline: feature matching, triangulation, surface regularization, and rendering.
2. Geometric Alignment: Non-Rigid ICP and Global Optimization
The method introduced in "World Reconstruction From Inconsistent Views" (Höllein et al., 17 Mar 2026) addresses frame-wise geometric inconsistencies by first lifting each frame to a dense pointcloud with a geometric foundation model (GFM), typically a pretrained monocular depth estimator with per-pixel confidences. After unprojection using (possibly predicted) camera intrinsics and extrinsics, the resulting multi-frame pointclouds are misaligned, reflecting both pose and content drift.
To establish a consistent global reference, the pipeline applies a non-rigid, iterative frame-to-model ICP procedure. Here, per-frame deformations are parameterized by small multilayer perceptrons (MLPs) that map 3D points to SE(3) twists, capturing both rigid and local non-rigid misalignments. Each frame's partial pointcloud is aligned to an evolving "model" cloud via a coarse-to-fine nearest-neighbor and normal-based loss, augmented with colored-ICP, sparse correspondence constraints, and TV regularization on deformation fields.
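As a concrete illustration, the sketch below shows one way such a per-frame deformation field can be realized: a small MLP maps each 3D point to a 6-DoF se(3) twist, applied via Rodrigues' formula with the translation added directly (a small-twist approximation). The layer sizes, zero initialization, and absence of positional encoding are illustrative assumptions, not the paper's reported architecture.

```python
# Hypothetical per-frame deformation field: an MLP predicts a se(3) twist
# (rotation omega, translation v) for each 3D point; sizes are assumptions.
import torch
import torch.nn as nn

class TwistField(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # (omega, v): rotation axis-angle + translation
        )
        nn.init.zeros_(self.mlp[-1].weight)  # start at the identity warp
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        """Warp points (N, 3) by per-point SE(3) twists."""
        twist = self.mlp(pts)
        omega, v = twist[:, :3], twist[:, 3:]
        theta = omega.norm(dim=-1, keepdim=True).clamp(min=1e-8)
        axis = omega / theta
        cos, sin = torch.cos(theta), torch.sin(theta)
        # Rodrigues' rotation formula around the predicted per-point axis.
        rotated = (pts * cos
                   + torch.cross(axis, pts, dim=-1) * sin
                   + axis * (axis * pts).sum(-1, keepdim=True) * (1 - cos))
        return rotated + v  # small-twist approximation: translate after rotating
```

Zero-initializing the final layer makes the field start as the identity mapping, so early ICP iterations behave like rigid alignment before non-rigid corrections grow.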
Subsequently, a global optimization stage refines all camera and deformation parameters, enforcing zero-thickness surface consistency across all aligned pointclouds. This is realized by drawing explicit K-nearest neighbor correspondences across frames and penalizing global deviation from pre-alignment states.
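A minimal sketch of such a global consistency objective follows, assuming point-to-plane residuals over cross-frame K-nearest neighbors plus a quadratic drift penalty toward the pre-alignment state; the value of k and the weights are illustrative, not reported values.

```python
# Hedged sketch of the global refinement objective; k and weights are assumptions.
import torch

def global_consistency_loss(points, normals, points_pre, k: int = 4,
                            w_surf: float = 1.0, w_drift: float = 0.1):
    """points: (N, 3) union of aligned frame pointclouds; normals: (N, 3);
    points_pre: (N, 3) positions before global refinement."""
    with torch.no_grad():                          # correspondences fixed per step
        dist = torch.cdist(points, points)         # (N, N) pairwise distances
        dist.fill_diagonal_(float('inf'))          # exclude self-matches
        knn = dist.topk(k, largest=False).indices  # (N, k) neighbor indices
    diff = points[knn] - points[:, None, :]        # (N, k, 3) offsets to neighbors
    # Point-to-plane residual: neighbors should lie on the local tangent plane,
    # driving the fused cloud toward a zero-thickness surface.
    surf = (diff * normals[:, None, :]).sum(-1).pow(2).mean()
    drift = (points - points_pre).pow(2).sum(-1).mean()
    return w_surf * surf + w_drift * drift
```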
3. Inverse Deformation Fields and Non-Rigid-Aware Rendering Losses
Even after alignment, the reconstructed pointcloud may not explain or be explainable by each original, possibly inconsistent, input view. To address this, an "inverse deformation" field is learned: given a canonical pointcloud and an index to an original view, a learned MLP predicts an SE(3) transformation that warps points back into the coordinate frame and geometry of that specific input. Training of this field minimizes the discrepancy between the mapped canonical points and their observed camera-space positions, together with total variation spatial regularity.
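One plausible form of this view-conditioned field is sketched below: the twist MLP is conditioned on a learned per-view embedding, and training combines a data term against observed camera-space points with a TV smoothness term on neighboring twists. Embedding size, widths, and loss weights are assumptions rather than reported values.

```python
# Hypothetical inverse deformation field conditioned on a per-view embedding.
import torch
import torch.nn as nn

class InverseDeformation(nn.Module):
    def __init__(self, n_views: int, embed_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.view_embed = nn.Embedding(n_views, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # se(3) twist warping canonical -> view frame
        )

    def forward(self, canon_pts: torch.Tensor, view_idx: torch.Tensor):
        """canon_pts: (N, 3); view_idx: 0-dim long tensor selecting the view."""
        emb = self.view_embed(view_idx).expand(canon_pts.shape[0], -1)
        # Twists are applied with the same exponential map as the forward field.
        return self.mlp(torch.cat([canon_pts, emb], dim=-1))

def inverse_deform_loss(warped, observed, twists, neighbor_idx, w_tv: float = 0.01):
    """warped/observed: (N, 3) camera-space points; neighbor_idx: (N, k)."""
    data = (warped - observed).pow(2).sum(-1).mean()            # match observations
    tv = (twists[neighbor_idx] - twists[:, None]).abs().mean()  # TV spatial regularity
    return data + w_tv * tv
```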
This mapping is crucial for non-rigid-aware differentiable rendering (used with 2D Gaussian Splatting or other neural rendering backbones): for each view, the inverse deformations of the canonical Gaussians generate pseudo-measurements for image synthesis. Optimization then proceeds with a photometric loss (typically L1 and LPIPS) together with geometric depth and normal alignment regularizers, ensuring both rendering fidelity and surface sharpness.
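Under these assumptions, the combined rendering objective might be assembled as below, using the lpips package for the perceptual term; the weights are placeholders and the exact regularizer forms may differ from the paper's.

```python
# Sketch of a non-rigid-aware rendering loss: L1 + LPIPS photometric terms
# plus depth and normal alignment regularizers. Weights are illustrative.
import torch
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net='vgg')  # expects NCHW images in [-1, 1]

def rendering_loss(render, target, depth_r, depth_ref, normal_r, normal_ref,
                   w_lpips=0.2, w_depth=0.1, w_normal=0.05):
    """render/target: (1, 3, H, W); depth_*: (1, 1, H, W); normal_*: (1, 3, H, W) unit."""
    photo = (render - target).abs().mean()                 # L1 photometric term
    perc = perceptual(render, target).mean()               # LPIPS perceptual term
    depth = (depth_r - depth_ref).abs().mean()             # depth alignment
    normal = (1 - (normal_r * normal_ref).sum(1)).mean()   # cosine normal alignment
    return photo + w_lpips * perc + w_depth * depth + w_normal * normal
```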
4. Canonical Surface Parameterizations and Regularization Strategies
Robust 3D surface parameterizations under incomplete or inconsistent observations span several research directions:
- Explicit 3D Gaussian fields: Used as the canonical representation in (Höllein et al., 17 Mar 2026, Zhu et al., 6 Aug 2025, Vora et al., 2023), these enable direct optimization via differentiable rendering and compositing, leveraging surface regularizers like depth-normal consistency and local planarity.
- Signed Distance Functions (SDF): Employed in SC-NeuS and DiViNeT, SDFs enable extraction of zero-level-set surfaces while supporting gradient-based regularization (Eikonal, re-projection, patch NCC) for fine-grained alignment (Huang et al., 2023, Vora et al., 2023); an Eikonal sketch follows this list.
- Neural templates and learned anchors: DiViNeT (Vora et al., 2023) constrains SDF reconstruction with pre-learned neural template Gaussians, providing surface prior "anchors" that stabilize optimization even with very sparse input.
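For concreteness, the Eikonal regularizer referenced in the SDF item above penalizes deviation of the SDF's gradient norm from unity, keeping the zero level set a well-behaved surface; the uniform volume sampling below is an assumption.

```python
# Standard Eikonal regularizer for an SDF network; sampling scheme is assumed.
import torch

def eikonal_loss(sdf_net, bounds: float = 1.0, n_samples: int = 4096):
    pts = (torch.rand(n_samples, 3) * 2 - 1) * bounds  # uniform samples in a cube
    pts.requires_grad_(True)
    sdf = sdf_net(pts)                                  # (n_samples, 1) signed distances
    grad = torch.autograd.grad(sdf.sum(), pts, create_graph=True)[0]
    return (grad.norm(dim=-1) - 1.0).pow(2).mean()      # ||grad f|| should equal 1
```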
Multi-branch cross-view attention architectures (Surf3R (Zhu et al., 6 Aug 2025)) and learned priors from large-scale pretraining further enable feedforward surface prediction without pose supervision.
5. Handling Illumination, Texture, and Hallucination Artifacts
Illumination inconsistency presents a formidable obstacle, especially when reconstructing from generative model outputs or images captured under variable lighting. Solutions such as GS-I (Wang et al., 16 Mar 2025) employ CNN-based tone-mapping corrections and normal fusion mechanisms to address geometric drift caused by underexposed regions and lighting mismatches; normal compensation fuses single-view and multi-view normals to reduce bias.
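A hedged sketch of such confidence-weighted normal fusion is given below; the per-pixel confidence map (high where multi-view evidence is reliable, low in underexposed regions) is an assumption in the spirit of the normal compensation described above.

```python
# Illustrative confidence-weighted fusion of single-view and multi-view normals.
import torch
import torch.nn.functional as F

def fuse_normals(n_single, n_multi, conf):
    """n_single, n_multi: (H, W, 3) unit normals; conf: (H, W, 1) in [0, 1]."""
    fused = conf * n_multi + (1 - conf) * n_single  # blend by per-pixel confidence
    return F.normalize(fused, dim=-1)               # renormalize to unit length
```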
A persistent risk in reconstructing from generative video models is the literal inclusion of hallucinated content (or omission of real objects). The non-rigid alignment and backward-warp-based rendering loss improve fidelity but cannot distinguish between true dynamic scene changes and generative "hallucinations." Robust outlier detection and temporal consistency priors remain open areas for further advancement (Höllein et al., 17 Mar 2026).
6. Integrating Generative Priors, Diffusion, and Classical Reconstruction
Several approaches leverage learned generative priors, either from diffusion models or neural templates, to bridge gaps arising from view sparsity and inconsistencies:
- Diffusion model inversion: iFusion (Wu et al., 2023) utilizes pre-trained novel-view synthesis diffusion models not to synthesize but to estimate camera poses by inversion, pairing this with object-specific LoRA fine-tuning and downstream photometric or score-distillation-sampling (SDS) reconstruction. This procedure enables joint pose retrieval and image-driven 3D learning from as few as two unposed images, demonstrating robust improvements in pose recall and geometric accuracy; a conceptual sketch follows this list.
- Conditional multi-view diffusion with consistency correction: ReconViaGen (Chang et al., 27 Oct 2025) designs multi-scale attention-based conditioning mechanisms and rendering-aware velocity compensation inside the denoising loop, aligning diffusion outputs with photometric evidence at each step—a vital property for unbounded, large-scale world reconstructions involving generative priors.
- Classical template and geometry-driven constraints: Joint discrete-continuous optimization over features, plane parameters, and camera hypotheses (Planar Surface Reconstruction (Jin et al., 2021)) enables robust geometric matching and pose estimation even under minimal overlap and pose ambiguity.
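To make the pose-by-inversion idea from the first item concrete, the sketch below freezes a pose-conditioned novel-view diffusion model and optimizes a relative pose against the standard denoising objective. Here diffusion_eps and its conditioning interface are hypothetical stand-ins, and the noise schedule is a toy for illustration.

```python
# Conceptual pose estimation by diffusion inversion; `diffusion_eps` is a
# hypothetical frozen score network, and the schedule below is a toy.
import torch

def estimate_pose(diffusion_eps, src_img, tgt_img, steps: int = 500, lr: float = 1e-2):
    pose = torch.zeros(6, requires_grad=True)            # se(3) relative pose params
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, 1000, (1,))                 # random diffusion timestep
        noise = torch.randn_like(tgt_img)
        alpha = (1 - t.float() / 1000).view(1, 1, 1, 1)  # toy linear schedule
        noisy = alpha.sqrt() * tgt_img + (1 - alpha).sqrt() * noise
        pred = diffusion_eps(noisy, t, cond_img=src_img, pose=pose)
        loss = (pred - noise).pow(2).mean()              # denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pose.detach()                                 # pose minimizing the loss
```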
In all cases, hybridizing generative priors with photometric, geometric, or template-based constraints is critical for metrically accurate, high-fidelity world reconstruction from inconsistent inputs.
7. Evaluation Metrics, Datasets, and Practical Considerations
Key evaluation metrics in this literature include 3D consistency (SLAM reprojection error or Chamfer distance), photometric consistency (flow-based metrics, L1/LPIPS), and semantic quality (CLIP-IQA+, CLIP Aesthetic). WorldScore is used in (Höllein et al., 17 Mar 2026) to combine these into a composite performance index. Notably, the robust non-rigid alignment and backward-warp rendering loss in (Höllein et al., 17 Mar 2026) yielded the highest scores across all metrics and eliminated "ghosts" and blur in qualitative evaluation.
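For reference, a minimal symmetric Chamfer distance between two pointclouds, one of the 3D-consistency metrics above:

```python
# Symmetric Chamfer distance between pointclouds a: (N, 3) and b: (M, 3).
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```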
Practical complexities arise from the high dimensionality of deformation fields, the computational cost of large-scale optimization (typically tens of minutes and significant GPU memory per scene), and the risk of compounding generative inconsistencies in static reconstructions. Runtime/memory and scalability tradeoffs are explicitly documented in implementation details (Höllein et al., 17 Mar 2026). Successful pipelines do not simply rely on feedforward inference but incorporate regularization, bundle adjustment, and explicit correspondence or rendering-based losses wherever possible.
The domain of world reconstruction from inconsistent views is thus characterized by its blend of advanced geometric alignment, neural regularization, generative priors, and robust rendering-based losses. Ongoing innovations in cross-view attention, non-rigid field learning, and rendering-aware diffusion synthesis continue to drive advances in this challenging, foundational area of 3D vision and synthetic scene generation.