
Scene Coordinate Reconstruction Priors (2510.12387v1)

Published 14 Oct 2025 in cs.CV

Abstract: Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.

Summary

  • The paper introduces a probabilistic formulation of SCR training as maximum likelihood estimation that integrates hand-crafted and learned priors.
  • It employs a 3D point cloud diffusion model along with Laplace and Wasserstein priors to enhance depth accuracy and camera pose estimation in indoor scenes.
  • Experimental results demonstrate significant improvements in registration rates, novel view synthesis (up to +1.1dB PSNR), and robustness under sparse data conditions.

Scene Coordinate Reconstruction Priors: Probabilistic Regularization for SCR Models

Overview

The paper "Scene Coordinate Reconstruction Priors" (2510.12387) introduces a probabilistic framework for training Scene Coordinate Regression (SCR) models, enabling the integration of high-level reconstruction priors into neural Structure-from-Motion (SfM) pipelines. The authors propose both hand-crafted and learned priors—including a 3D point cloud diffusion model—to regularize SCR training, improving scene geometry, camera pose estimation, and downstream tasks such as novel view synthesis and relocalization. The approach is demonstrated on ACE, ACE0, and GLACE frameworks, showing consistent improvements across multiple indoor datasets.

Probabilistic Reformulation of SCR Training

The core contribution is the reformulation of SCR training as a maximum likelihood estimation problem. The SCR model $f$ predicts 3D scene coordinates $\mathbf{y}_i$ for image patches $\mathbf{p}_i$, and training is cast as maximizing the posterior $p(\mathbf{y} \mid \mathcal{I}_M, \mathbf{h}^*)$, where $\mathcal{I}_M$ are the mapping images and $\mathbf{h}^*$ are their poses. The loss function is decomposed into a reprojection error (likelihood term) and a regularization term (prior):

$$-\log p(\mathbf{y} \mid \mathcal{I}_M, \mathbf{h}^*) \propto L_\text{reproj} - \log p(\mathbf{y})$$

This formulation allows for the seamless integration of priors that encode geometric plausibility, either via explicit depth distributions or learned models.
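The decomposition above can be sketched as a single training objective. This is an illustrative numpy sketch, not the paper's implementation: `scr_map_loss`, the `weight` balancing factor, and the reduction by mean are assumptions for clarity.

```python
import numpy as np

def scr_map_loss(pred_points, reproj_errors, log_prior, weight=1.0):
    """Negative log-posterior for SCR training: a reprojection
    (likelihood) term plus a prior over predicted scene coordinates.

    pred_points:   (N, 3) predicted scene coordinates y_i
    reproj_errors: (N,) per-point reprojection errors (likelihood term)
    log_prior:     callable mapping (N, 3) points -> (N,) log p(y_i)
    weight:        hypothetical factor balancing prior vs. reprojection
    """
    l_reproj = reproj_errors.mean()
    l_prior = -log_prior(pred_points).mean()  # -log p(y) regularizer
    return l_reproj + weight * l_prior

# With a uniform (uninformative) prior the loss reduces to the
# plain reprojection error, recovering standard SCR training.
pts = np.zeros((4, 3))
errs = np.array([1.0, 2.0, 3.0, 2.0])
loss = scr_map_loss(pts, errs, log_prior=lambda y: np.zeros(len(y)))
# loss == errs.mean() == 2.0
```

Swapping in a non-trivial `log_prior` is all that changes when moving from plain SCR to the regularized variants described below.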

Hand-Crafted Depth Distribution Priors

Two hand-crafted priors are introduced:

  • Laplace Negative Log-Likelihood (NLL): Encourages predicted depths to follow a Laplace distribution fitted to ground truth data, penalizing implausible depth values.
  • Wasserstein Distance (WD): Minimizes the Wasserstein distance between the predicted depth distribution and the target Laplace distribution, enforcing both mean and variance constraints.

These priors are lightweight and can be applied per-pixel or per-batch, leveraging the ACE framework's random sampling of scene points.
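The two hand-crafted priors can be sketched as follows. The Laplace NLL follows the standard density; the Wasserstein term is shown here via a common 1-D quantile-matching trick and is an illustrative estimator, not necessarily the paper's exact formulation. Parameter names (`mu`, `b`, `n_quantiles`) are assumptions.

```python
import numpy as np

def laplace_nll(depths, mu, b):
    """Per-value negative log-likelihood under Laplace(mu, b).
    mu and b would be fitted offline to ground-truth depth statistics."""
    return np.log(2 * b) + np.abs(depths - mu) / b

def wasserstein_to_laplace(depths, mu, b, n_quantiles=256):
    """Squared 1-D quantile distance between the empirical depth
    distribution of a batch and a target Laplace(mu, b), so both the
    mean and the spread of predicted depths are constrained.
    Illustrative sketch of the Wasserstein prior."""
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # Laplace inverse CDF: mu - b * sgn(q - 1/2) * ln(1 - 2|q - 1/2|)
    target = mu - b * np.sign(qs - 0.5) * np.log(1 - 2 * np.abs(qs - 0.5))
    sample = np.quantile(depths, qs)
    return np.mean((sample - target) ** 2)

# NLL is minimal (log 2b) at the distribution's center...
center_nll = laplace_nll(np.array([2.0]), mu=2.0, b=0.5)
# ...and a degenerate batch (all depths equal) still pays a
# Wasserstein penalty because it has no spread.
wd = wasserstein_to_laplace(np.full(100, 2.0), mu=2.0, b=0.5)
```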

RGB-D Depth Priors

For scenes with measured depth (RGB-D), the prior is centered at the ground truth depth $d_i^*$ for each pixel, with a narrow bandwidth to enforce strong geometric consistency. This is implemented as:

$$\log p(\mathbf{y}_i) \propto \log \text{Lap}(d_i \mid d_i^*, b')$$

where $b'$ is a small constant (e.g., 10 cm).
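This per-pixel term is a direct instance of the Laplace log-density; a minimal sketch, with the 10 cm bandwidth taken from the text and the function name assumed:

```python
import numpy as np

def rgbd_log_prior(pred_depths, measured_depths, b=0.10):
    """Per-pixel Laplace log-prior centered at the measured RGB-D
    depth, with a narrow bandwidth b (metres) enforcing tight
    geometric consistency. b=0.10 follows the ~10 cm value in the text."""
    return -np.log(2 * b) - np.abs(pred_depths - measured_depths) / b

# Maximal log-prior when prediction matches the sensor depth exactly.
v = rgbd_log_prior(np.array([1.5]), np.array([1.5]))
```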

Learned 3D Point Cloud Diffusion Prior

A key innovation is the use of a 3D point cloud diffusion model as a learned prior. The model is trained offline on ScanNet scenes to capture plausible indoor geometries. During SCR training, the frozen diffusion model provides a gradient of the log-likelihood for the current point cloud, nudging the SCR predictions toward realistic scene layouts.

Figure 1: System overview showing SCR training with reprojection loss and regularization via depth distribution or diffusion priors.

The diffusion prior is only applied after sufficient SCR iterations (post 5k), aligning the diffusion time steps with the SCR optimization trajectory. The prior is masked to exclude points with low reprojection error, focusing regularization on ambiguous regions.

Figure 2: Comparison of ACE training and diffusion process, motivating the alignment of time steps for effective regularization.

Implementation Details

  • Diffusion Model Architecture: PVCNN is used for point cloud encoding, with timestep embedding modifications for diffusion.
  • Training Protocol: 5,120 points are sampled per scene, with augmentations and normalization. The model is trained for 100k iterations on a single V100 GPU.
  • Integration: During SCR mapping, the diffusion prior is applied every $k$ iterations (typically $k=4$ for efficiency), with gradient normalization to balance regularization and reprojection loss.
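The scheduling and masking logic described above can be sketched as one update step. This assumes a frozen diffusion model exposed as a `score_fn` (gradient of the log-likelihood); the start iteration, period $k=4$, and 30-pixel mask threshold come from the text, while `step_size` and the exact normalization are illustrative assumptions.

```python
import numpy as np

def diffusion_prior_step(points, reproj_err, score_fn, iteration,
                         k=4, start_iter=5000, err_mask_px=30.0,
                         step_size=1e-3):
    """Nudge predicted scene points along the frozen diffusion model's
    score, as a regularizer during SCR mapping.

    Applied only after `start_iter` SCR iterations and every k-th
    iteration, and only to points whose reprojection error exceeds
    `err_mask_px` (well-constrained points are left untouched). The
    score is normalized so the regularizer does not overwhelm the
    reprojection loss. `score_fn` stands in for the trained
    point-cloud diffusion model.
    """
    if iteration < start_iter or iteration % k != 0:
        return points
    mask = reproj_err > err_mask_px        # regularize ambiguous points only
    score = score_fn(points)               # (N, 3) grad of log p(points)
    norm = np.linalg.norm(score) + 1e-8
    update = step_size * score / norm      # gradient normalization
    out = points.copy()
    out[mask] += update[mask]
    return out

# Point 0 (high reprojection error) is nudged; point 1 is masked out.
pts = np.zeros((2, 3))
err = np.array([50.0, 10.0])
moved = diffusion_prior_step(pts, err, lambda p: np.ones_like(p),
                             iteration=5004)
# Off-schedule iterations leave the points unchanged.
skipped = diffusion_prior_step(pts, err, lambda p: np.ones_like(p),
                               iteration=5001)
```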

Experimental Results

Structure-from-Motion

On ScanNet and Indoor6, the proposed priors yield:

  • Higher registration rates: Up to +4.7% on Indoor6 with diffusion prior.
  • Improved pose accuracy: Lower ATE/RPE and median errors.
  • Better novel view synthesis: PSNR increases up to +1.1dB.

Relocalization

On 7Scenes and Indoor6:

  • ACE + Diffusion prior: Improves relocalization accuracy, especially on challenging scenes (e.g., +4.1% on Stairs).
  • GLACE + Diffusion prior: Consistent improvements, demonstrating generality across SCR frameworks.
  • Mapping time: The diffusion prior adds modest overhead (~3 minutes for ACE).

    Figure 3: Diffusion prior regularizes ACE training on the Stairs scene, yielding coherent geometry and improved pose accuracy.

Point Cloud Quality

Depth evaluation shows substantial reductions in outlier points and overall error metrics when using priors, with the diffusion prior outperforming hand-crafted alternatives.

Figure 4: Point clouds generated by the diffusion model and ScanNet, illustrating the learned prior's plausibility.

Ablations

  • Encoder architecture: PVCNN outperforms Pointwise-Net, highlighting the importance of structural encoding.
  • Mask threshold: Optimal regularization is achieved by masking points with reprojection error below 30 pixels.
  • Efficiency: Applying diffusion every 4 iterations balances accuracy and runtime.

    Figure 5: Qualitative comparison of point cloud encoders, showing improved structure with PVCNN.

Robustness to Scarce Data

Subsampling mapping frames degrades performance, but ACE with the diffusion prior maintains higher accuracy, demonstrating the prior's regularization strength.

Figure 6: Mapping sample rate vs. accuracy, showing ACE-Diff's robustness to reduced data.

Outdoor Scenes

Preliminary results on Cambridge Landmarks indicate small improvements with indoor-trained priors, but highlight the need for more expressive models and diverse training data for outdoor environments.

Practical and Theoretical Implications

The probabilistic framework enables principled integration of priors into SCR training, improving robustness in ambiguous or under-constrained regions. The learned diffusion prior demonstrates that even low-fidelity generative models can provide effective regularization for scene reconstruction. The approach is modular, applicable to various SCR frameworks, and does not impact test-time efficiency.

Theoretically, the work bridges generative modeling and geometric reconstruction, leveraging score-based diffusion models for 3D scene regularization. The empirical results suggest that high-level priors can mitigate degeneracies inherent in classical and neural SfM pipelines.

Future Directions

  • Outdoor scene priors: Requires larger, more diverse datasets and expressive architectures.
  • Conditional diffusion models: Incorporating additional signals (e.g., semantics, layout) could further improve regularization.
  • Efficient architectures: Balancing fidelity and runtime remains a challenge for large-scale scenes.

Conclusion

The paper presents a rigorous probabilistic approach to SCR training, enabling the integration of both hand-crafted and learned priors. The proposed regularization strategies yield more coherent scene representations, improved camera pose estimation, and enhanced downstream performance, with minimal impact on efficiency. The diffusion prior, in particular, demonstrates the utility of generative models as regularizers in 3D vision tasks, opening avenues for further research in scene-level priors and efficient 3D generative modeling.

Figure 7: Qualitative results on Indoor6, showing reduced noise and improved structure with ACE+Diffusion prior.
