Learning 3D Reconstruction with Priors in Test Time

Published 4 Apr 2026 in cs.CV | (2604.03878v1)

Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.


Authors (4)

Summary

The paper introduces Test-time Constrained Optimization (TCO), which leverages camera and depth priors as differentiable constraints to refine multiview 3D reconstruction.
It decouples geometric consistency and prior adherence by optimizing only the shared decoder via LoRA adaptation, ensuring architecture-agnostic improvements.
Empirical results demonstrate over 50% error reduction on benchmarks like ETH3D and DTU, outperforming retrained prior-aware methods.

Test-Time Constrained Optimization for 3D Reconstruction with Priors

Introduction

The paper "Learning 3D Reconstruction with Priors in Test Time" (2604.03878) presents a novel framework for leveraging auxiliary scene and camera priors to enhance multiview 3D reconstruction, without modifying or retraining the original Multi-View Transformer (MVT) architectures. The approach, termed Test-time Constrained Optimization (TCO), treats such priors (e.g., camera poses, depth maps, intrinsics) as differentiable constraints on the network predictions. During inference, the framework performs task-specific optimization of the pre-trained network parameters, guided by a self-supervised cross-view consistency objective and penalty terms corresponding to available priors.

This methodology addresses two fundamental challenges in 3D geometric vision: (1) effectively leveraging available priors at inference in a model-agnostic manner, and (2) maintaining architectural flexibility and avoiding costly retraining for every new prior type or network. The method is empirically validated on several 3D vision benchmarks, including ETH3D, DTU, 7-Scenes, NRGBD, and ScanNet, with quantitative and qualitative results that establish strong improvements over image-only MVTs as well as retrained prior-aware feed-forward methods.

Figure 1: Overview of the TCO framework: standard MVTs output geometry solely from RGB; feed-forward prior-aware approaches retrain MVTs to take priors as input; TCO instead imposes priors as constraints only at test time with no architecture changes.

Methodology

The core of the TCO framework is the separation of cross-view geometric consistency and explicit prior adherence as two complementary aspects of inference-time network tuning. Given an MVT pre-trained only on RGB images, TCO operates as follows:

Objective Function: A self-supervised prediction compatibility loss is constructed using differentiable rendering. The consistency among the predictions (e.g., depth, pose, intrinsics from different views) is enforced via a photometric or geometric loss between the renderings of the source and target views. The differentiable renderer 2D Gaussian Splatting (2DGS) produces projections from the predicted scene geometry to target images.
Priors as Constraints: Available priors (e.g., externally or sensor-derived camera poses, intrinsics, depth maps) are formulated as penalty terms over the model’s outputs, e.g., angular, translation, and focal length discrepancies, or depth alignment loss with scale and shift adjustment.
Optimization and Fine-Tuning: Only the shared decoder of the MVT is fine-tuned using LoRA adaptation, while all task-specific prediction heads and the encoder remain frozen, maximizing transfer across modalities. The composite loss includes the self-supervised compatibility objective and weighted prior constraints.
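
The prior penalty terms listed above can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names, the relative form of the focal term, and the use of a mean absolute error after alignment are assumptions made for illustration. An actual test-time optimizer would also need these terms in a differentiable framework; NumPy is used here only to show the quantities involved.

```python
import numpy as np

def rotation_angle_error(R_pred, R_prior):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_prior) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_pred, t_prior):
    """Euclidean distance between predicted and prior camera translations."""
    return float(np.linalg.norm(t_pred - t_prior))

def focal_error(f_pred, f_prior):
    """Relative focal-length discrepancy (assumed form, not from the paper)."""
    return abs(f_pred - f_prior) / f_prior

def aligned_depth_error(d_pred, d_prior, valid=None):
    """Mean absolute depth error after least-squares scale and shift alignment.

    Solves min over (s, b) of || s * d_pred + b - d_prior ||^2 on valid pixels,
    then reports the residual error together with the fitted s and b.
    """
    if valid is None:
        valid = np.isfinite(d_prior) & (d_prior > 0)
    x = d_pred[valid].ravel()
    y = d_prior[valid].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean(np.abs(s * x + b - y))), float(s), float(b)
```

The scale and shift fit matters because a depth prior and a network prediction need not share the same units or offset; aligning first penalizes only the disagreement in shape.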

This plug-and-play inference scheme is extensible to arbitrary priors and MVT architectures, given only the appropriate loss terms, and can leverage any available auxiliary data at test time.
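
In symbols, the composite loss described above might be written as follows. The notation is an assumption introduced here, not taken from the paper: $f_{\theta}$ is the pre-trained MVT applied to the input images $I_1, \dots, I_N$, $\Delta\theta$ are the LoRA parameters of the shared decoder, $\mathcal{L}_{\text{compat}}$ is the self-supervised compatibility objective, $\mathcal{P}$ is the set of available priors, and $\lambda_p$ are their weights.

$$
\min_{\Delta\theta} \; \mathcal{L}_{\text{compat}}\big(f_{\theta+\Delta\theta}(I_1, \dots, I_N)\big) \; + \; \sum_{p \in \mathcal{P}} \lambda_p \, \mathcal{L}_{p}\big(f_{\theta+\Delta\theta}(I_1, \dots, I_N)\big)
$$

Under this reading, supporting a new prior amounts to adding one more term $\mathcal{L}_p$ to the sum, which is what makes the scheme extensible without touching the architecture.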

Figure 2: Qualitative comparison: TCO-VGGT (middle column) corrects structural errors present in image-only reconstructions (left), outperforming the feed-forward prior-aware methods Pow3R and MapAnything (right).

Empirical Evaluation

Point Map Estimation

Comprehensive experiments on sparse-view and wide-baseline datasets (ETH3D, DTU, 7-Scenes, NRGBD) demonstrate that TCO with camera and/or intrinsics priors reduces reconstruction errors by over 50% relative to base models. In every evaluated setting, TCO not only surpasses feed-forward models with retrained prior input pipelines (e.g., Pow3R, MapAnything), but also yields architecture-agnostic gains when applied to different MVT backbones (VGGT, $\pi^3$). Notably:

On ETH3D, TCO cuts the mean accuracy and completion error from 0.280/0.305 (VGGT) to 0.114/0.116 with all priors, and to 0.071/0.065 with the more scalable $\pi^3$ backbone.
On DTU and other datasets, the improvements are consistent, and the effect of priors is strongly amplified when used in tandem with the cross-view compatibility objective.
Qualitative results establish that TCO corrects large-scale structural errors, especially in challenging or ambiguous geometry.

Figure 3: Fine-grained qualitative overlays illustrate that TCO-VGGT closely matches the ground truth geometry, achieving smaller errors than retrained prior-aware competitors.
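
As a quick check of the "over 50%" claim against the ETH3D numbers quoted above, the relative reduction for the VGGT backbone works out as follows (first number of each pair, then second):

$$
\frac{0.280 - 0.114}{0.280} \approx 59.3\%, \qquad \frac{0.305 - 0.116}{0.305} \approx 62.0\%
$$

The $\pi^3$ result (0.071/0.065) cannot be converted the same way from this summary alone, since the base $\pi^3$ errors are not quoted here.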

Camera Pose Estimation

In camera pose estimation (ScanNet), where the prior is a set of depth maps, TCO optimizes the pose predictions by satisfying geometric constraints, consistently outperforming base and retrained prior-aware approaches across baseline separations. For example, with sparse keyframes, the Absolute Trajectory Error (ATE) is reduced from 0.0094 (VGGT) to 0.0072 (TCO-VGGT), illustrating effective cross-modality transfer under the TCO scheme.
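
For readers unfamiliar with the metric, the sketch below shows one common way to compute ATE: align the predicted camera centers to the ground truth with a least-squares similarity transform (Umeyama alignment), then take the root-mean-square position error. The paper's exact evaluation protocol is not stated in this summary, so the alignment type and the RMSE form are assumptions, and the function names are illustrative.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)          # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # rule out reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)   # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def absolute_trajectory_error(pred_centers, gt_centers):
    """RMSE between aligned predicted camera centers and ground truth."""
    s, R, t = umeyama_alignment(pred_centers, gt_centers)
    aligned = s * pred_centers @ R.T + t
    sq_dist = np.sum((aligned - gt_centers) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq_dist)))
```

Because of the alignment step, a prediction that differs from the ground truth only by a global rotation, translation, and scale scores an ATE of zero; the metric measures the shape of the trajectory rather than its absolute frame.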

Test-Time Optimization Analysis

The ablation studies confirm that adapting only the shared decoder (and not the prediction heads) is key to extracting the synergistic potential of prior constraints and self-supervised consistency. TCO remains robust to moderate prior noise, and performance improves with the number of test-time optimization steps but plateaus after a moderate number of iterations.
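
To make concrete what adapting a layer with LoRA means, here is a minimal NumPy sketch of a single linear layer with a frozen weight and a trainable low-rank update. The class name, rank, and scaling value are hypothetical and not taken from the paper; a real setup would apply this to the decoder's layers in a deep-learning framework.

```python
import numpy as np

class LoRALinear:
    """Linear layer with frozen weight W and low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero-initialized
        self.scale = alpha / rank

    def effective_weight(self):
        # The pre-trained weight plus the scaled low-rank correction.
        return self.weight + self.scale * self.B @ self.A

    def __call__(self, x):
        return x @ self.effective_weight().T

    def num_trainable(self):
        return self.A.size + self.B.size
```

With `B` initialized to zero, the adapted layer reproduces the pre-trained output exactly before any optimization step, so test-time tuning starts from the base model's prediction and only a small number of parameters move.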

Figure 4: TCO performance scaling curve on ETH3D: test-time optimization yields rapid gains in early iterations, stabilizing as the number of steps increases.

Differentiable Rendering and Compatibility

Using 2DGS for the prediction compatibility objective is integral: the paper justifies heuristic choices for the opacity, radii, and parameterization of the Gaussians, finding that constraining the MVT via its predicted geometry, rather than directly optimizing the renderer, leads to sharper improvements.
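
The idea of deriving renderer parameters from MVT predictions, rather than optimizing them, can be sketched as follows. The specific heuristics here (one primitive per pixel, a radius equal to the footprint of one pixel at its depth, and a fixed opacity) are illustrative guesses and not the paper's stated choices; the function names are hypothetical, and no actual 2DGS rasterization is performed.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift an HxW depth map to HxWx3 camera-frame points using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def heuristic_primitives(depth, K, cam_to_world, opacity=1.0):
    """Build per-pixel primitive parameters directly from predicted geometry.

    Returns world-space centers (N, 3), radii (N,), and opacities (N,).
    """
    pts_cam = unproject_depth(depth, K).reshape(-1, 3)
    centers = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    radii = depth.reshape(-1) / K[0, 0]        # assumed: size of one pixel at that depth
    opacities = np.full(len(centers), opacity)  # assumed: constant opacity
    return centers, radii, opacities
```

Since every quantity is a fixed function of the predicted depth, intrinsics, and pose, a rendering loss built on top of such primitives sends its gradient back to the network outputs instead of to free renderer parameters.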

Figure 5: Visual results from 2DGS rendering pipeline, where heuristic parameterization (from MVT predictions) yields rendered images, depth, and normals closely matching ground-truth.

Implications and Future Directions

The TCO framework introduces a principled paradigm for test-time inference in vision models leveraging geometric priors. The practical advantages are clear: prior data can be exploited without costly retraining or architecture engineering. Theoretical implications further suggest that cross-modal self-supervised losses and plug-and-play constraints at test time are broadly applicable to other tasks, including SLAM, object pose estimation, or neural rendering, given the modular formulation.

Open challenges include reducing test-time computational overhead (the current optimization incurs higher wall-clock inference time than pure feed-forward operation) and extending the scheme to broader prior types (e.g., semantics or partial geometry). Integration with reinforcement learning or energy-based methods for further test-time adaptation remains a promising direction.

Conclusion

Test-time Constrained Optimization provides an efficient, model-agnostic, and practical means of harnessing externally available priors to boost the performance of pre-trained 3D MVTs at inference. By decoupling architecture design from the choice of priors and focusing on cross-view compatibility, TCO advances both the state of the art on standard benchmarks and the theoretical landscape of flexible, plug-and-play geometric perception (2604.03878).
