Test3R: Learning to Reconstruct 3D at Test Time (2506.13750v1)

Published 16 Jun 2025 in cs.CV

Abstract: Dense matching methods like DUSt3R regress pairwise pointmaps for 3D reconstruction. However, the reliance on pairwise prediction and the limited generalization capability inherently restrict the global geometric consistency. In this work, we introduce Test3R, a surprisingly simple test-time learning technique that significantly boosts geometric accuracy. Using image triplets ($I_1,I_2,I_3$), Test3R generates reconstructions from pairs ($I_1,I_2$) and ($I_1,I_3$). The core idea is to optimize the network at test time via a self-supervised objective: maximizing the geometric consistency between these two reconstructions relative to the common image $I_1$. This ensures the model produces cross-pair consistent outputs, regardless of the inputs. Extensive experiments demonstrate that our technique significantly outperforms previous state-of-the-art methods on the 3D reconstruction and multi-view depth estimation tasks. Moreover, it is universally applicable and nearly cost-free, making it easily applied to other models and implemented with minimal test-time training overhead and parameter footprint. Code is available at https://github.com/nopQAQ/Test3R.

Summary

  • The paper introduces Test3R, a novel test-time learning method that improves the accuracy and consistency of pairwise 3D reconstruction models by adapting them to specific test scenes using a self-supervised geometric consistency objective.
  • Test3R employs Visual Prompt Tuning (VPT) to efficiently adapt a pre-trained model by optimizing only a small set of learnable prompts based on a triplet consistency loss derived from different image pairs in the same scene.
  • Evaluations show that Test3R consistently improves performance over base models like DUSt3R on multiple 3D reconstruction and multi-view depth estimation benchmarks, enhancing detail preservation and reducing outliers.

The paper "Test3R: Learning to Reconstruct 3D at Test Time" (2506.13750) introduces a novel test-time learning technique aimed at improving the accuracy and consistency of 3D reconstruction models, particularly those based on pairwise dense matching like DUSt3R [wang2024dust3r]. The core idea is to adapt a pre-trained model to a specific test scene by optimizing it using a self-supervised objective that leverages geometric consistency between predictions from different image pairs within that scene.

Existing dense matching methods typically predict dense pointmaps for image pairs. For example, DUSt3R takes two images, a reference $I_{ref}$ and a source $I_{src}$, and outputs pointmaps $X^{ref,ref}$ and $X^{src,ref}$ representing the 3D coordinates of pixels in the reference view's coordinate frame. While effective, this pairwise approach can lead to inconsistencies when pointmaps for the same reference view are generated using different source views. As illustrated in Figure 2, pointmaps predicted for $I_1$ using $I_2$ as the source view may differ from those predicted using $I_3$. This cross-pair inconsistency arises because the model processes only two images at a time, limiting the geometric information available and leading to differing visual correspondences for the same 3D points across different source views. This issue, combined with the limited generalization of deep learning models to unseen scenes, results in persistent errors even after subsequent global alignment steps.
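
For concreteness, a DUSt3R-style pairwise predictor can be viewed as a function that maps two images to two pointmaps (plus confidence maps), both expressed in the reference camera's frame. The following is a minimal sketch of that contract, assuming a four-output interface; it is not the actual DUSt3R API.

from typing import Tuple
import torch

def pairwise_pointmaps(model: torch.nn.Module,
                       img_ref: torch.Tensor,
                       img_src: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    # img_ref, img_src: (3, H, W) RGB images of the same scene.
    # Returns X_ref_ref and X_src_ref, each of shape (H, W, 3): per-pixel 3D points
    # for the reference and source views, both in the reference camera's frame.
    X_ref_ref, X_src_ref, conf_ref, conf_src = model(img_ref, img_src)
    return X_ref_ref, X_src_ref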

Test3R addresses this by introducing a test-time adaptation phase. For a given test scene with multiple images $\{I_i\}_{i=1}^N$, Test3R selects image triplets $(I_{ref}, I_{src1}, I_{src2})$. From this triplet, it forms two pairs: $(I_{ref}, I_{src1})$ and $(I_{ref}, I_{src2})$. The pre-trained model (e.g., DUSt3R) is used to predict pointmaps for the reference view from both pairs, yielding $X^{ref,ref}_1$ (from $(I_{ref}, I_{src1})$) and $X^{ref,ref}_2$ (from $(I_{ref}, I_{src2})$). The central self-supervised objective for test-time training is to minimize the difference between these two pointmaps:

$$\ell = \|X^{ref,ref}_1 - X^{ref,ref}_2\|$$

This objective encourages the model to produce consistent 3D coordinates for the pixels in $I_{ref}$, regardless of whether $I_{src1}$ or $I_{src2}$ was used as the source view. This forces the model to reconcile the geometric information from different pairs, improving both precision and consistency.
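
As a concrete illustration, the consistency objective can be computed directly on two pointmap tensors predicted for the same reference view. The snippet below is a minimal sketch; the exact norm and reduction used in the paper (e.g., per-pixel mean versus sum) are assumptions here.

import torch

def pairwise_consistency_loss(pointmap_a: torch.Tensor, pointmap_b: torch.Tensor) -> torch.Tensor:
    # pointmap_a, pointmap_b: (H, W, 3) pointmaps of the same reference view,
    # predicted with two different source views.
    per_pixel = torch.linalg.vector_norm(pointmap_a - pointmap_b, dim=-1)  # per-pixel 3D distance
    return per_pixel.mean()  # scalar loss; mean reduction is an assumption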

To perform this test-time optimization efficiently, without requiring extensive data or risking catastrophic forgetting of the pre-trained model's capabilities, Test3R employs Visual Prompt Tuning (VPT) [jia2022visual]. A small set of learnable prompt tokens is inserted into the encoder layers of the base model's Vision Transformer (ViT) backbone. During test-time training, only these prompt parameters are updated via gradient descent on the triplet consistency loss, while the weights of the original pre-trained model remain frozen.
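
A minimal sketch of deep visual prompt tuning in this setting is shown below: a set of learnable prompt tokens is prepended to the token sequence at each encoder layer, and only these prompts receive gradients. The class and attribute names (PromptedViTEncoder, blocks) and the initialization scheme are illustrative assumptions, not the actual Test3R or DUSt3R code.

import torch
import torch.nn as nn

class PromptedViTEncoder(nn.Module):
    """Wraps a frozen ViT encoder and prepends layer-specific learnable prompts (VPT-deep)."""

    def __init__(self, vit_encoder, num_layers, embed_dim, prompt_len=32):
        super().__init__()
        self.vit_encoder = vit_encoder  # frozen backbone, assumed to expose a .blocks list
        for p in self.vit_encoder.parameters():
            p.requires_grad = False
        # One learnable prompt tensor per encoder layer (layer-specific prompts).
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(1, prompt_len, embed_dim) * 0.02) for _ in range(num_layers)]
        )
        self.prompt_len = prompt_len

    def forward(self, tokens):
        # tokens: (B, N, D) patch tokens from the frozen patch embedding.
        for block, prompt in zip(self.vit_encoder.blocks, self.prompts):
            batch = tokens.shape[0]
            # Prepend this layer's prompts, run the block, then drop the prompt outputs.
            x = torch.cat([prompt.expand(batch, -1, -1), tokens], dim=1)
            tokens = block(x)[:, self.prompt_len:, :]
        return tokens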

The practical implementation involves:

  1. Loading a pre-trained model: The base model, like DUSt3R, is loaded.
  2. Adding Learnable Prompts: Learnable prompt tokens are initialized and integrated into the Transformer layers of the encoder. The paper explores different prompt lengths and insertion strategies (e.g., inserting distinct prompts at each encoder layer, or using the same prompts carried through layers). The results suggest that layer-specific prompts (Test3R) generally perform better than using prompts only in the first layer (Test3R-S), as feature distributions differ across layers. Prompt length also impacts performance, with a length of 32 per layer showing good results in ablations.
  3. Forming Triplets: For a given test scene with $N$ images, image triplets $(I_i, I_j, I_k)$ are sampled, where $I_i$ serves as the reference view, and $I_j$ and $I_k$ are distinct source views. Since the number of possible triplets grows as $O(N^3)$, a subset is randomly sampled (e.g., 165 triplets in the experiments) for computational efficiency; a simple sampling sketch follows this list.
  4. Test-Time Optimization: An optimizer (e.g., Adam [kingma2014adam]) is configured to update only the parameters of the learnable prompts. For each sampled triplet, the model performs two forward passes (for $(I_i, I_j)$ and $(I_i, I_k)$) to obtain the pointmaps $X^{i,i}_1$ and $X^{i,i}_2$. The consistency loss $\|X^{i,i}_1 - X^{i,i}_2\|$ is computed, and gradients are backpropagated only through the prompts. This optimization runs for a small number of iterations (e.g., one epoch over the sampled triplets for the scene).
  5. Final Reconstruction: After test-time tuning, the model with the optimized prompts is used to predict all necessary pairwise pointmaps for the scene images. These pointmaps are then fed into the original global alignment pipeline (e.g., DUSt3R's optimization process) to obtain the final consistent 3D reconstruction and camera poses.
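
The sketch below illustrates steps 3 and 4: sampling reference-anchored triplets from the scene images and building an optimizer over the prompt parameters only. The helper names and the "prompt" parameter-name filter are assumptions for illustration; the learning rate is dataset-specific per the paper (Appendix A).

import itertools
import random
import torch

def select_triplets(scene_images, max_triplets=165, seed=0):
    # Enumerate (reference, source1, source2) index triplets with distinct views,
    # then randomly subsample to keep the test-time training cost bounded.
    n = len(scene_images)
    index_triplets = [
        (i, j, k)
        for i in range(n)
        for j, k in itertools.combinations([m for m in range(n) if m != i], 2)
    ]
    random.Random(seed).shuffle(index_triplets)
    return [(scene_images[i], scene_images[j], scene_images[k])
            for i, j, k in index_triplets[:max_triplets]]

def build_prompt_optimizer(model, lr=1e-4):
    # Update only the learnable prompt parameters; the backbone stays frozen.
    prompt_params = [p for name, p in model.named_parameters() if "prompt" in name]
    return torch.optim.Adam(prompt_params, lr=lr)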

Here is a simplified pseudocode representation of the test-time training process:

import torch

def test_time_train(model, scene_images, optimizer, num_epochs=1, max_triplets=165):
    # The optimizer is assumed to contain only the learnable prompt parameters;
    # the pre-trained backbone weights stay frozen throughout.
    model.eval()  # disable dropout/batchnorm updates; freezing is handled via the optimizer
    optimizer.zero_grad()

    # Sample reference-anchored triplets from the scene images
    triplets = select_triplets(scene_images, max_triplets)

    for epoch in range(num_epochs):
        for ref_img, src1_img, src2_img in triplets:
            # Predict the reference-view pointmap from the (ref, src1) pair;
            # the learnable prompts are used inside the encoder
            X_ref_ref_1, _, _, _ = model(ref_img, src1_img)

            # Predict the reference-view pointmap from the (ref, src2) pair
            X_ref_ref_2, _, _, _ = model(ref_img, src2_img)

            # Self-supervised consistency loss between the two pointmaps
            # (the paper's exact norm/reduction may differ)
            loss = torch.norm(X_ref_ref_1 - X_ref_ref_2, p=2)

            # Backpropagate and update ONLY the prompt parameters
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # The model's prompts are now tuned for this scene; use the tuned model for
    # downstream tasks (pairwise prediction followed by global alignment)
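
A hypothetical end-to-end use of the routine above might look as follows; load_pretrained_dust3r, load_scene_images, make_pairs, and global_alignment are placeholders, not actual APIs.

# Hypothetical usage sketch (helper functions are placeholders).
model = load_pretrained_dust3r()            # frozen backbone with learnable prompts attached
scene_images = load_scene_images("scene/")
optimizer = build_prompt_optimizer(model, lr=1e-4)  # lr is dataset-specific per the paper

test_time_train(model, scene_images, optimizer)

# After tuning, predict all required pairwise pointmaps with the adapted model and
# feed them to the standard global alignment to recover the reconstruction and poses.
pairwise_outputs = [model(img_a, img_b) for img_a, img_b in make_pairs(scene_images)]
reconstruction, poses = global_alignment(pairwise_outputs)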

In terms of real-world applications, Test3R directly enhances the quality of 3D reconstruction and multi-view depth estimation pipelines that rely on pairwise predictions. Improved 3D reconstruction is crucial for applications like creating digital twins of environments, virtual and augmented reality content generation, urban planning, and architectural modeling. More accurate depth estimation benefits autonomous navigation (robotics, self-driving cars), industrial inspection, and visual effects.

Implementation considerations include:

  • Computational Overhead: While parameter-efficient due to VPT, test-time training does add processing time for each new scene. Table 3 reports an overhead of around 30 seconds per scene for tuning, in addition to the standard inference time. This is a trade-off for improved accuracy.
  • Memory Usage: The additional memory for prompts is negligible compared to the base model (Table 3).
  • Triplet Selection: For scenes with many images, an effective strategy for sampling triplets is important to balance computational cost and coverage of different viewpoints and baseline lengths.
  • Hyperparameter Tuning: The learning rate for prompt optimization is specific to the dataset and the number of images, requiring some empirical tuning (Appendix A provides values used).
  • Dependency on Base Model: Test3R's performance is built upon the capabilities of the pre-trained base model (e.g., DUSt3R). It improves consistency and adaptation but doesn't replace the fundamental learning of geometric relationships performed by the base model.

The paper demonstrates the effectiveness of Test3R through extensive experiments on 3D reconstruction (7Scenes [shotton2013scene], NRGBD [azinovic2022neural]) and multi-view depth estimation (DTU [aanaes2016large], ETH3D [schops2017multi]) benchmarks. Quantitatively (Tables 1 and 2), Test3R consistently outperforms the base DUSt3R model and achieves competitive or state-of-the-art results compared to other methods, even surpassing some that require ground-truth camera parameters or domain-specific training. Qualitatively (Figures 3, 4, and 6), Test3R produces reconstructions with fewer outliers and better preservation of fine-grained details, demonstrating improved consistency and accuracy. The generalization study (Table 4) shows that Test3R can be applied to other related models, such as MASt3R [leroy2024grounding] and MonST3R [zhang2024monst3r], yielding performance improvements and indicating its broad applicability.

In summary, Test3R provides a simple yet powerful approach to leverage multi-view information at test time to improve the robustness and accuracy of pairwise dense matching models for 3D reconstruction, using efficient prompt tuning to adapt to unseen scenes.
