SAIL-Recon: Scalable SfM via Anchor Localization

Updated 4 July 2026

The paper introduces SAIL-Recon, which unifies scene regression and visual localization by building a neural scene representation from a subset of anchor images.
It employs a feed-forward Transformer architecture that processes anchors to create a compact map and then reconstructs query images through localization-conditioned attention.
The method demonstrates competitive accuracy and efficiency on large-scale SfM by balancing dense scene regression with robust, attention-based camera and depth estimation.

SAIL-Recon is a feed-forward Transformer for large-scale Structure-from-Motion (SfM) that augments scene regression with visual localization. It is designed for large unordered image collections or long videos, where the objective is to jointly recover camera extrinsics $T_i$, camera intrinsics $K_i$, per-image depth $D_i$, and dense scene coordinate maps $S_i$, while maintaining the robustness of scene regression under extreme viewpoint changes and scaling to scenes containing thousands of images (Deng et al., 25 Aug 2025). Its central idea is to compute a neural scene representation from a subset of anchor images and then reconstruct all remaining images conditioned on that representation, thereby replacing joint all-image regression with an anchor-based map-building stage followed by localization-conditioned reconstruction (Deng et al., 25 Aug 2025).

1. Problem setting and conceptual basis

SAIL-Recon addresses large-scale SfM from an image set $\{\mathcal{I}_i\}_{i=1}^M$ , where the full regression problem is written as

$(\mathcal{R},\{ T_i, K_i, D_i, S_i\}_{i=1}^M) = \mathcal{T}_{\theta}(\{\mathcal{I}_i\}_{i=1}^M),$

with $\mathcal{R}$ denoting a learned neural scene representation, $T_i \in \mathbb{R}^{4\times 4}$ the camera extrinsic matrix, $K_i \in \mathbb{R}^{3\times 3}$ the intrinsic matrix, $D_i \in \mathbb{R}^{H\times W}$ the depth map, and $S_i \in \mathbb{R}^{H\times W\times 3}$ the scene coordinate map (Deng et al., 25 Aug 2025). The paper positions this formulation within the recent scene-regression line of SfM, in which geometry and camera parameters are predicted directly from images rather than recovered through the classical sequence of feature detection, matching, triangulation, and bundle adjustment (Deng et al., 25 Aug 2025).
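The four per-image outputs are geometrically linked: a depth map, together with intrinsics and extrinsics, determines a scene coordinate map. The sketch below illustrates that relationship with NumPy. It assumes a pinhole camera, depth measured along the optical axis, integer pixel coordinates, and a world-to-camera extrinsic matrix; these conventions and the function name `depth_to_scene_coords` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def depth_to_scene_coords(depth, K, T_world_to_cam):
    """Back-project a depth map (H, W) into world-space points (H, W, 3).

    Assumes a pinhole camera and a world-to-camera extrinsic matrix.
    """
    H, W = depth.shape
    # Pixel grid in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Rays in the camera frame, scaled by depth along the optical axis.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth[..., None]
    # Invert the extrinsics to map camera-frame points into the world frame.
    T_cam_to_world = np.linalg.inv(T_world_to_cam)
    R, t = T_cam_to_world[:3, :3], T_cam_to_world[:3, 3]
    return pts_cam @ R.T + t

# Toy example: 3x4 image, constant depth, camera at the world origin.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
S = depth_to_scene_coords(np.full((3, 4), 2.0), K, np.eye(4))
```

In the network itself, $D_i$ and $S_i$ are predicted by separate output channels rather than derived from one another; the relation above only describes what consistent predictions should satisfy.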

The motivation is computational as much as geometric. Methods such as VGGT are described as handling extreme viewpoint changes well, but their direct joint-attention formulation does not scale favorably with image count; the paper states that such methods “typically cannot handle more than 100 input images on consumer GPUs” (Deng et al., 25 Aug 2025). Existing scaling strategies based on sequential memory tokens or segment-wise reconstruction with later alignment are described as often suffering from drift and depending on additional global alignment (Deng et al., 25 Aug 2025). SAIL-Recon adopts a different principle: use a subset of images as anchors to build a neural map, then localize all remaining images against that map, in analogy with the role of localization in SLAM (Deng et al., 25 Aug 2025).

This suggests a hybridization of two previously separate ideas. Scene regression supplies robustness to wide baselines and difficult geometry, while localization supplies scalability. The paper’s contribution is to unify those functions within a single feed-forward architecture rather than treating localization as an external post-process (Deng et al., 25 Aug 2025).

2. Anchor-based representation and localization-conditioned reconstruction

The defining structural choice in SAIL-Recon is the decomposition of the reconstruction pipeline into anchor processing and query localization. Instead of jointly processing all $M$ input images, the method first selects a subset of anchor images

$\{\mathcal{I}_j\}_{j=1}^{N} \subset \{\mathcal{I}_i\}_{i=1}^{M}, \quad N \ll M,$

for large scenes where $M$ may exceed 1000 (Deng et al., 25 Aug 2025). The paper states that these anchors are uniformly sampled from the full image set (Deng et al., 25 Aug 2025).
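A minimal sketch of uniform anchor selection follows. It reads "uniformly sampled" as evenly spaced by image index, which suits ordered video; drawing indices uniformly at random would be an equally valid reading for unordered collections. The function name `select_uniform_anchors` and the example sizes are hypothetical.

```python
import numpy as np

def select_uniform_anchors(num_images, num_anchors):
    """Return sorted indices of evenly spaced anchor images."""
    if not 1 <= num_anchors <= num_images:
        raise ValueError("need 1 <= num_anchors <= num_images")
    idx = np.linspace(0, num_images - 1, num_anchors)
    return np.unique(np.round(idx).astype(int))

# Example: 100 anchors out of 1000 images; the rest become queries.
anchors = select_uniform_anchors(1000, 100)
queries = np.setdiff1d(np.arange(1000), anchors)
```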

The anchors are processed jointly through the scene-regression Transformer to produce a latent neural scene representation $\mathcal{R}$. This representation functions as an implicit neural map: it is not an explicit point cloud or mesh, but a token-based memory carrying both appearance and geometry cues (Deng et al., 25 Aug 2025). After $\mathcal{R}$ has been built, each remaining image is treated as a query image $\mathcal{I}_q$, and reconstruction is performed conditionally: $(T_q, K_q, D_q, S_q) = \mathcal{T}_{\theta}(\mathcal{I}_q \mid \mathcal{R}).$ Thus the same network that performs scene regression on anchors performs localization-conditioned reconstruction for queries (Deng et al., 25 Aug 2025).

A central technical point is that SAIL-Recon does not use only the final-layer anchor tokens as scene memory. The naive choice

$\mathcal{R} = \{\mathbf{t}_i^{L}\}_{i=1}^{N},$

with $\mathbf{t}_i^{L}$ the final-layer tokens of anchor $i$, is described as suboptimal because of the “significant discrepancy between 2D and 3D feature tokens” (Deng et al., 25 Aug 2025). Instead, the method stores intermediate tokens from every layer: $\mathcal{R} = \{\Phi(\tilde{\mathbf{t}}_i^{l}) \mid i = 1, \dots, N,\ l = 1, \dots, L\},$ where $\tilde{\mathbf{t}}_i^{l}$ are the intermediate anchor tokens after frame-wise attention at layer $l$, and $\Phi$ is a downsampling operator (Deng et al., 25 Aug 2025). This preserves the transition from 2D appearance features to 3D geometry features, which the paper argues is important for correlating a 2D query image with the stored map (Deng et al., 25 Aug 2025).

To keep the representation compact, the method downsamples tokens from each anchor frame. If the tokens of one anchor frame at one layer form a matrix of $P$ tokens with $C$ channels,

$\tilde{\mathbf{t}}_i^{l} \in \mathbb{R}^{P \times C},$

then during training it randomly selects a ratio

$r, \quad 0 < r \le 1,$

of tokens from each anchor frame, yielding

$\Phi(\tilde{\mathbf{t}}_i^{l}) \in \mathbb{R}^{\lfloor rP \rfloor \times C}.$

At test time, the token count is adjusted to balance efficiency and accuracy; the paper reports using about 300 tokens per anchor image in the main benchmarks (Deng et al., 25 Aug 2025).
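The layer-wise memory with random token subsampling described above can be sketched as follows. This is an illustration only: the names `subsample_tokens` and `build_scene_memory`, the toy tensor sizes, and the choice to draw indices independently per layer and per anchor are assumptions, since the text does not specify how indices are shared.

```python
import numpy as np

def subsample_tokens(tokens, ratio, rng):
    """Keep a random subset of rows from a (P, C) token matrix."""
    num_keep = max(1, int(round(ratio * tokens.shape[0])))
    keep = rng.choice(tokens.shape[0], size=num_keep, replace=False)
    return tokens[np.sort(keep)]

def build_scene_memory(layer_tokens, ratio, rng):
    """Build one memory matrix per layer from per-anchor token matrices.

    layer_tokens[l][i] is the (P, C) token matrix of anchor i at layer l.
    Returns a list with one (N * kept, C) array per layer.
    """
    memory = []
    for per_anchor in layer_tokens:
        kept = [subsample_tokens(t, ratio, rng) for t in per_anchor]
        memory.append(np.concatenate(kept, axis=0))
    return memory

# Toy sizes: 3 layers, 4 anchors, 20 tokens per anchor, 8 channels.
rng = np.random.default_rng(0)
layer_tokens = [[rng.standard_normal((20, 8)) for _ in range(4)] for _ in range(3)]
memory = build_scene_memory(layer_tokens, ratio=0.25, rng=rng)
```

With a ratio of 0.25, each anchor contributes 5 of its 20 tokens, so every per-layer memory holds 20 rows in this toy setting.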

3. Backbone architecture and attention mechanism

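SAIL-Recon adopts VGGT as its backbone and augments it with visual localization capabilities (Deng et al., 25 Aug 2025). Each anchor image $\mathcal{I}_i$ is first encoded by DINOv2 into patch tokens

$\mathbf{t}_i^{\mathcal{I}} \in \mathbb{R}^{P \times C}$

and is additionally assigned one camera token

$\mathbf{t}_i^{g} \in \mathbb{R}^{1 \times C}$

and four register tokens

$\mathbf{t}_i^{R} \in \mathbb{R}^{4 \times C}$

(Deng et al., 25 Aug 2025). These tokens are processed through $L$ Transformer layers with alternating frame-wise and global attention: $\tilde{\mathbf{t}}_i^{l} = \mathrm{FrameAttn}^{l}(\mathbf{t}_i^{l-1}), \quad i = 1, \dots, N,$

$\{\mathbf{t}_i^{l}\}_{i=1}^{N} = \mathrm{GlobalAttn}^{l}(\{\tilde{\mathbf{t}}_i^{l}\}_{i=1}^{N}).$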

Frame-wise attention refines each image independently, whereas global attention aggregates cross-view information needed for geometry and pose estimation (Deng et al., 25 Aug 2025).

At the output stage, dense geometry is predicted from per-image tokens using a DPT head: $(D_i, S_i, C_i^{D}, C_i^{S}) = \mathrm{DPT}(\mathbf{t}_i^{\mathcal{I}}),$ where $C_i^{D}$ and $C_i^{S}$ are confidence maps for depth and scene coordinates (Deng et al., 25 Aug 2025). Camera pose and intrinsics are predicted from the camera tokens using a PoseHead: $(T_i, K_i) = \mathrm{PoseHead}(\mathbf{t}_i^{g}).$

For query localization, SAIL-Recon modifies the global attention so that query tokens attend to the stored anchor representation: $\mathbf{t}_q^{l} = \mathrm{GlobalAttn}^{l}(\tilde{\mathbf{t}}_q^{l}, \mathcal{R}^{l}),$ where

$\mathcal{R}^{l} = \{\Phi(\tilde{\mathbf{t}}_i^{l})\}_{i=1}^{N}$

is the anchor representation for layer $l$ (Deng et al., 25 Aug 2025). This is implemented with an attention mask rather than a separate cross-attention block. The mask enforces that anchor tokens can fully attend to one another, while query tokens cannot attend to other query frames; they can attend only within the same query frame and to the scene representation $\mathcal{R}$ (Deng et al., 25 Aug 2025). The supplement, as summarized in the provided material, states that the same mask is also used in the pose head (Deng et al., 25 Aug 2025).
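The masking rule can be made concrete with a small boolean matrix. The sketch below assumes a token order with all anchor tokens first, followed by each query frame, and assumes that anchor tokens do not attend to query tokens (which keeps the anchor representation independent of the queries). The function name `build_attention_mask` and the layout are illustrative, not the paper's implementation.

```python
import numpy as np

def build_attention_mask(num_anchor_tokens, query_frame_sizes):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Token order: all anchor tokens first, then the tokens of each query frame.
    """
    total = num_anchor_tokens + sum(query_frame_sizes)
    mask = np.zeros((total, total), dtype=bool)
    # Anchor tokens attend to every anchor token (assumed: never to queries).
    mask[:num_anchor_tokens, :num_anchor_tokens] = True
    start = num_anchor_tokens
    for size in query_frame_sizes:
        end = start + size
        mask[start:end, :num_anchor_tokens] = True  # query -> anchor memory
        mask[start:end, start:end] = True           # query -> its own frame
        start = end
    return mask

# 4 anchor tokens, then two query frames with 2 and 3 tokens.
mask = build_attention_mask(4, [2, 3])
```

Because no query row depends on another query frame, queries can be processed one at a time or in batches with identical results.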

The paper notes that query pose could in principle be estimated from the predicted scene coordinate map $S_q$ via PnP, but avoids this because DPT upsampling to high-resolution scene coordinates is expensive (Deng et al., 25 Aug 2025). Instead, pose is regressed directly from the query camera token and the anchor camera tokens: $(T_q, K_q) = \mathrm{PoseHead}(\mathbf{t}_q^{g}, \{\mathbf{t}_i^{g}\}_{i=1}^{N}).$ Localization is therefore learned as attention-based regression against a latent map rather than explicit feature matching followed by PnP (Deng et al., 25 Aug 2025).

4. Training procedure, objective, and normalization

SAIL-Recon is trained by fine-tuning a pretrained VGGT checkpoint (Deng et al., 25 Aug 2025). During training, each batch contains 4–48 images, of which 2–24 are randomly designated as anchors and the remainder as queries (Deng et al., 25 Aug 2025). Anchor and query images are forwarded together under the masked attention scheme, and the model is trained to reconstruct every frame conditioned on the anchor representation (Deng et al., 25 Aug 2025). Random token subsampling with a randomly drawn ratio $r$ is used during training to make the representation robust to varying memory budgets and scene sizes (Deng et al., 25 Aug 2025).

The training objective is a multitask loss: $\mathcal{L} = \mathcal{L}_{\text{cam}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{coord}}.$ Camera prediction is supervised using

$\mathcal{L}_{\text{cam}} = \sum_{i} \lVert \hat{g}_i - g_i \rVert,$

where

$g_i = [q_i, t_i, f_i]$

contains quaternion rotation $q_i$, translation $t_i$, and field of view $f_i$, and hats denote predictions (Deng et al., 25 Aug 2025). The principal point is assumed to be at the image center (Deng et al., 25 Aug 2025).

Depth supervision is confidence-weighted and includes a gradient term: $\mathcal{L}_{\text{depth}} = \sum_{i} \lVert C_i^{D} \odot (\hat{D}_i - D_i) \rVert + \lVert C_i^{D} \odot (\nabla \hat{D}_i - \nabla D_i) \rVert - \alpha \log C_i^{D}.$ Scene coordinate map supervision has the same structure: $\mathcal{L}_{\text{coord}} = \sum_{i} \lVert C_i^{S} \odot (\hat{S}_i - S_i) \rVert + \lVert C_i^{S} \odot (\nabla \hat{S}_i - \nabla S_i) \rVert - \alpha \log C_i^{S}.$ The paper does not include rendering losses, photometric losses, or explicit correspondence losses in the main training objective (Deng et al., 25 Aug 2025).
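With the principal point fixed at the image center, a field of view is enough to recover a pinhole intrinsic matrix. The sketch below assumes separate horizontal and vertical fields of view given in radians; the function name `intrinsics_from_fov` and this parameterization are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def intrinsics_from_fov(fov_x, fov_y, width, height):
    """Pinhole intrinsic matrix from fields of view given in radians."""
    fx = 0.5 * width / np.tan(0.5 * fov_x)
    fy = 0.5 * height / np.tan(0.5 * fov_y)
    # Principal point fixed at the image center, as the text assumes.
    return np.array([[fx, 0.0, 0.5 * width],
                     [0.0, fy, 0.5 * height],
                     [0.0, 0.0, 1.0]])

# A 90-degree field of view on a 518 px square image gives a focal length of 259 px.
K = intrinsics_from_fov(np.pi / 2, np.pi / 2, 518, 518)
```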

A practical stabilization step is scene normalization. The model randomly selects an anchor image as reference, computes the average Euclidean distance from that camera center to the 3D points of all anchor frames, and uses this scale to normalize the 3D scene, depth maps, and scene coordinate maps (Deng et al., 25 Aug 2025). This suggests that scale normalization is important for transferring one network across scenes with heterogeneous metric extent.
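The normalization step can be sketched in a few lines. The version below only rescales; it does not re-express the scene in the reference camera's frame, and the function name `normalize_scene` and the array layouts are assumptions made for illustration.

```python
import numpy as np

def normalize_scene(points, depths, cam_centers, ref_index):
    """Rescale a scene so the mean point distance from the reference camera is 1.

    points: (K, 3) 3D points gathered from all anchor frames.
    depths: list of (H, W) depth maps.
    cam_centers: (N, 3) anchor camera centers.
    """
    ref_center = cam_centers[ref_index]
    scale = np.linalg.norm(points - ref_center, axis=1).mean()
    points_n = points / scale
    centers_n = cam_centers / scale
    depths_n = [d / scale for d in depths]
    return points_n, depths_n, centers_n, scale

# Toy scene: two points at distances 2 and 4 from a camera at the origin.
points = np.array([[2.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
centers = np.zeros((1, 3))
points_n, depths_n, centers_n, scale = normalize_scene(
    points, [np.full((2, 2), 6.0)], centers, ref_index=0)
```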

Training details reported in the paper are specific. Fine-tuning runs for 30K iterations on 16 NVIDIA A800 GPUs, uses bfloat16 precision and gradient checkpointing, and takes about 4 days (Deng et al., 25 Aug 2025). The optimizer uses a cosine learning-rate schedule with 2K warmup iterations up to the maximum learning rate (Deng et al., 25 Aug 2025). Training data are drawn from a mixed dataset subset similar to VGGT, including CO3Dv2, BlendedMVS, DL3DV, MegaDepth, WildRGB-D, ScanNet++, HyperSim, Mapillary, Replica, MVS-Synth, Virtual KITTI, Aria Synthetic Environments, and Aria Digital Twin, weighted proportionally to relative size (Deng et al., 25 Aug 2025). Input images are resized to a maximum of 518 px, with the aspect ratio kept within a prescribed range and with color jitter, Gaussian blur, and grayscale conversion as augmentations (Deng et al., 25 Aug 2025).

5. Inference procedure and scalability behavior

At inference time, SAIL-Recon first selects anchors, typically 50 on TUM-RGBD, 7-Scenes, and Mip-NeRF 360, and about 100 on Tanks & Temples (Deng et al., 25 Aug 2025). The anchors are jointly processed once to compute the layer-wise token memory

$\mathcal{R} = \{\mathcal{R}^{l}\}_{l=1}^{L},$

with one set of downsampled anchor tokens per Transformer layer, which is then cached; the provided summary states that a KV-cache is used to store $\mathcal{R}$ as keys and values in each global attention layer (Deng et al., 25 Aug 2025). Query images are then processed sequentially or in batches, with query tokens acting as queries and the cached scene representation acting as key-value memory (Deng et al., 25 Aug 2025).
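The key-value caching pattern can be illustrated with a single attention head. The sketch assumes one head, no positional encoding, and random projection matrices; `query_attention` and all sizes are hypothetical and stand in for one global attention layer, not for the actual VGGT block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_attention(q_tokens, cached_k, cached_v, Wq, Wk, Wv):
    """Single-head attention of one query frame over itself plus cached anchors."""
    q = q_tokens @ Wq
    # Keys and values: the frame's own tokens followed by the cached memory.
    k = np.concatenate([q_tokens @ Wk, cached_k], axis=0)
    v = np.concatenate([q_tokens @ Wv, cached_v], axis=0)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
C = 8
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
memory = rng.standard_normal((12, C))            # anchor memory for this layer
cached_k, cached_v = memory @ Wk, memory @ Wv    # projected once, reused per query
q_tokens = rng.standard_normal((5, C))
out = query_attention(q_tokens, cached_k, cached_v, Wq, Wk, Wv)
```

The anchor projections are computed once and reused for every query frame, which is what makes the per-query cost independent of the number of other queries.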

This computational structure explains the method’s scaling behavior. Rather than letting all $M$ images interact with all others through joint global attention, SAIL-Recon pays the full scene-regression cost only on the anchor subset and then performs per-query localization against the compact map (Deng et al., 25 Aug 2025). The result is a feed-forward large-scene SfM pipeline that does not require global bundle adjustment or global alignment for its main reported results (Deng et al., 25 Aug 2025).
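A rough count of attention score pairs per global layer shows why this helps. The figures below are illustrative assumptions: 1000 images and 1000 tokens per image are round hypothetical numbers, while 100 anchors and 300 memory tokens per anchor follow the settings quoted in this article. The count ignores frame-wise attention, heads, and constant factors.

```python
def joint_attention_pairs(num_images, tokens_per_image):
    """Token pairs scored by one global attention layer over all images."""
    total = num_images * tokens_per_image
    return total * total

def anchor_attention_pairs(num_images, num_anchors, tokens_per_image,
                           memory_per_anchor):
    """Pairs scored when only anchors attend jointly and queries read a memory."""
    anchor_cost = joint_attention_pairs(num_anchors, tokens_per_image)
    memory_size = num_anchors * memory_per_anchor
    # Each query token attends to its own frame and to the anchor memory.
    per_query = tokens_per_image * (tokens_per_image + memory_size)
    return anchor_cost + (num_images - num_anchors) * per_query

joint = joint_attention_pairs(1000, 1000)
anchored = anchor_attention_pairs(1000, 100, 1000, 300)
ratio = joint / anchored
```

Under these assumptions the anchored scheme scores roughly 26 times fewer pairs, and its cost grows linearly rather than quadratically in the number of query images.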

The paper does allow an optional post-refinement stage. Instead of bundle adjustment or alignment on pairwise constraints, it applies a BARF-like NeRF pose optimization minimizing rendering loss (Deng et al., 25 Aug 2025). This is not required for the primary feed-forward reconstruction, but can improve downstream novel-view synthesis quality. Reported refinement cost is typically 2–10 min, with about 2.5 min per 10k iterations, and the method is stated to handle scenes with more than 10K images in this regime (Deng et al., 25 Aug 2025).

The runtime numbers given in the paper are an important part of the method’s identity. On Tanks & Temples pose estimation, averaged over scenes with 150–1100 images, SAIL-Recon takes 81 s, compared with 63 s for Light3R-SfM, 238 s for VGGT-SLAM, 42 s for CUT3R, and 70 s for SLAM3R (Deng et al., 25 Aug 2025). On 7-Scenes localization, scenes of around 4000 frames are reconstructed in 8 min, versus 2 h for ACE0 (Deng et al., 25 Aug 2025). On Mip-NeRF 360, average runtime is 5 min, versus 8 h for ACE0 (Deng et al., 25 Aug 2025). For Tanks & Temples image-set scenes, the method reconstructs camera poses in 3–4 min on average (Deng et al., 25 Aug 2025).

A plausible implication is that SAIL-Recon is not the absolute fastest feed-forward method in all regimes, but it targets a specific operating point: large-scene scalability with strong pose accuracy and downstream rendering quality, while remaining largely optimization-free in the main path.

6. Evaluation, empirical performance, and limitations

The empirical evaluation spans camera pose estimation, localization, and pose-sensitive view synthesis (Deng et al., 25 Aug 2025). On Tanks & Temples pose estimation, the paper reports rotation accuracy, translation accuracy, and ATE for feed-forward SAIL-Recon, at a runtime of 81 s.

These are the best feed-forward results among the compared methods in rotation and translation accuracy, while tying the best ATE (Deng et al., 25 Aug 2025). With optional optimization, SAIL-Recon-OPT improves on these metrics at a runtime of 233 s and becomes competitive with optimization-based systems such as GLOMAP (Deng et al., 25 Aug 2025).

On TUM-RGBD, SAIL-Recon achieves average RMSE ATE of 0.051, compared with 0.158 for DROID-SLAM, 0.060 for MASt3R-SLAM, and 0.074 / 0.053 for VGGT-SLAM variants (Deng et al., 25 Aug 2025). On 7-Scenes localization, it obtains 93.8% of poses under the $5\,\text{cm} / 5^{\circ}$ threshold, matching ACE0 while reducing runtime from 2 h to 8 min (Deng et al., 25 Aug 2025). On Mip-NeRF 360, using SAIL-Recon poses to train Nerfacto yields average PSNR 24.77, essentially matching or slightly exceeding the pseudo-ground-truth COLMAP reference at 24.7 and substantially outperforming DROID-SLAM, BARF, NoPe-NeRF, and ACE0 (Deng et al., 25 Aug 2025).
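A threshold-based localization score of the kind quoted for 7-Scenes can be computed as sketched below. The code assumes camera-to-world poses with translations in metres and uses 5 degrees and 5 cm as default thresholds, a common choice on that benchmark; the function names and defaults are illustrative and not taken from the paper's evaluation code.

```python
import numpy as np

def pose_errors(T_pred, T_gt):
    """Rotation error in degrees and translation error in the poses' length unit."""
    R_rel = T_pred[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    trans_err = np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3])
    return rot_err_deg, trans_err

def fraction_within(preds, gts, max_deg=5.0, max_trans=0.05):
    """Fraction of poses whose rotation and translation errors are both in bounds."""
    hits = [r <= max_deg and t <= max_trans
            for r, t in (pose_errors(p, g) for p, g in zip(preds, gts))]
    return float(np.mean(hits))

# One exact pose and one pose rotated by 10 degrees about the z axis.
a = np.radians(10.0)
T_rot = np.eye(4)
T_rot[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
score = fraction_within([np.eye(4), T_rot], [np.eye(4), np.eye(4)])
```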

The ablation results clarify which design choices matter. On CO3Dv2, when starting from 10 total images and reducing the anchor count, mAA@30 degrades from 88.1 to 87.3, 85.0, and finally 78.5, while RRA@15 and RTA@15 remain strong even with only two anchors at 96.4 and 89.7 respectively (Deng et al., 25 Aug 2025). Token count per anchor improves accuracy as it increases, but 300 tokens per image is chosen as the preferred trade-off (Deng et al., 25 Aug 2025). The variable-token training strategy also outperforms fixed token count and average pooling on CO3Dv2, with mAA@30 / mAA@5 of 87.3 / 57.6 versus 86.5 / 53.6 and 86.5 / 53.5 respectively (Deng et al., 25 Aug 2025).

The stated limitations are explicit. The paper notes that global pose estimation in a fixed reference coordinate system can hurt some sequences, and suggests better view selection criteria as a future direction (Deng et al., 25 Aug 2025). It also notes that uniform anchor sampling may miss important scene regions, especially in large or diverse scenes, and suggests coverage-aware anchor selection for future work (Deng et al., 25 Aug 2025). The provided summary additionally notes likely difficulties in textureless scenes such as the TUM “floor” sequence, scenarios with large viewpoint gaps between queries and anchors, and cases where uniform anchor sampling under-covers the scene (Deng et al., 25 Aug 2025). Dynamic content, repetitive texture, and very sparse overlap are not extensively analyzed in the paper text provided.

Taken together, these results define SAIL-Recon less as a new geometry representation than as a deployment-oriented reformulation of scene-regression SfM. Its principal contribution is to show that scene regression can be scaled by turning a small anchor subset into a reusable neural map and using that map for localization-conditioned reconstruction of the remaining images (Deng et al., 25 Aug 2025). This suggests a broader methodological lesson: large-scale learned SfM may benefit less from processing all images jointly than from learning the boundary between map construction and localization in a single architecture.


References (1)

SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization (2025)
