AnyRecon: Sparse 3D Reconstruction via Video Diffusion

Updated 4 July 2026

AnyRecon is a sparse-view 3D reconstruction framework that leverages a video diffusion model to generate geometrically controlled novel views from arbitrary input images.
It employs an iterative loop between view generation and explicit 3D geometry updates, ensuring accurate reconstruction from irregular and sparse captures.
The system integrates global scene memory and geometry-aware retrieval to mitigate drift and maintain high fidelity over long trajectories and large scenes.

AnyRecon is a sparse-view 3D reconstruction framework that uses a video diffusion model to generate geometrically controlled novel views from arbitrary, unordered input images, then feeds those generated views back into an explicit 3D reconstruction process. It is designed for 3D reconstruction from sparse, irregularly captured views, with an emphasis on flexible conditioning cardinality, explicit geometric control, and scalability to large scenes and long trajectories. The method turns reconstruction into an iterative loop between novel-view generation and explicit 3D geometry update, rather than treating generation and reconstruction as disjoint stages (Chen et al., 21 Apr 2026).

1. Scope and problem formulation

AnyRecon addresses reconstruction from sparse, irregularly captured views. The paper characterizes this setting as difficult because the number of input capture views is not fixed, the views are not sequential in time or viewpoint, viewpoint gaps can be very large, and large scenes or long trajectories cannot be processed as one monolithic sequence. In this setting, classical neural reconstruction methods such as NeRF and 3D Gaussian Splatting degrade badly when observations are sparse, while earlier diffusion-based approaches often condition on only one or two RGB capture frames or rely too heavily on implicit geometric consistency (Chen et al., 21 Apr 2026).

The system is therefore positioned against two specific limitations. First, many geometry-aware video diffusion methods condition on point-cloud renderings plus only one or two RGB capture frames, which weakens appearance fidelity and scene-level context. Second, some video diffusion methods condition only on RGB images and camera poses, leaving pose alignment and spatial consistency to be learned implicitly. AnyRecon targets the harder setting of arbitrary and unordered sparse-input conditioning, and the paper argues that this requires both a stronger conditional generator and a tighter coupling between generation and geometry update.

A central technical motivation concerns temporally causal latent compression. The paper argues that standard video diffusion backbones are poorly matched to sparse-view reconstruction because adjacent “frames” in the model may correspond to widely separated viewpoints rather than smooth temporal evolution. This creates a mismatch between video-generation assumptions and geometric reconstruction requirements, especially when frame-level correspondence under large viewpoint change is necessary.

2. Iterative reconstruction loop and scene memory

AnyRecon organizes the full system around a closed loop between generation and reconstruction. Sparse captured views are stored as a capture view bank, denoted $\mathcal{I}_{cap}$. An initial point-cloud-like geometry is estimated from these captures using a feed-forward geometry estimator, specifically VGGT or $\pi^3$, producing an initial 3D geometry memory $\mathcal{M}_{geo}$. For a desired novel camera trajectory $V_{novel}$, the trajectory is split into manageable segments, and each segment is processed using the current geometry memory (Chen et al., 21 Apr 2026).

The segment-level pipeline has a fixed structure. Geometry-aware retrieval selects the most useful capture views from $\mathcal{I}_{cap}$. The current geometry memory is rendered into the target viewpoints to produce geometric conditions $I_{render}$ and visibility masks $M_t$. These rendered geometric priors, together with the selected capture RGBs, are fed into a modified video diffusion transformer that supports arbitrary, unordered conditioning. The model then synthesizes novel RGB views $\hat{I}_{novel}$ for the current segment. Geometry is reconstructed from the newly generated views together with the original views, and the resulting geometry replaces or updates $\mathcal{M}_{geo}$ for subsequent segments.
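As a rough illustration of how a point-cloud memory can be rendered into a target view, the sketch below projects colored points through a pinhole camera with a simple z-buffer and returns a rendered image together with a binary visibility mask. The helper name `render_point_cloud`, the world-to-camera convention, and the one-pixel splat are assumptions made for this example; the renderer actually used by AnyRecon is not described at this level of detail.

```python
import numpy as np

def render_point_cloud(points, colors, K, R, t, height, width):
    """Project world-space points into a pinhole camera (hypothetical helper).

    points: (N, 3) world coordinates; colors: (N, 3) RGB in [0, 1].
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns (image, mask): image is (H, W, 3); mask is (H, W) with 1 where
    a point landed on the pixel and 0 where the geometry has a hole.
    """
    cam = points @ R.T + t                      # world -> camera frame
    in_front = cam[:, 2] > 1e-6                 # drop points behind the camera
    cam, colors = cam[in_front], colors[in_front]
    proj = cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, colors = u[inside], v[inside], cam[inside, 2], colors[inside]

    image = np.zeros((height, width, 3))
    mask = np.zeros((height, width), dtype=np.uint8)
    # Draw far points first so nearer points overwrite them (simple z-buffer).
    for i in np.argsort(-z):
        image[v[i], u[i]] = colors[i]
        mask[v[i], u[i]] = 1
    return image, mask
```

Pixels where the mask is 0 are the regions the current geometry cannot explain, which is where the generator has to synthesize content rather than follow the rendered prior.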

The core objects in the framework are compact and explicit.

Object	Meaning	Role
$\mathcal{I}_{cap}$	capture view bank	stores sparse input views
retrieved subset of $\mathcal{I}_{cap}$	selected conditioning views	chunk-level retrieved references
$\mathcal{M}_{geo}$	global geometry memory	geometry prior and update target
$I_{render}$	rendered geometric priors	explicit geometry conditioning
$M_t$	visibility masks	visibility-aware conditioning
$V_{novel}$	target novel trajectory	generation target
$\hat{I}_{novel}$	generated novel views	pseudo-observations for update

This design makes the geometry memory more than a reconstruction output. It is rendered into target views, supports visibility reasoning, drives conditioning-view retrieval, anchors future segments to a coherent global structure, and is refreshed after each generation step. A plausible implication is that AnyRecon treats geometry not as an endpoint but as an active control variable for generation.

3. Unordered contextual video diffusion

The central generative component is an unordered contextual video diffusion model. Its first major modification is a persistent global scene memory, implemented by prepending the retrieved reference views to the beginning of each chunk. These prepended frames serve as a persistent global key-value memory cache within the video diffusion transformer. Capture views are therefore not treated as temporally adjacent neighbors of the target frames; instead, they operate as a queryable memory that target frames can attend to during denoising (Chen et al., 21 Apr 2026).

The second modification is the removal of temporal compression. AnyRecon replaces temporally compressive latent encoders with a frame-wise 2D VAE, which the paper calls Non-Compressive Latent Encoding. The stated motivation is that sparse-view inputs are not smooth over time or viewpoint, temporal pooling entangles distant viewpoints, and reconstruction requires a strong one-to-one relationship between latent tokens and pixels. The paper attributes improved frame-level correspondence under large viewpoint change to this decision.
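The effect of temporal compression on frame correspondence can be made concrete with a small index-mapping sketch. It assumes a causal layout in which the first input frame gets its own latent and each later latent summarizes a fixed number of consecutive input frames; that layout and the `temporal_stride` parameter are assumptions for illustration, not details taken from the paper. A stride of 1 corresponds to the frame-wise, non-compressive case.

```python
def latent_to_input_frames(num_frames, temporal_stride):
    """Map each latent frame to the input frames it summarizes.

    Assumes a causal layout: input frame 0 has its own latent, and every
    later latent covers `temporal_stride` consecutive input frames.
    """
    groups = [[0]]
    for start in range(1, num_frames, temporal_stride):
        end = min(start + temporal_stride, num_frames)
        groups.append(list(range(start, end)))
    return groups

print(latent_to_input_frames(9, 4))  # [[0], [1, 2, 3, 4], [5, 6, 7, 8]]
print(latent_to_input_frames(5, 1))  # [[0], [1], [2], [3], [4]]
```

With sparse captures, the frames grouped under one latent may come from widely separated viewpoints, which is the entanglement the frame-wise encoder avoids.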

Geometry is injected directly into the diffusion backbone. For each target view, the target noisy latents are concatenated along the channel dimension with the rendered point-cloud observations $I_{render}$ and their corresponding visibility masks $M_t$. This gives explicit geometric control rather than relying only on RGB and camera poses. In architectural terms, the method is not merely a view synthesizer conditioned on images; it is a geometry-conditioned video diffusion system.
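A minimal sketch of the channel-wise concatenation follows. The tensor shapes, the assumption that the renderings are already encoded to latent resolution, and the single-channel mask are all illustrative choices rather than reported details; `build_denoiser_input` is a hypothetical helper name.

```python
import numpy as np

def build_denoiser_input(noisy_latents, render_latents, visibility_masks):
    """Concatenate per-frame conditioning along the channel axis.

    noisy_latents:    (T, C, H, W) noisy target latents
    render_latents:   (T, C, H, W) encoded point-cloud renderings (I_render)
    visibility_masks: (T, 1, H, W) masks (M_t) at latent resolution
    All shapes are assumptions made for this example.
    """
    frames = {noisy_latents.shape[0], render_latents.shape[0], visibility_masks.shape[0]}
    spatial = {noisy_latents.shape[2:], render_latents.shape[2:], visibility_masks.shape[2:]}
    assert len(frames) == 1 and len(spatial) == 1, "frame count and spatial size must match"
    return np.concatenate([noisy_latents, render_latents, visibility_masks], axis=1)

T, C, H, W = 4, 16, 8, 8  # placeholder sizes
x = build_denoiser_input(
    np.random.randn(T, C, H, W),
    np.random.randn(T, C, H, W),
    np.ones((T, 1, H, W)),
)
print(x.shape)  # (4, 33, 8, 8)
```

Because the conditioning is stacked per frame, each target latent sees the geometry prior for its own viewpoint, which is what makes the control explicit rather than learned from poses alone.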

The implementation uses Wan2.1-I2V-14B with LoRA rank 32. Reported training stages are full self-attention fine-tuning for 100k iterations, sparse-attention warm-up for 10k iterations, and DMD2 distillation for 30k iterations. Sparse attention is block-based: each frame attends to the retrieved geometry-aligned reference views, the 8 preceding frames, and the 8 succeeding frames.
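The attention pattern described above can be sketched at frame granularity as a boolean mask. This is only an illustration: it ignores the token-block structure of the actual sparse attention, and it assumes that the prepended reference frames attend only to one another, which the text does not specify. The function name `frame_attention_mask` is hypothetical.

```python
def frame_attention_mask(num_refs, num_targets, window=8):
    """Frame-level mask: mask[q][k] is True if frame q may attend to frame k.

    Frames 0..num_refs-1 are the prepended reference views; the remaining
    frames are targets. Reference frames attending only to references is an
    assumption of this sketch.
    """
    total = num_refs + num_targets
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_refs:
                mask[q][k] = True   # every frame sees the reference cache
            elif q >= num_refs and abs(q - k) <= window:
                mask[q][k] = True   # targets see a local window of targets
    return mask

mask = frame_attention_mask(num_refs=3, num_targets=40)
print(sum(mask[23]))  # 20: 3 references + 8 before + itself + 8 after
```

The per-frame cost then depends on the window and the number of references rather than on the full segment length.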

4. Geometry-aware retrieval and memory update

AnyRecon couples generation and reconstruction through an explicit 3D geometric memory. The geometry memory $\mathcal{M}_{geo}$ is an incrementally updated point cloud reconstructed initially from the original sparse captures. After each generated segment, the system re-estimates geometry from the generated views together with the original views using $\pi^3$, and this newly reconstructed geometry then replaces the existing memory $\mathcal{M}_{geo}$. The method thus refreshes rather than merely accumulates geometry, and the paper presents this as a mechanism for mitigating geometric drift in long trajectories (Chen et al., 21 Apr 2026).

Capture-view retrieval is driven by geometric contribution under the current target viewpoint, not by image similarity or simple field-of-view heuristics. Let the set of capture views and poses be

$\mathcal{I}_{cap} = \{(I_i, P_i)\}_{i=1}^{N}$

For a target novel view, the current geometry memory is rendered to produce a visibility index map identifying, for each visible point, which source view it originated from during reconstruction. The relevance score for capture view $i$ is

$s_i = |\mathcal{P}_{vis}^{(i)}| \,/\, |\mathcal{P}_{vis}|$

where $\mathcal{P}_{vis}$ is the set of geometry points visible from the target viewpoint and $\mathcal{P}_{vis}^{(i)}$ is the subset of points in $\mathcal{P}_{vis}$ reconstructed from capture view $i$. The top-$k$ views under $s_i$ are then chosen as the conditioning set for the segment. The symbols in this paragraph are shorthand for exposition; since only the ranking induced by $s_i$ affects retrieval, normalizing by $|\mathcal{P}_{vis}|$ does not change which views are selected.

This retrieval rule makes the conditioning set occlusion-aware and explicitly grounded in scene geometry. The visible overlap is computed relative to the current target viewpoint, so the retrieved references are those with the highest geometric contribution to the specific segment being generated. This suggests that AnyRecon uses retrieval not as a generic memory lookup but as a view-dependent geometric selection mechanism.
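The retrieval rule can be sketched as a counting problem over the visibility index map. In the example below the map is a small 2D list of source-view ids, `-1` marks pixels with no visible geometry, and scores are normalized by the number of visible points; the function name `retrieve_views`, the background convention, and the tie-breaking by view id are assumptions made for illustration.

```python
from collections import Counter

def retrieve_views(visibility_index_map, top_k, background=-1):
    """Rank capture views by their share of the visible geometry.

    visibility_index_map: 2D list; each entry is the id of the capture view
    whose reconstructed point is visible at that pixel, or `background`.
    Returns (selected_ids, scores), where scores sum to 1 over the views
    that contribute at least one visible point.
    """
    counts = Counter(
        view for row in visibility_index_map for view in row if view != background
    )
    total = sum(counts.values())
    scores = {view: n / total for view, n in counts.items()} if total else {}
    ranked = sorted(scores, key=lambda view: (-scores[view], view))
    return ranked[:top_k], scores

index_map = [
    [0, 0, 2, -1],
    [0, 2, 2, 2],
    [1, -1, 2, 2],
]
selected, scores = retrieve_views(index_map, top_k=2)
print(selected)    # [2, 0]
print(scores[2])   # 0.6
```

A capture view whose points are hidden behind other geometry contributes nothing to the map and therefore scores zero, which is how the rule stays occlusion-aware without any explicit occlusion test.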

The system also derives rendered geometric priors and visibility masks from $\mathcal{M}_{geo}$. Generated views are then fed back into the geometry estimation stage together with original captures, so the synthesized frames function as pseudo-observations that enlarge scene coverage and improve later conditioning.

5. Training and inference procedure

Training uses DL3DV-10K, a large-scale indoor/outdoor 3D scene dataset. Original videos are partitioned into 40-frame clips. To simulate arbitrary and unordered sparse inputs, the first frame is always used as a base reference and additional conditioning views are randomly selected. With 50% probability, the additional views are sampled from the first 20 frames, and with 50% probability they are sampled from the full 40-frame window. The selected views are processed by $\pi^3$ to initialize $\mathcal{M}_{geo}$, which is then projected into target viewpoints to form $I_{render}$ and $M_t$ (Chen et al., 21 Apr 2026).
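The conditioning-view sampling scheme can be sketched as follows. The number of extra views is left as a free parameter `num_extra`, and sampling without replacement while excluding the base frame is an assumption of this sketch; `sample_conditioning_frames` is a hypothetical helper name.

```python
import random

def sample_conditioning_frames(num_extra, clip_len=40, near_len=20, rng=random):
    """Pick conditioning frame indices for one training clip.

    Frame 0 is always the base reference. With probability 0.5 the extra
    views come from the first `near_len` frames, otherwise from the whole
    clip. `num_extra` is a free parameter here, not a reported value.
    """
    pool_end = near_len if rng.random() < 0.5 else clip_len
    candidates = list(range(1, pool_end))        # exclude the base frame
    extras = rng.sample(candidates, num_extra)   # random order, no repeats
    return [0] + extras

rng = random.Random(0)
print(sample_conditioning_frames(3, rng=rng))
```

Keeping the sampled order unsorted mirrors the unordered setting: the model cannot rely on conditioning views arriving in viewpoint or temporal order.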

Optimization uses AdamW on 64 NVIDIA A800 GPUs, with separate learning rates for the initial training stages and for distillation. The paper’s main mathematical diffusion content concerns 4-step Distribution Matching Distillation: the continuous noise schedule is discretized into a fixed set of timesteps, the student is trained to generate in 4 denoising steps, and generator and critic are trained with alternating updates.

At test time, arbitrary sparse captures are collected into $\mathcal{I}_{cap}$, a feed-forward point-map estimator such as VGGT or $\pi^3$ constructs the initial $\mathcal{M}_{geo}$, and a user specifies the target camera path $V_{novel}$. The trajectory is chunked into segments. For each segment, the method computes visibility-aware retrieval scores, selects the top-$k$ conditioning capture views, and renders $\mathcal{M}_{geo}$ into the target viewpoints to obtain $I_{render}$ and $M_t$. It then prepends the selected views as the capture-view cache, generates the segment RGB views $\hat{I}_{novel}$ using the 4-step distilled diffusion model with sparse attention, reconstructs updated geometry from the generated views plus the original captures using $\pi^3$, and replaces or updates $\mathcal{M}_{geo}$. After all segments are processed, the resulting geometry memory serves as the final reconstructed 3D scene representation.
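The inference procedure can be summarized as a control-flow sketch. The four callables stand in for the geometry estimator, the retrieval scorer, the renderer, and the distilled diffusion model; their names and signatures are hypothetical, and re-estimating geometry from all views generated so far (rather than only the latest segment) is an assumption of this sketch.

```python
def reconstruct(captures, trajectory, segment_len, top_k,
                estimate_geometry, score_views, render, generate):
    """Closed-loop sketch; the four callables are hypothetical stand-ins.

    estimate_geometry(views) -> geometry memory
    score_views(memory, poses) -> {capture index: relevance score}
    render(memory, poses) -> (rendered priors, visibility masks)
    generate(references, renders, masks) -> list of generated views
    """
    captures = list(captures)
    memory = estimate_geometry(captures)                # initial geometry memory
    generated = []
    for start in range(0, len(trajectory), segment_len):
        poses = trajectory[start:start + segment_len]   # one trajectory segment
        scores = score_views(memory, poses)
        best = sorted(scores, key=scores.get, reverse=True)[:top_k]
        references = [captures[i] for i in best]        # capture-view cache
        renders, masks = render(memory, poses)
        generated += generate(references, renders, masks)
        # Refresh the memory from the captures plus the generated views.
        memory = estimate_geometry(captures + generated)
    return memory, generated
```

The point of the structure is that every segment after the first is conditioned on geometry that already reflects earlier generated views, so errors are corrected by re-estimation instead of being carried forward unchanged.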

6. Empirical evaluation and ablations

Evaluation is performed on 10 scenes from DL3DV-Evaluation and 5 scenes from Tanks and Temples. For each test sequence, 40 frames are sampled, and Tanks and Temples sequences are additionally subsampled in time to create sparse-view difficulty. Two settings are used: interpolation, which conditions on three frames of the sequence, and extrapolation, which conditions on four. Baselines are Difix3D+, ViewCrafter, and Uni3C. Metrics are PSNR, SSIM, LPIPS, and runtime per sequence (Chen et al., 21 Apr 2026).

The reported results for both datasets and both settings are as follows.

Dataset	Setting	Method	PSNR	SSIM	LPIPS	Runtime (s)
DL3DV	interpolation	Difix3D+	17.88	0.551	0.290	1200
DL3DV	interpolation	ViewCrafter	15.86	0.463	0.394	170
DL3DV	interpolation	Uni3C	16.33	0.471	0.319	340
DL3DV	interpolation	AnyRecon	20.95	0.656	0.151	105
DL3DV	extrapolation	Difix3D+	18.74	0.576	0.261	1200
DL3DV	extrapolation	ViewCrafter	15.51	0.459	0.406	170
DL3DV	extrapolation	Uni3C	15.69	0.457	0.344	340
DL3DV	extrapolation	AnyRecon	21.16	0.660	0.158	105
Tanks and Temples	interpolation	Difix3D+	19.43	0.629	0.163	1200
Tanks and Temples	interpolation	ViewCrafter	15.85	0.474	0.364	170
Tanks and Temples	interpolation	Uni3C	16.77	0.514	0.263	340
Tanks and Temples	interpolation	AnyRecon	20.37	0.639	0.158	105
Tanks and Temples	extrapolation	Difix3D+	18.67	0.594	0.190	1200
Tanks and Temples	extrapolation	ViewCrafter	15.83	0.481	0.361	170
Tanks and Temples	extrapolation	Uni3C	16.54	0.502	0.274	340
Tanks and Temples	extrapolation	AnyRecon	20.30	0.629	0.181	105
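To make the size of the gains easier to read, the short script below computes the PSNR margin of AnyRecon over the strongest baseline in each setting. The PSNR values are copied from the reported results; the helper `margin_over_best_baseline` is introduced only for this illustration.

```python
# PSNR values copied from the reported results (higher is better).
psnr = {
    ("DL3DV", "interpolation"): {
        "Difix3D+": 17.88, "ViewCrafter": 15.86, "Uni3C": 16.33, "AnyRecon": 20.95},
    ("DL3DV", "extrapolation"): {
        "Difix3D+": 18.74, "ViewCrafter": 15.51, "Uni3C": 15.69, "AnyRecon": 21.16},
    ("Tanks and Temples", "interpolation"): {
        "Difix3D+": 19.43, "ViewCrafter": 15.85, "Uni3C": 16.77, "AnyRecon": 20.37},
    ("Tanks and Temples", "extrapolation"): {
        "Difix3D+": 18.67, "ViewCrafter": 15.83, "Uni3C": 16.54, "AnyRecon": 20.30},
}

def margin_over_best_baseline(table, ours="AnyRecon"):
    """Return the PSNR gap (in dB) between `ours` and the best other method."""
    margins = {}
    for setting, row in table.items():
        best = max(value for name, value in row.items() if name != ours)
        margins[setting] = round(row[ours] - best, 2)
    return margins

for setting, gap in margin_over_best_baseline(psnr).items():
    print(setting, f"+{gap:.2f} dB")
```

In every setting the strongest baseline by PSNR is Difix3D+, which is also the slowest method at 1200 s per sequence. The margins are 3.07 dB and 2.42 dB on DL3DV, and 0.94 dB and 1.63 dB on Tanks and Temples.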

The ablation results isolate several architectural choices, reported as PSNR / SSIM / LPIPS / runtime. For temporal compression (TC) on DL3DV interpolation at 50 diffusion steps, Full TC yields 20.16 / 0.616 / 0.179 / 210+(15) s, Partial TC yields 21.10 / 0.661 / 0.153 / 270+(15) s, and w/o TC yields 21.57 / 0.687 / 0.140 / 1820+(15) s. This supports the claim that removing temporal compression improves fine geometric detail and frame correspondence, but it also introduces major computational cost unless paired with distillation and sparse attention.

For efficiency, the no-TC model takes 1820 s with 50-step full attention, 140 s with 4-step distilled full attention (PSNR 21.32), and 90 s with 4-step distilled sparse attention (PSNR 20.95). The paper reports that combining distillation and sparse attention yields a large generation speedup over vanilla diffusion; the runtimes above correspond to about $20\times$ (1820 s versus 90 s). For global scene memory, the ablation reports PSNR / SSIM / LPIPS of 20.18 / 0.634 / 0.205 without Global Scene Memory and 20.95 / 0.656 / 0.151 with it. The paper also provides qualitative evidence that without geometry memory update, later segments become inconsistent, and that geometry-driven retrieval excludes occluded or geometrically irrelevant capture views better than FOV or similarity heuristics.
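The speedups implied by the efficiency ablation follow directly from the runtimes quoted above; the snippet below just carries out the division, using only those three numbers.

```python
# Generation runtimes (seconds) for the model without temporal compression.
runtimes_s = {
    "50-step, full attention": 1820,
    "4-step distilled, full attention": 140,
    "4-step distilled, sparse attention": 90,
}

baseline = runtimes_s["50-step, full attention"]
speedups = {name: baseline / seconds for name, seconds in runtimes_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# 50-step, full attention: 1.0x
# 4-step distilled, full attention: 13.0x
# 4-step distilled, sparse attention: 20.2x
```

Distillation accounts for most of the reduction (13x), and sparse attention adds a further factor of about 1.6 at a PSNR cost of 0.37 dB (21.32 versus 20.95).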

7. Relation to adjacent reconstruction systems and limitations

Within the broader reconstruction literature, AnyRecon occupies a scene-level sparse-view position that differs from both simulation-ready compositional reconstruction and automated salient-object reconstruction. SimRecon reconstructs cluttered indoor RGB videos into an object-centric, simulator-ready scene assembled through a “Perception–Generation–Simulation” pipeline, whereas AnyRecon uses a video diffusion model plus explicit geometry memory to reconstruct from arbitrary and unordered sparse inputs (Xia et al., 2 Mar 2026). AutoRecon, by contrast, is a fully automated pipeline for discovering and reconstructing a salient foreground object from object-centric multi-view images using SfM, self-supervised ViT features, and decomposed neural scene representations (Wang et al., 2023). This comparison suggests that AnyRecon is centered on arbitrary-view scene reconstruction under sparse observation, rather than simulation-ready compositional assembly or salient-object isolation.

The method’s reported strengths are support for arbitrary and unordered sparse conditioning views, preservation of explicit geometric control via point-cloud renderings and visibility masks, scalability to long trajectories by chunked generation plus persistent geometry memory, and improved fidelity and efficiency relative to prior diffusion-based baselines. The reported gains are strongest in the regimes the paper targets: irregular sparse inputs, large viewpoint gaps, interpolation and extrapolation, and long trajectories or large scenes.

The stated limitations are tied to the quality of the 3D geometric memory. The paper says the method is robust to minor inaccuracies such as pose misalignment, noise, and artifacts, but still requires basic structural coherence. In extreme cases with minimal view overlap, the initial reconstruction may fail, leading to weak guidance for diffusion and poor synthesis. The paper does not discuss dynamic scenes, reflective materials, or memory growth overhead in detail. A plausible implication is that AnyRecon’s explicit geometric control yields strong conditioning when the evolving point-map geometry is coherent, but also ties performance to the reliability of that geometry throughout the iterative loop.