
AnyRecon: Sparse 3D Reconstruction via Video Diffusion

Updated 4 July 2026
  • AnyRecon is a sparse-view 3D reconstruction framework that leverages a video diffusion model to generate geometrically controlled novel views from arbitrary input images.
  • It employs an iterative loop between view generation and explicit 3D geometry updates, ensuring accurate reconstruction from irregular and sparse captures.
  • The system integrates global scene memory and geometry-aware retrieval to mitigate drift and maintain high fidelity over long trajectories and large scenes.

AnyRecon is a sparse-view 3D reconstruction framework that uses a video diffusion model to generate geometrically controlled novel views from arbitrary, unordered input images, then feeds those generated views back into an explicit 3D reconstruction process. It is designed for 3D reconstruction from sparse, irregularly captured views, with an emphasis on flexible conditioning cardinality, explicit geometric control, and scalability to large scenes and long trajectories. The method turns reconstruction into an iterative loop between novel-view generation and explicit 3D geometry update, rather than treating generation and reconstruction as disjoint stages (Chen et al., 21 Apr 2026).

1. Scope and problem formulation

AnyRecon addresses reconstruction from sparse, irregularly captured views. The paper characterizes this setting as difficult because the number of input capture views is not fixed, the views are not sequential in time or viewpoint, viewpoint gaps can be very large, and large scenes or long trajectories cannot be processed as one monolithic sequence. In this setting, classical neural reconstruction methods such as NeRF and 3D Gaussian Splatting degrade badly when observations are sparse, while earlier diffusion-based approaches often condition on only one or two RGB capture frames or rely too heavily on implicit geometric consistency (Chen et al., 21 Apr 2026).

The system is therefore positioned against two specific limitations. First, many geometry-aware video diffusion methods condition on point-cloud renderings plus only one or two RGB capture frames, which weakens appearance fidelity and scene-level context. Second, some video diffusion methods condition only on RGB images and camera poses, leaving pose alignment and spatial consistency to be learned implicitly. AnyRecon targets the harder setting of arbitrary and unordered sparse-input conditioning, and the paper argues that this requires both a stronger conditional generator and a tighter coupling between generation and geometry update.

A central technical motivation concerns temporally causal latent compression. The paper argues that standard video diffusion backbones are poorly matched to sparse-view reconstruction because adjacent “frames” in the model may correspond to widely separated viewpoints rather than smooth temporal evolution. This creates a mismatch between video-generation assumptions and geometric reconstruction requirements, especially when frame-level correspondence under large viewpoint change is necessary.

2. Iterative reconstruction loop and scene memory

AnyRecon organizes the full system around a closed loop between generation and reconstruction. Sparse captured views are stored as a capture view bank, denoted $\mathcal{I}_{cap}$. An initial point-cloud-like geometry is estimated from these captures using a feed-forward geometry estimator, specifically VGGT or $\pi^3$, producing an initial 3D geometry memory $\mathcal{M}_{geo}$. For a desired novel camera trajectory $V_{novel}$, the trajectory is split into manageable segments, and each segment is processed using the current geometry memory (Chen et al., 21 Apr 2026).

The segment-level pipeline has a fixed structure. Geometry-aware retrieval selects the most useful capture views from $\mathcal{I}_{cap}$. The current geometry memory is rendered into the target viewpoints to produce geometric conditions $I_{render}$ and visibility masks $M_t$. These rendered geometric priors, together with the selected capture RGBs, are fed into a modified video diffusion transformer that supports arbitrary, unordered conditioning. The model then synthesizes novel RGB views $\hat{I}_{novel}$ for the current segment. Geometry is reconstructed from the newly generated views together with the original views, and the resulting geometry replaces or updates $\mathcal{M}_{geo}$ for subsequent segments.

The core objects in the framework are compact and explicit.

| Object | Meaning | Role |
| --- | --- | --- |
| $\mathcal{I}_{cap}$ | capture view bank | stores sparse input views |
| retrieved subset of $\mathcal{I}_{cap}$ | selected conditioning views | chunk-level retrieved references |
| $\mathcal{M}_{geo}$ | global geometry memory | geometry prior and update target |
| $I_{render}$ | rendered geometric priors | explicit geometry conditioning |
| $M_t$ | visibility masks | visibility-aware conditioning |
| $V_{novel}$ | target novel trajectory | generation target |
| $\hat{I}_{novel}$ | generated novel views | pseudo-observations for update |

This design makes the geometry memory more than a reconstruction output. It is rendered into target views, supports visibility reasoning, drives conditioning-view retrieval, anchors future segments to a coherent global structure, and is refreshed after each generation step. A plausible implication is that AnyRecon treats geometry not as an endpoint but as an active control variable for generation.

3. Unordered contextual video diffusion

The central generative component is an unordered contextual video diffusion model. Its first major modification is a persistent global scene memory implemented by prepending the retrieved reference views, selected from $\mathcal{I}_{cap}$, to the beginning of each chunk. These prepended frames serve as a persistent global key-value memory cache within the video diffusion transformer. Capture views are therefore not treated as temporally adjacent neighbors of the target frames; instead, they operate as a queryable memory that target frames can attend to during denoising (Chen et al., 21 Apr 2026).

The second modification is the removal of temporal compression. AnyRecon replaces temporally compressive latent encoders with a frame-wise 2D VAE, which the paper calls Non-Compressive Latent Encoding. The stated motivation is that sparse-view inputs are not smooth over time or viewpoint, temporal pooling entangles distant viewpoints, and reconstruction requires a strong one-to-one relationship between latent tokens and pixels. The paper attributes improved frame-level correspondence under large viewpoint change to this decision.

Geometry is injected directly into the diffusion backbone. For each target view, the target noisy latents are concatenated along the channel dimension with the rendered point-cloud observations $I_{render}$ and their corresponding visibility masks $M_t$. This gives explicit geometric control rather than relying only on RGB and camera poses. In architectural terms, the method is not merely a view synthesizer conditioned on images; it is a geometry-conditioned video diffusion system.
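The channel-wise conditioning described above can be illustrated with a minimal, shape-level sketch. The function name, the nested-list tensor layout, and the channel counts are assumptions made for illustration; in particular, whether the rendered observations are encoded by the VAE before concatenation is not specified here.

```python
def concat_condition_channels(noisy_latent, render_prior, visibility_mask):
    """Stack per-frame inputs along the channel axis.

    Each argument is a list of channels; each channel is an H x W nested list.
    """
    h, w = len(noisy_latent[0]), len(noisy_latent[0][0])
    for name, tensor in (("render", render_prior), ("mask", visibility_mask)):
        for channel in tensor:
            if len(channel) != h or any(len(row) != w for row in channel):
                raise ValueError(f"{name} spatial size does not match the latent")
    # Channel order (an assumption): noisy latent, rendered prior, mask.
    return noisy_latent + render_prior + visibility_mask


# Hypothetical sizes: 4 latent channels, 4 render channels, 1 mask channel.
H, W = 2, 3
noisy = [[[0.0] * W for _ in range(H)] for _ in range(4)]
render = [[[1.0] * W for _ in range(H)] for _ in range(4)]
mask = [[[1.0] * W for _ in range(H)]]
model_input = concat_condition_channels(noisy, render, mask)
print(len(model_input), len(model_input[0]), len(model_input[0][0]))
```

The spatial-size check reflects the requirement that the geometric priors are pixel-aligned with the target latents; without that alignment, channel concatenation would not give per-location geometric control.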

The implementation uses Wan2.1-I2V-14B with LoRA rank 32. Reported training stages are full self-attention fine-tuning for 100k iterations, sparse-attention warm-up for 10k iterations, and DMD2 distillation for 30k iterations. Sparse attention is block-based, and each frame attends to the retrieved geometry-aligned reference views, the 8 preceding frames, and the 8 succeeding frames.
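The attention pattern described above can be sketched at frame granularity. This is an illustration under stated assumptions, not the paper's kernel: the function name is hypothetical, the real mask operates on token blocks rather than whole frames, and the choice that reference frames attend only to each other is an assumption.

```python
def frame_attention_mask(num_refs, num_targets, window=8):
    """Return a boolean matrix allowed[q][k] over frames.

    Frames 0..num_refs-1 are the prepended reference views; the remaining
    frames are target frames in trajectory order.
    """
    total = num_refs + num_targets
    allowed = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_refs:
                # Every frame may read the prepended reference memory.
                allowed[q][k] = True
            elif q >= num_refs:
                # Target frames also see themselves and neighbours
                # within `window` frames on either side.
                allowed[q][k] = abs(q - k) <= window
    return allowed


# Hypothetical sizes: 4 reference views and a 40-frame chunk.
mask = frame_attention_mask(num_refs=4, num_targets=40)
# Number of frames visible to the first and to a middle target frame.
print(sum(mask[4]), sum(mask[24]))
```

With these sizes, a middle target frame reads 21 of 44 frames instead of all of them, which is where the attention cost saving comes from.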

4. Geometry-aware retrieval and memory update

AnyRecon couples generation and reconstruction through an explicit 3D geometric memory. The geometry memory $\mathcal{M}_{geo}$ is an incrementally updated point cloud reconstructed initially from the original sparse captures. After each generated segment, the system re-estimates geometry from the generated views together with the original views using the feed-forward geometry estimator, and this newly reconstructed geometry then replaces the existing memory $\mathcal{M}_{geo}$. The method thus refreshes rather than merely accumulates geometry, and the paper presents this as a mechanism for mitigating geometric drift in long trajectories (Chen et al., 21 Apr 2026).

Capture-view retrieval is driven by geometric contribution under the current target viewpoint, not by image similarity or simple field-of-view heuristics. The inputs are the capture views in $\mathcal{I}_{cap}$ together with their camera poses.

For a target novel view, the current geometry memory is rendered to produce a visibility index map identifying, for each visible point, which source view it originated from during reconstruction. The relevance score of a capture view is defined from two sets: the set of geometry points visible from the target viewpoint, and the subset of those points that was reconstructed from that capture view. The highest-scoring capture views are then chosen as the conditioning views for the segment.

This retrieval rule makes the conditioning set occlusion-aware and explicitly grounded in scene geometry. The visible overlap is computed relative to the current target viewpoint, so the retrieved references are those with the highest geometric contribution to the specific segment being generated. This suggests that AnyRecon uses retrieval not as a generic memory lookup but as a view-dependent geometric selection mechanism.
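A minimal sketch of the retrieval rule described above, operating on a visibility index map. The function name, the use of `-1` for pixels with no rendered geometry, and the normalization of the score as a fraction of visible points are assumptions; ranking by raw counts would select the same views.

```python
from collections import Counter


def retrieve_views(visibility_index_map, top_k):
    """Rank capture views by their share of the points visible in the target view.

    visibility_index_map: 2D list; each entry is the id of the capture view
    that the visible point at that pixel was reconstructed from, or -1 where
    no geometry is rendered.
    """
    counts = Counter(
        view_id
        for row in visibility_index_map
        for view_id in row
        if view_id >= 0
    )
    total_visible = sum(counts.values())
    if total_visible == 0:
        return [], {}
    scores = {v: n / total_visible for v, n in counts.items()}
    # Sort by descending score; break ties by view id for determinism.
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return ranked[:top_k], scores


index_map = [
    [0, 0, 2, -1],
    [0, 2, 2, 2],
    [1, 2, -1, -1],
]
selected, scores = retrieve_views(index_map, top_k=2)
print(selected)
```

Because the index map is produced by rendering, points hidden behind other geometry never appear in it, so a capture view that only observed occluded regions receives no score for this target viewpoint.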

The system also derives rendered geometric priors and visibility masks from $\mathcal{M}_{geo}$. Generated views are then fed back into the geometry estimation stage together with original captures, so the synthesized frames function as pseudo-observations that enlarge scene coverage and improve later conditioning.

5. Training and inference procedure

Training uses DL3DV-10K, a large-scale indoor/outdoor 3D scene dataset. Original videos are partitioned into 40-frame clips. To simulate arbitrary and unordered sparse inputs, the first frame is always used as a base reference and additional conditioning views are randomly selected. With 50% probability, the additional views are sampled from the first 20 frames, and with 50% probability they are sampled from the full 40-frame window. The selected views are processed by the feed-forward geometry estimator to initialize $\mathcal{M}_{geo}$, which is then projected into target viewpoints to form $I_{render}$ and $M_t$ (Chen et al., 21 Apr 2026).
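The conditioning-view sampling scheme can be sketched as follows. The function name and the `num_extra` parameter are hypothetical, and sampling without replacement while excluding frame 0 from the additional views is an assumption.

```python
import random


def sample_conditioning_indices(num_extra, clip_len=40, rng=random):
    """Pick conditioning frame indices for one training clip.

    Frame 0 is always included. The extra views come from the first half of
    the clip or from the whole clip, each with probability 0.5.
    """
    upper = clip_len // 2 if rng.random() < 0.5 else clip_len
    candidates = range(1, upper)  # frame 0 is already the base reference
    extra = rng.sample(candidates, num_extra)
    return [0] + extra  # left unsorted, so the conditioning set is unordered


rng = random.Random(0)
indices = sample_conditioning_indices(num_extra=3, rng=rng)
print(indices)
```

Mixing the two sampling ranges exposes the model both to conditioning views clustered near the start of the clip and to views spread across the whole clip.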

Optimization uses AdamW on 64 NVIDIA A800 GPUs, with one learning rate initially and a separate one for distillation. The paper’s main mathematical diffusion content concerns 4-step Distribution Matching Distillation: the continuous noise schedule is discretized into a fixed set of timesteps, and the student is trained to generate in 4 denoising steps. The paper states that generator and critic are trained with alternating updates.

At test time, arbitrary sparse captures are collected into $\mathcal{I}_{cap}$, a feed-forward point-map estimator such as VGGT or $\pi^3$ constructs the initial $\mathcal{M}_{geo}$, and a user specifies the target camera path $V_{novel}$. The trajectory is chunked into segments. For each segment, the method computes visibility-aware retrieval scores, selects the top-scoring capture views as conditioning views, and renders $\mathcal{M}_{geo}$ into the target viewpoints to obtain $I_{render}$ and $M_t$. It then prepends the selected views as the capture-view cache and generates the segment RGB views $\hat{I}_{novel}$ using the 4-step distilled diffusion model with sparse attention. Finally, it reconstructs updated geometry from the generated views plus the original captures using the geometry estimator, and replaces or updates $\mathcal{M}_{geo}$. After all segments are processed, the resulting geometry memory serves as the final reconstructed 3D scene representation.
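The control flow of this procedure can be sketched as a short loop. The four callables are hypothetical stand-ins for the geometry estimator, the retrieval step, the point-cloud renderer, and the diffusion model; re-estimating geometry from all views generated so far, rather than only the latest segment, is also an assumption.

```python
def chunk_trajectory(poses, segment_len):
    """Split a camera path into consecutive segments."""
    return [poses[i:i + segment_len] for i in range(0, len(poses), segment_len)]


def reconstruct_scene(captures, target_poses, segment_len,
                      estimate_geometry, retrieve, render, generate):
    """Alternate between view generation and geometry refresh."""
    geometry = estimate_geometry(captures)            # initial geometry memory
    generated = []
    for segment in chunk_trajectory(target_poses, segment_len):
        refs = retrieve(geometry, captures, segment)   # conditioning views
        renders, masks = render(geometry, segment)     # geometric priors
        views = generate(refs, renders, masks)         # novel RGB views
        generated.extend(views)
        # Refresh the memory from the captures plus the generated views.
        geometry = estimate_geometry(captures + generated)
    return geometry, generated


# Toy stand-ins so the control flow can be run end to end.
geometry, views = reconstruct_scene(
    captures=["cap0", "cap1"],
    target_poses=list(range(10)),
    segment_len=4,
    estimate_geometry=lambda images: {"num_views": len(images)},
    retrieve=lambda geo, caps, seg: caps[:1],
    render=lambda geo, seg: (["render"] * len(seg), ["mask"] * len(seg)),
    generate=lambda refs, renders, masks: [f"novel{i}" for i in range(len(renders))],
)
print(geometry, len(views))
```

The geometry returned by the last iteration is the final scene representation, and each segment is generated against the memory refreshed by the previous one.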

6. Empirical evaluation and ablations

Evaluation is performed on 10 scenes from DL3DV-Evaluation and 5 scenes from Tanks and Temples. For each test sequence, 40 frames are sampled, and Tanks and Temples sequences are temporally subsampled to create sparse-view difficulty. Two settings are used. In interpolation, three frames of the sequence serve as conditioning views. In extrapolation, four frames serve as conditioning views. Baselines are Difix3D+, ViewCrafter, and Uni3C. Metrics are PSNR, SSIM, LPIPS, and runtime per sequence (Chen et al., 21 Apr 2026).

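The reported results, with runtime given per sequence, are:

| Dataset | Setting | Method | PSNR | SSIM | LPIPS | Runtime (s) |
| --- | --- | --- | --- | --- | --- | --- |
| DL3DV | Interpolation | Difix3D+ | 17.88 | 0.551 | 0.290 | 1200 |
| DL3DV | Interpolation | ViewCrafter | 15.86 | 0.463 | 0.394 | 170 |
| DL3DV | Interpolation | Uni3C | 16.33 | 0.471 | 0.319 | 340 |
| DL3DV | Interpolation | AnyRecon | 20.95 | 0.656 | 0.151 | 105 |
| DL3DV | Extrapolation | Difix3D+ | 18.74 | 0.576 | 0.261 | 1200 |
| DL3DV | Extrapolation | ViewCrafter | 15.51 | 0.459 | 0.406 | 170 |
| DL3DV | Extrapolation | Uni3C | 15.69 | 0.457 | 0.344 | 340 |
| DL3DV | Extrapolation | AnyRecon | 21.16 | 0.660 | 0.158 | 105 |
| Tanks and Temples | Interpolation | Difix3D+ | 19.43 | 0.629 | 0.163 | 1200 |
| Tanks and Temples | Interpolation | ViewCrafter | 15.85 | 0.474 | 0.364 | 170 |
| Tanks and Temples | Interpolation | Uni3C | 16.77 | 0.514 | 0.263 | 340 |
| Tanks and Temples | Interpolation | AnyRecon | 20.37 | 0.639 | 0.158 | 105 |
| Tanks and Temples | Extrapolation | Difix3D+ | 18.67 | 0.594 | 0.190 | 1200 |
| Tanks and Temples | Extrapolation | ViewCrafter | 15.83 | 0.481 | 0.361 | 170 |
| Tanks and Temples | Extrapolation | Uni3C | 16.54 | 0.502 | 0.274 | 340 |
| Tanks and Temples | Extrapolation | AnyRecon | 20.30 | 0.629 | 0.181 | 105 |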

The ablation results isolate several architectural choices. For temporal compression (TC) on DL3DV interpolation at 50 diffusion steps, with entries listed as PSNR / SSIM / LPIPS / runtime, Full TC yields 20.16 / 0.616 / 0.179 / 210+(15) s, Partial TC yields 21.10 / 0.661 / 0.153 / 270+(15) s, and w/o TC yields 21.57 / 0.687 / 0.140 / 1820+(15) s. This supports the claim that removing temporal compression improves fine geometric detail and frame correspondence, but it also introduces major computational cost unless paired with distillation and sparse attention.

For efficiency, the no-TC model gives 1820 s with 50-step full attention, 140 s with 4-step distilled full attention and PSNR 21.32, and 90 s with 4-step distilled sparse attention and PSNR 20.95. The paper reports that combining distillation and sparse attention yields a large generation speedup over vanilla diffusion; the runtimes above correspond to roughly 20× (1820 s versus 90 s). For global scene memory, the ablation reports PSNR / SSIM / LPIPS of 20.18 / 0.634 / 0.205 without Global Scene Memory and 20.95 / 0.656 / 0.151 with it. The paper also provides qualitative evidence that without geometry memory update, later segments become inconsistent, and that geometry-driven retrieval excludes occluded or geometrically irrelevant capture views better than FOV or similarity heuristics.
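The speedup factors implied by the reported generation runtimes can be checked with a few lines of arithmetic. The dictionary keys are descriptive labels chosen here, not names from the paper.

```python
# Generation runtimes in seconds for the model without temporal compression,
# as reported in the efficiency ablation.
runtimes = {
    "50-step, full attention": 1820,
    "4-step distilled, full attention": 140,
    "4-step distilled, sparse attention": 90,
}
baseline = runtimes["50-step, full attention"]
speedups = {name: baseline / secs for name, secs in runtimes.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Distillation alone accounts for a 13× reduction, and sparse attention brings the total to about 20×, at a PSNR cost of 21.32 to 20.95 between the two distilled variants.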

7. Relation to adjacent reconstruction systems and limitations

Within the broader reconstruction literature, AnyRecon occupies a scene-level sparse-view position that differs from both simulation-ready compositional reconstruction and automated salient-object reconstruction. SimRecon reconstructs cluttered indoor RGB videos into an object-centric, simulator-ready scene assembled through a “Perception–Generation–Simulation” pipeline, whereas AnyRecon uses a video diffusion model plus explicit geometry memory to reconstruct from arbitrary and unordered sparse inputs (Xia et al., 2 Mar 2026). AutoRecon, by contrast, is a fully automated pipeline for discovering and reconstructing a salient foreground object from object-centric multi-view images using SfM, self-supervised ViT features, and decomposed neural scene representations (Wang et al., 2023). This comparison suggests that AnyRecon is centered on arbitrary-view scene reconstruction under sparse observation, rather than simulation-ready compositional assembly or salient-object isolation.

The method’s reported strengths are support for arbitrary and unordered sparse conditioning views, preservation of explicit geometric control via point-cloud renderings and visibility masks, scalability to long trajectories by chunked generation plus persistent geometry memory, and improved fidelity and efficiency relative to prior diffusion-based baselines. The reported gains are strongest in the regimes the paper targets: irregular sparse inputs, large viewpoint gaps, interpolation and extrapolation, and long trajectories or large scenes.

The stated limitations are tied to the quality of the 3D geometric memory. The paper says the method is robust to minor inaccuracies such as pose misalignment, noise, and artifacts, but still requires basic structural coherence. In extreme cases with minimal view overlap, the initial reconstruction may fail, leading to weak guidance for diffusion and poor synthesis. The paper does not discuss dynamic scenes, reflective materials, or memory growth overhead in detail. A plausible implication is that AnyRecon’s explicit geometric control yields strong conditioning when the evolving point-map geometry is coherent, but also ties performance to the reliability of that geometry throughout the iterative loop.
