
AnchorWeave: Memory-Augmented Video Gen

Updated 18 February 2026
  • AnchorWeave is a memory-augmented video generation framework that uses per-frame local spatial memories and a multi-anchor weaving controller to maintain consistent spatial structures.
  • It employs a greedy coverage retrieval algorithm to select optimal local point clouds from depth-estimated frames, mitigating cross-view misalignment issues.
  • Empirical results show that increasing the number of anchors improves metrics like PSNR and SSIM, demonstrating enhanced long-horizon video synthesis quality.

AnchorWeave is a memory-augmented video generation framework designed to maintain world-consistent spatial structures over long horizons in camera-controllable video diffusion. Diverging from global 3D fusion-based paradigms, which are susceptible to cross-view misalignment artifacts, AnchorWeave deploys multiple clean, per-frame local spatial memories and reconciles their inconsistencies through a novel multi-anchor weaving controller. This architecture achieves substantial improvements in scene consistency and visual quality for long-form, arbitrarily navigated video synthesis under user-specified camera trajectories (Wang et al., 16 Feb 2026).

1. Motivation and Problem Formulation

Long-horizon, camera-controllable video generation requires synthesizing video sequences that remain consistent with both viewed and generated history, especially when the camera revisits regions of the environment. Existing video diffusion models, including DiT-based latent diffusion models (LDMs), offer strong short-range visual quality but inherently lack explicit global memory, resulting in spatial drift, ghosting, or hallucination when the camera revisits prior contexts.

Traditional memory-based solutions reconstruct a global 3D scene from temporally accumulated views—such as point clouds or neural fields—and condition generation on rendered anchor videos derived from these reconstructions. However, global fusion is fundamentally limited by the accumulation of pose and depth estimation errors: identical surfaces are rendered at slightly different locations across views, leading to misalignment and contaminated geometry in the global memory, which, when used for conditioning, degrades output fidelity.

AnchorWeave replaces this approach with local geometric memories: each stored as a single-frame, pose-aligned, clean point cloud, free from cross-view fusion artifacts. During synthesis, the framework retrieves multiple local memories whose aggregate fields-of-view optimally cover the target viewpoint and combines their rendered signals in a controller designed to resolve residual spatial discrepancies.

2. Local Memory Construction and Coverage-Driven Retrieval

Local Spatial Memory Bank

AnchorWeave constructs its spatial memory as a bank of per-frame local point clouds. For each historical frame $i$, a pretrained depth estimator (e.g., TTT3R) infers a depth map and the intrinsic and extrinsic camera parameters. Each pixel is back-projected into a local point cloud

$$\mathcal{P}_i = \bigl\{(x_{ij}, y_{ij}, z_{ij},\, c_{ij})\bigr\}_{j=1}^{N_i}$$

with color $c_{ij}\in\mathbb{R}^3$ and corresponding camera pose $T_i\in SE(3)$. The set $\{\mathcal{P}_i,\, T_i\}$ defines the local memory bank.
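This back-projection step can be sketched as follows. The pinhole-camera model, the `backproject_frame` helper, and its argument shapes are illustrative assumptions, not the paper's implementation (which obtains depth and camera parameters from a pretrained estimator such as TTT3R):

```python
import numpy as np

def backproject_frame(depth, rgb, K, T):
    """Back-project a depth map into a pose-aligned local point cloud.

    depth: (H, W) metric depth; rgb: (H, W, 3) per-pixel colors;
    K: (3, 3) pinhole intrinsics; T: (4, 4) camera-to-world pose.
    Returns an (N, 6) array of world-space XYZ + RGB, one row per
    pixel with valid depth.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0                        # drop pixels with no depth
    z = depth[valid]
    # Pixel -> camera coordinates via the inverse pinhole projection
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous
    pts_world = (T @ pts_cam.T).T[:, :3]     # camera -> world transform
    return np.concatenate([pts_world, rgb[valid]], axis=1)
```

Because each point cloud comes from a single frame, no cross-view fusion is performed and the stored geometry stays free of the misalignment artifacts described above.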

Coverage-Oriented Retrieval

Given a query camera trajectory $\{\hat T_t\}_{t=1}^T$ divided into temporal chunks of length $D$, and denoting each chunk as $\mathcal{C}_m$, the goal is to select up to $K$ local memories whose union of visible regions maximizes coverage over $\mathcal{C}_m$. For each $\mathcal{P}_i$, the set of image pixels visible under camera $\hat T_t$ is

$$\mathrm{vis}(\mathcal{P}_i, \hat T_t) = \bigl\{(u,v)\in\Omega \mid \pi(\hat T_t^{-1} x)\in\Omega,\ x\in\mathcal{P}_i\bigr\}$$

where $\pi(\cdot)$ denotes projection and $\Omega$ is the image domain.
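A minimal sketch of this visibility test, assuming a pinhole projection and ignoring occlusion (the `visibility_mask` helper and its signature are illustrative):

```python
import numpy as np

def visibility_mask(points, T_query, K, H, W):
    """Mark the image pixels covered by projecting a local point cloud.

    points: (N, 3) world-space points; T_query: (4, 4) camera-to-world
    pose of the query view; K: (3, 3) intrinsics. Returns an (H, W)
    boolean mask: True where some point projects inside the image.
    """
    # World -> query-camera coordinates (the T^-1 x term in the text)
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (np.linalg.inv(T_query) @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0             # keep points in front of camera
    pts_cam = pts_cam[in_front]
    # Pinhole projection pi(.)
    u = K[0, 0] * pts_cam[:, 0] / pts_cam[:, 2] + K[0, 2]
    v = K[1, 1] * pts_cam[:, 1] / pts_cam[:, 2] + K[1, 2]
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    mask = np.zeros((H, W), dtype=bool)
    mask[v[inside].astype(int), u[inside].astype(int)] = True
    return mask
```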

The aggregate coverage for a candidate set SS over a chunk is

$$\mathrm{cov}(S;\mathcal{C}_m) = \frac{\left|\bigcup_{\mathcal{P}_i\in S} V_i(\mathcal{C}_m)\right|}{|\Omega|}$$

where $V_i(\mathcal{C}_m)$ is the union of $\mathrm{vis}(\mathcal{P}_i, \hat T_t)$ over the frames of the chunk.

The selection task is

$$S^* = \arg\max_{S\subseteq\text{Candidates},\, |S|\le K} \mathrm{cov}(S;\mathcal{C}_m)$$

solved via a greedy set-cover approximation (see Algorithm 1 in (Wang et al., 16 Feb 2026)).
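The greedy approximation can be sketched as follows, assuming each memory's chunk-level visibility $V_i(\mathcal{C}_m)$ has already been reduced to a boolean image mask (the function name and data layout are illustrative, not the paper's exact Algorithm 1):

```python
import numpy as np

def greedy_coverage_retrieval(vis_masks, K):
    """Greedy set-cover approximation of the anchor-selection objective.

    vis_masks: dict {memory_id: (H, W) bool array}, each mask being that
    local memory's visible region V_i(C_m) over the chunk. Returns up to
    K memory ids, chosen one at a time by largest marginal coverage gain.
    """
    ids = list(vis_masks)
    covered = np.zeros_like(vis_masks[ids[0]], dtype=bool)
    selected = []
    for _ in range(min(K, len(ids))):
        # Marginal gain: pixels each remaining candidate would newly cover
        gains = {i: int((vis_masks[i] & ~covered).sum())
                 for i in ids if i not in selected}
        best = max(gains, key=gains.get)
        if gains[best] == 0:          # no memory adds new coverage; stop
            break
        selected.append(best)
        covered |= vis_masks[best]
    return selected
```

As with classical set cover, this greedy rule gives the standard $(1 - 1/e)$ approximation guarantee for the monotone submodular coverage objective.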

Each retrieved memory is rendered as an anchor video over the query chunk; when fewer than $K$ anchors are retrieved, the remainder are padded with empty anchors for architectural consistency. The framework also computes the relative camera poses $\{\Delta T_{i,t} = \hat T_t\, T_i^{-1}\}$ to encode spatial relations.
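Computing the relative poses is a direct matrix product; the sketch below assumes 4×4 homogeneous camera-to-world matrices:

```python
import numpy as np

def relative_poses(query_poses, T_anchor):
    """DeltaT_{i,t} = T_hat_t @ inv(T_i) for every query-frame pose T_hat_t."""
    Ti_inv = np.linalg.inv(T_anchor)
    return np.stack([T_t @ Ti_inv for T_t in query_poses])
```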

3. Multi-Anchor Weaving Controller

AnchorWeave’s generation pipeline injects the retrieved anchor videos, together with their relative pose trajectories, into a frozen DiT-based diffusion model using a stack of ControlNet blocks. The controller architecture comprises two primary mechanisms:

Joint Multi-Anchor Attention

All $K$ anchor videos are encoded via a shared 3D VAE into latent representations $F_k\in\mathbb{R}^{L_a\times C_a}$, concatenated as

$$F_{\text{concat}} = [F_1; F_2; \dots; F_K] \in \mathbb{R}^{(K L_a)\times C_a}$$

and processed with a self-attention layer. This joint attention stage enables global context exchange, allowing the network to amplify spatially consistent cues and suppress spurious or misaligned evidence.
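A single-head sketch of this joint attention over the concatenated anchor tokens (the projection matrices `Wq`, `Wk`, `Wv` are illustrative stand-ins for the controller's learned parameters):

```python
import numpy as np

def joint_anchor_attention(anchor_latents, Wq, Wk, Wv):
    """Single-head self-attention over all anchors' tokens jointly.

    anchor_latents: list of K arrays, each (L_a, C_a). Tokens from every
    anchor are concatenated into one (K*L_a, C_a) sequence, so each token
    attends across anchors rather than only within its own anchor.
    """
    F = np.concatenate(anchor_latents, axis=0)            # (K*L_a, C_a)
    Q, Kmat, V = F @ Wq, F @ Wk, F @ Wv                   # projections
    scores = Q @ Kmat.T / np.sqrt(F.shape[1])             # scaled dot-product
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # row-wise softmax
    return attn @ V                                       # (K*L_a, C_a)
```

The joint (rather than per-anchor) attention is what lets consistent evidence reinforce itself across anchors, matching the ablation result in Section 6.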

Pose-Aware Fusion

For each anchor, the corresponding relative pose $\Delta T_{k,t}$ is embedded by flattening its rotation and translation components and passing them through a small MLP, followed by a softmax over anchors:

$$w_k = \mathrm{softmax}\bigl(\mathrm{MLP}(\mathrm{vec}(\Delta T_k))\bigr), \qquad \sum_k w_k = 1$$

These importance weights combine the anchor-conditioned features $G_k$ (obtained from attention) via a weighted sum:

$$G_{\text{fused}} = \sum_{k=1}^K w_k\, G_k$$

$G_{\text{fused}}$ is then injected into the corresponding residual block of the diffusion backbone, providing world-consistent geometric conditioning. A parallel branch encodes the explicit camera pose for direct control.
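The pose-conditioned weighting can be sketched as a two-layer MLP followed by a softmax over anchors; the specific MLP shapes below are assumptions, not the paper's architecture:

```python
import numpy as np

def pose_aware_fusion(rel_poses, anchor_feats, W1, b1, W2, b2):
    """Fuse per-anchor features with softmax weights from relative poses.

    rel_poses: (K, 4, 4) relative transforms DeltaT_k; anchor_feats:
    (K, L, C) per-anchor features G_k from the attention stage. A small
    MLP (weights W1: (16, H), b1: (H,), W2: (H, 1), b2: (1,) --
    illustrative shapes) maps each flattened pose to a scalar logit;
    a softmax over anchors yields weights w_k summing to 1.
    """
    K = rel_poses.shape[0]
    x = rel_poses.reshape(K, -1)                 # vec(DeltaT_k), (K, 16)
    h = np.maximum(x @ W1 + b1, 0.0)             # hidden layer with ReLU
    logits = (h @ W2 + b2).reshape(K)            # one logit per anchor
    w = np.exp(logits - logits.max())
    w /= w.sum()                                 # softmax over anchors
    fused = np.tensordot(w, anchor_feats, axes=1)  # sum_k w_k * G_k
    return fused, w
```

Anchors whose viewpoints are geometrically distant from the target thus receive low weight, which is the mechanism the ablations credit with suppressing ghosting from misaligned anchors.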

4. Training Paradigm

AnchorWeave employs a standard denoising diffusion probabilistic model (DDPM) training procedure. Let $x \in \mathbb{R}^{T\times 3\times H\times W}$ be a video with latent encoding $z = \mathcal{E}(x)$; Gaussian noise $\epsilon$ is added at a sampled timestep, and the network is trained to minimize the MSE

$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z,\epsilon,t}\bigl\|\epsilon - \epsilon_\theta(z_t, t, \text{cond})\bigr\|_2^2$$

The conditioning tuple includes a text or image prompt, the anchor latents, and pose embeddings. The diffusion backbone remains frozen; only the newly introduced weaving modules are trained. Classifier-free guidance (CFG) is applied at sampling time. No adversarial or auxiliary losses are used.
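One training step under this objective can be sketched as follows; `eps_pred_fn` stands in for $\epsilon_\theta$ (in AnchorWeave, the frozen DiT plus trainable weaving modules), and the $\bar\alpha$ schedule is a generic DDPM assumption:

```python
import numpy as np

def ddpm_step_loss(z0, t, alphas_cumprod, eps_pred_fn, rng):
    """One DDPM training step: noise the latent, predict, take the MSE.

    z0: clean latent; alphas_cumprod: (T,) cumulative alpha-bar schedule;
    eps_pred_fn(z_t, t): any callable returning an array shaped like z0.
    """
    eps = rng.standard_normal(z0.shape)          # sample Gaussian noise
    a_bar = alphas_cumprod[t]
    z_t = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps  # forward noising
    eps_hat = eps_pred_fn(z_t, t)                # predicted noise
    return float(np.mean((eps - eps_hat) ** 2))  # ||eps - eps_theta||^2
```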

5. Inference Workflow

At inference, with initial historical frames and user-specified camera trajectory, AnchorWeave applies an update–retrieve–generate loop:

# For each chunk C_m of the user trajectory (pseudocode; function
# names follow the paper's pipeline stages):
for C_m in chunks(trajectory, D):
    # Retrieve: greedily select up to K covering local memories
    A_m = GreedyCoverageRetrieval(M, C_m, K)
    # Render: project each retrieved memory along the chunk's cameras
    anchor_videos = Render(A_m, C_m)
    rel_poses = ComputeRelativePoses(A_m, C_m)
    # Generate: sample conditioned on anchors, poses, and trajectory
    z0 = DiffusionSample(
        cond={anchor_videos, rel_poses, C_m},
        backbone=FrozenDiT,
    )
    x_gen = Decode(z0)
    # Update: back-project x_gen frames into local point clouds
    # and append them to the memory bank M
    UpdateMemory(M, x_gen)

The process maintains a sliding-window memory bank updated with newly generated frames, enabling dynamic and consistent long-horizon generation as scenes evolve.

6. Empirical Evaluation and Ablation

Performance is assessed on multiple benchmarks under a partial-revisit protocol, with metrics including PSNR and SSIM for reconstruction consistency, and VBench perceptual-quality scores (Subject Consistency, Background Consistency, Motion Smoothness, Temporal Flicker, Aesthetic Quality, Imaging Quality).

A selection of results:

| Method | Total Quality↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|
| Context-as-Memory (reimpl.) | 78.07 | 17.91 | 0.5884 |
| SPMem (reimpl.) | 76.85 | 17.25 | 0.5710 |
| SEVA (Zhou et al. 2025) | 79.66 | 21.13 | 0.6711 |
| AnchorWeave ($K$=1) | 80.07 | 19.01 | 0.6145 |
| AnchorWeave ($K$=4) | 80.98 | 21.04 | 0.6739 |

AnchorWeave achieves state-of-the-art performance, exceeding prior approaches in both visual realism and long-range consistency, with further gains as the anchor count $K$ increases. Ablation studies reveal:

  • Using a global (fused) memory instead of local memories reduces PSNR from 20.96 to 16.31 and SSIM from 0.6727 to 0.5345.
  • Pose-conditioned fusion suppresses misaligned anchors, avoiding ghosting artifacts.
  • Joint multi-anchor attention outperforms separate, per-anchor attention for producing sharper geometry.
  • Increasing $K$ from 1 to 4 further boosts PSNR and SSIM, confirming the benefit of aggregating complementary evidence from multiple anchors.

Long-horizon explorations demonstrate near-lossless object consistency and structure maintenance across more than 200 generated frames, even when starting from a single open-domain image.

7. Current Limitations and Directions for Further Development

Several limitations are identified:

  • Depth estimation failures in textureless or reflective scenes result in incomplete local memories and holes in anchor renderings; large untextured regions may lead to plausible but unconstrained hallucinations.
  • Rapid camera motions can leave target regions uncovered if $K$ or the chunk length $D$ is insufficient relative to the memory history, exposing a trade-off between computational budget and spatial coverage.
  • Static scene assumption for each memory frame limits handling of deformable objects or dynamic scenes, necessitating future integration with dynamic surfel or Gaussian splatting schemes, or temporal object tracking.
  • The current retrieval policy is greedy and non-differentiable; introducing differentiable, end-to-end retrieval selection may enhance adaptivity and downstream video generation quality.

Advancements such as learned retrieval policies, improved depth completion or explicit inpainting for local memories, and extension to dynamic scene representations constitute plausible directions. This suggests that AnchorWeave’s architecture provides a flexible scaffold for ongoing research in long-horizon, world-consistent video generation (Wang et al., 16 Feb 2026).
