AnchorWeave: Memory-Augmented Video Gen
- AnchorWeave is a memory-augmented video generation framework that uses per-frame local spatial memories and a multi-anchor weaving controller to maintain consistent spatial structures.
- It employs a greedy coverage retrieval algorithm to select optimal local point clouds from depth-estimated frames, mitigating cross-view misalignment issues.
- Empirical results show that increasing the number of anchors improves metrics like PSNR and SSIM, demonstrating enhanced long-horizon video synthesis quality.
AnchorWeave is a memory-augmented video generation framework designed to maintain world-consistent spatial structures over long horizons in camera-controllable video diffusion. Diverging from global 3D fusion-based paradigms, which are susceptible to cross-view misalignment artifacts, AnchorWeave deploys multiple clean, per-frame local spatial memories and reconciles their inconsistencies through a novel multi-anchor weaving controller. This architecture achieves substantial improvements in scene consistency and visual quality for long-form, arbitrarily navigated video synthesis under user-specified camera trajectories (Wang et al., 16 Feb 2026).
1. Motivation and Problem Formulation
Long-horizon, camera-controllable video generation requires synthesizing video sequences that remain consistent with both viewed and generated history, especially when the camera revisits regions of the environment. Existing video diffusion models based on architectures such as DiT-based latent diffusion models (LDMs) offer strong short-range visual quality, but inherently lack explicit global memory, resulting in spatial drift, ghosting, or hallucination when revisiting prior contexts.
Traditional memory-based solutions reconstruct a global 3D scene from temporally accumulated views—such as point clouds or neural fields—and condition generation on rendered anchor videos derived from these reconstructions. However, global fusion is fundamentally limited by the accumulation of pose and depth estimation errors: identical surfaces are rendered at slightly different locations across views, leading to misalignment and contaminated geometry in the global memory, which, when used for conditioning, degrades output fidelity.
AnchorWeave replaces this approach with local geometric memories: each stored as a single-frame, pose-aligned, clean point cloud, free from cross-view fusion artifacts. During synthesis, the framework retrieves multiple local memories whose aggregate fields-of-view optimally cover the target viewpoint and combines their rendered signals in a controller designed to resolve residual spatial discrepancies.
2. Local Memory Construction and Coverage-Driven Retrieval
Local Spatial Memory Bank
AnchorWeave constructs its spatial memory as a bank of per-frame local point clouds. For each historical frame $x_i$, a pretrained depth estimator (e.g., TTT3R) infers a depth map $D_i$ together with the intrinsic and extrinsic camera parameters $(K_i, E_i)$. Each pixel $u$ in the image domain $\Omega$ is back-projected into a local point cloud

$$P_i = \left\{\, E_i \big( D_i(u)\, K_i^{-1}\, \tilde{u} \big) \;\middle|\; u \in \Omega \,\right\},$$

where $\tilde{u}$ is the homogeneous pixel coordinate and $E_i$ maps camera to world coordinates, with color $c_i(u)$ and corresponding camera pose $E_i$. The set $\mathcal{M} = \{(P_i, c_i, E_i)\}_i$ defines the local memory bank.
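The pinhole back-projection above can be sketched in a few lines of numpy; this is a minimal illustration of the standard depth-to-point-cloud step, not the paper's implementation (function and argument names are assumptions):

```python
import numpy as np

def backproject(depth, K, E):
    """Back-project a depth map into a world-space point cloud.

    depth : (H, W) per-pixel depth
    K     : (3, 3) camera intrinsics
    E     : (4, 4) camera-to-world extrinsic matrix
    Returns an (H*W, 3) array of 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates (u, v, 1), one row per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_h @ E.T)[:, :3]                # transform to world frame
```

Colors are carried along by indexing the RGB frame with the same flattened pixel order.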
Coverage-Oriented Retrieval
Given a query camera trajectory divided into temporal chunks of length $T$, and denoting each chunk of cameras as $C_m = \{c_{m,1}, \dots, c_{m,T}\}$, the goal is to select up to $K$ local memories whose union of visible regions maximizes coverage over $C_m$. For each memory $P_i$, the set of visible image-region pixels under camera $c$ is

$$V_i(c) = \{\, u \in \Omega \mid u \in \pi(P_i; c) \,\},$$

where $\pi$ denotes projection and $\Omega$ is the image domain.
The aggregate coverage of a candidate set $S \subseteq \mathcal{M}$ over a chunk $C_m$ is

$$\mathrm{Cov}(S, C_m) = \sum_{t=1}^{T} \Big|\, \bigcup_{i \in S} V_i(c_{m,t}) \,\Big|,$$

where the union is taken over the visibility sets within the chunk.
The selection task is

$$S^\star = \arg\max_{S \subseteq \mathcal{M},\; |S| \le K} \mathrm{Cov}(S, C_m),$$

solved via a greedy set-cover approximation (see Algorithm 1 in (Wang et al., 16 Feb 2026)).
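The greedy approximation can be sketched as follows. This is a minimal illustration under the assumption that each memory's visible-pixel set over the chunk has already been precomputed; names are illustrative, not from the paper's code:

```python
def greedy_coverage_retrieval(visible, K):
    """Greedy set-cover approximation for anchor selection.

    visible : dict mapping memory id -> set of target-view pixels it covers
              (assumed precomputed by projecting each local point cloud
              into the cameras of the query chunk)
    K       : maximum number of anchors to select
    Returns the selected memory ids in pick order.
    """
    selected, covered = [], set()
    for _ in range(K):
        best, best_gain = None, 0
        for mid, pix in visible.items():
            if mid in selected:
                continue
            gain = len(pix - covered)   # marginal coverage gain
            if gain > best_gain:
                best, best_gain = mid, gain
        if best is None:                # no remaining memory adds coverage
            break
        selected.append(best)
        covered |= visible[best]
    return selected
```

The greedy rule repeatedly picks the memory with the largest marginal gain, the standard $(1 - 1/e)$-approximation for monotone submodular coverage.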
Each retrieved memory is rendered as an anchor video over the query chunk; when fewer than $K$ memories are retrieved, the remaining slots are padded with empty anchors for architectural consistency. The framework also computes the set of relative camera poses between each anchor and the query trajectory to encode spatial relations.
3. Multi-Anchor Weaving Controller
AnchorWeave’s generation pipeline injects the retrieved anchor videos, together with their relative pose trajectories, into a frozen DiT-based diffusion model using a stack of ControlNet blocks. The controller architecture comprises two primary mechanisms:
Joint Multi-Anchor Attention
All anchor videos are encoded via a shared 3D VAE into latent representations $\{z^{(1)}, \dots, z^{(K)}\}$, concatenated along the token dimension, and processed with a joint self-attention layer:

$$[h^{(1)}, \dots, h^{(K)}] = \mathrm{SelfAttn}\big([z^{(1)}, \dots, z^{(K)}]\big).$$

This joint attention stage enables global context exchange, allowing the network to amplify spatially consistent cues and suppress spurious or misaligned evidence.
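A minimal single-head sketch of attention over the concatenated anchor tokens, assuming toy projection matrices in place of the controller's learned weights (the real model operates on 3D VAE latents with multi-head attention):

```python
import numpy as np

def joint_anchor_attention(anchor_latents, Wq, Wk, Wv):
    """Single-head self-attention over the concatenation of all anchor tokens.

    anchor_latents : list of (T, d) arrays, one per anchor
    Wq, Wk, Wv     : (d, d) projection matrices (illustrative stand-ins)
    Returns attended features split back per anchor.
    """
    z = np.concatenate(anchor_latents, axis=0)       # joint token sequence
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(z.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over ALL tokens
    out = attn @ v                                   # cross-anchor exchange
    lengths = [a.shape[0] for a in anchor_latents]
    return np.split(out, np.cumsum(lengths)[:-1], axis=0)
```

Because the softmax runs over the union of all anchors' tokens, each token can attend across anchor boundaries, which is what lets consistent evidence reinforce itself.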
Pose-Aware Fusion
For each anchor $k$, the corresponding relative pose $(R_k, t_k)$ is embedded by flattening its rotation and translation components and passing them through a small MLP, followed by a softmax over anchors:

$$w_k = \operatorname{softmax}_k\!\big(\mathrm{MLP}([\operatorname{vec}(R_k);\, t_k])\big).$$

These importance weights combine the attended anchor features $h^{(k)}$ via a weighted sum:

$$h = \sum_{k=1}^{K} w_k\, h^{(k)}.$$

The fused feature $h$ is then injected into the corresponding diffusion backbone residual block, providing world-consistent geometric conditioning. A parallel branch encodes the explicit camera pose for direct control.
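The pose-embedding and fusion step can be sketched as below; the two-layer MLP with explicit weight arguments is a toy stand-in for the learned embedder, and all names are assumptions for illustration:

```python
import numpy as np

def pose_aware_fusion(anchor_feats, rel_poses, W1, b1, W2, b2):
    """Softmax-weighted fusion of anchor features by relative-pose embedding.

    anchor_feats : (K, d) per-anchor conditioning features
    rel_poses    : list of (R, t) pairs; R is (3, 3) rotation, t is (3,)
    W1, b1, W2, b2 : toy two-layer MLP parameters producing a scalar logit
    """
    logits = []
    for R, t in rel_poses:
        x = np.concatenate([R.ravel(), t])   # flatten pose to a 12-dim vector
        h = np.maximum(x @ W1 + b1, 0.0)     # hidden layer with ReLU
        logits.append(h @ W2 + b2)           # scalar importance logit
    logits = np.array(logits, dtype=float)
    w = np.exp(logits - logits.max())
    w /= w.sum()                             # softmax over the K anchors
    return w @ anchor_feats                  # weighted sum -> fused feature
```

Anchors whose relative pose indicates poor alignment with the target view can thus be down-weighted before injection into the backbone.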
4. Training Paradigm
AnchorWeave employs a standard denoising diffusion probabilistic model (DDPM) training procedure. Let $x_0$ be a training video and $z_0$ its latent encoding; Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ is sampled and added at a random timestep $t$, and the network is trained to minimize the MSE

$$\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t}\big[\, \| \epsilon - \epsilon_\theta(z_t, t, \mathcal{C}) \|_2^2 \,\big].$$

The conditioning tuple $\mathcal{C}$ includes a text or image prompt, the anchor latents, and pose embeddings. The diffusion backbone remains frozen; only the newly introduced weaving modules are trained. Classifier-free guidance (CFG) is applied during sampling. No adversarial or auxiliary losses are used.
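A single simplified training step of this objective, in numpy; the noise-prediction callable stands in for the conditioned DiT plus weaving controller, and the schedule handling is deliberately minimal:

```python
import numpy as np

def ddpm_training_loss(z0, eps_pred_fn, alpha_bar, t, rng):
    """One simplified DDPM training step on a latent z0.

    z0          : clean latent array
    eps_pred_fn : callable (z_t, t) -> predicted noise (stand-in for the
                  frozen backbone + trainable weaving modules)
    alpha_bar   : (T,) cumulative noise schedule
    t           : sampled timestep index
    """
    eps = rng.normal(size=z0.shape)                              # sample noise
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    residual = eps_pred_fn(z_t, t) - eps
    return np.mean(residual ** 2)                                # MSE objective
```

In the actual framework only the weaving-controller parameters inside `eps_pred_fn` would receive gradients; the backbone stays frozen.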
5. Inference Workflow
At inference, with initial historical frames and user-specified camera trajectory, AnchorWeave applies an update–retrieve–generate loop:
```python
# Retrieve:
A_m = GreedyCoverageRetrieval(M, C_m, K)

# Render:
anchor_videos = Render(A_m, C_m)
rel_poses = ComputeRelativePoses(A_m, C_m)

# Generate:
z0 = DiffusionSample(
    cond={anchor_videos, rel_poses, C_m},
    backbone=FrozenDiT,
)
x_gen = Decode(z0)

# Update: add local point clouds from x_gen frames to M
```
The process maintains a sliding-window memory bank updated with newly generated frames, enabling dynamic and consistent long-horizon generation as scenes evolve.
6. Empirical Evaluation and Ablation
Performance is assessed on multiple benchmarks under a partial-revisit protocol, with metrics including PSNR, SSIM for reconstruction consistency, and VBench subjective and perceptual quality scores (Subject Consistency, Background Consistency, Motion Smoothness, Temporal Flicker, Aesthetic Quality, Imaging Quality).
A selection of results:
| Method | Total Quality↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|
| Context-as-Memory (reimpl.) | 78.07 | 17.91 | 0.5884 |
| SPMem (reimpl.) | 76.85 | 17.25 | 0.5710 |
| SEVA (Zhou et al. 2025) | 79.66 | 21.13 | 0.6711 |
| AnchorWeave ($K$=1) | 80.07 | 19.01 | 0.6145 |
| AnchorWeave ($K$=4) | 80.98 | 21.04 | 0.6739 |
AnchorWeave achieves state-of-the-art performance, exceeding prior approaches in both visual realism and long-range consistency, with further gains observed as the anchor count increases. Ablation studies reveal:
- Using a global (fused) memory instead of local memories reduces PSNR from 20.96 to 16.31 and SSIM from 0.6727 to 0.5345.
- Pose-conditioned fusion suppresses misaligned anchors, avoiding ghosting artifacts.
- Joint multi-anchor attention outperforms separate, per-anchor attention for producing sharper geometry.
- Increasing $K$ from 1 to 4 further boosts PSNR and SSIM, confirming the benefit of aggregating complementary evidence from multiple anchors.
Long-horizon explorations demonstrate near-lossless object consistency and structure maintenance across more than 200 generated frames, even when originating from a single open-domain image.
7. Current Limitations and Directions for Further Development
Several limitations are identified:
- Depth estimation failures in textureless or reflective scenes result in incomplete local memories and holes in anchor renderings; large untextured regions may lead to plausible but unconstrained hallucinations.
- Rapid camera motions can leave target regions uncovered when $K$ or the chunk length is insufficient relative to the memory history, exposing a trade-off between computational budget and spatial coverage.
- Static scene assumption for each memory frame limits handling of deformable objects or dynamic scenes, necessitating future integration with dynamic surfel or Gaussian splatting schemes, or temporal object tracking.
- The current retrieval policy is greedy and non-differentiable; introducing differentiable, end-to-end retrieval selection may enhance adaptivity and downstream video generation quality.
Advancements such as learned retrieval policies, improved depth completion or explicit inpainting for local memories, and extension to dynamic scene representations constitute plausible directions. In this light, AnchorWeave's architecture provides a flexible scaffold for ongoing research in long-horizon, world-consistent video generation (Wang et al., 16 Feb 2026).