AnchorWeave: Memory-Augmented Video Gen
- AnchorWeave is a memory-augmented video generation framework that uses per-frame local spatial memories and a multi-anchor weaving controller to maintain consistent spatial structures.
- It employs a greedy coverage retrieval algorithm to select optimal local point clouds from depth-estimated frames, mitigating cross-view misalignment issues.
- Empirical results show that increasing the number of anchors improves metrics like PSNR and SSIM, demonstrating enhanced long-horizon video synthesis quality.
AnchorWeave is a memory-augmented video generation framework designed to maintain world-consistent spatial structures over long horizons in camera-controllable video diffusion. Diverging from global 3D fusion-based paradigms, which are susceptible to cross-view misalignment artifacts, AnchorWeave deploys multiple clean, per-frame local spatial memories and reconciles their inconsistencies through a novel multi-anchor weaving controller. This architecture achieves substantial improvements in scene consistency and visual quality for long-form, arbitrarily navigated video synthesis under user-specified camera trajectories (Wang et al., 16 Feb 2026).
1. Motivation and Problem Formulation
Long-horizon, camera-controllable video generation requires synthesizing video sequences that remain consistent with both viewed and generated history, especially when the camera revisits regions of the environment. Existing video diffusion models based on architectures such as DiT-based latent diffusion models (LDMs) offer strong short-range visual quality, but inherently lack explicit global memory, resulting in spatial drift, ghosting, or hallucination when revisiting prior contexts.
Traditional memory-based solutions reconstruct a global 3D scene from temporally accumulated views—such as point clouds or neural fields—and condition generation on rendered anchor videos derived from these reconstructions. However, global fusion is fundamentally limited by the accumulation of pose and depth estimation errors: identical surfaces are rendered at slightly different locations across views, leading to misalignment and contaminated geometry in the global memory, which, when used for conditioning, degrades output fidelity.
AnchorWeave replaces this approach with local geometric memories: each stored as a single-frame, pose-aligned, clean point cloud, free from cross-view fusion artifacts. During synthesis, the framework retrieves multiple local memories whose aggregate fields-of-view optimally cover the target viewpoint and combines their rendered signals in a controller designed to resolve residual spatial discrepancies.
2. Local Memory Construction and Coverage-Driven Retrieval
Local Spatial Memory Bank
AnchorWeave constructs its spatial memory as a bank of per-frame local point clouds. For each historical frame $x_i$, a pretrained depth estimator (e.g., TTT3R) infers a depth map $D_i$ together with the intrinsic and extrinsic camera parameters $(K_i, E_i)$. Each pixel $u$ in the image domain $\Omega$ is back-projected into a local point cloud

$$P_i = \left\{\, E_i \big( D_i(u)\, K_i^{-1}\, \tilde{u} \big) \;\middle|\; u \in \Omega \,\right\},$$

where $\tilde{u}$ is the homogeneous pixel coordinate and $E_i$ maps camera to world coordinates, with color $c_i(u)$ and corresponding camera pose $E_i$. The set $\mathcal{M} = \{(P_i, c_i, E_i)\}_i$ defines the local memory bank.
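The pinhole back-projection above can be sketched in a few lines of numpy; this is a minimal illustration of the standard depth-to-point-cloud step, not the paper's implementation (function and argument names are assumptions):

```python
import numpy as np

def backproject(depth, K, E):
    """Back-project a depth map into a world-space point cloud.

    depth : (H, W) per-pixel depth
    K     : (3, 3) camera intrinsics
    E     : (4, 4) camera-to-world extrinsic matrix
    Returns an (H*W, 3) array of 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates (u, v, 1), one row per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T            # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)      # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_h @ E.T)[:, :3]                # transform to world frame
```

Colors are carried along by indexing the RGB frame with the same flattened pixel order.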
Coverage-Oriented Retrieval
Given a query camera trajectory divided into temporal chunks of length $T$, and denoting each chunk of cameras as $C_m = \{c_{m,1}, \dots, c_{m,T}\}$, the goal is to select up to $K$ local memories whose union of visible regions maximizes coverage over $C_m$. For each memory $P_i$, the set of visible image-region pixels under camera $c$ is

$$V_i(c) = \{\, u \in \Omega \mid u \in \pi(P_i; c) \,\},$$

where $\pi$ denotes projection and $\Omega$ is the image domain.
The aggregate coverage of a candidate set $S \subseteq \mathcal{M}$ over a chunk $C_m$ is

$$\mathrm{Cov}(S, C_m) = \sum_{t=1}^{T} \Big|\, \bigcup_{i \in S} V_i(c_{m,t}) \,\Big|,$$

where the union is taken over the visibility sets within the chunk.
The selection task is

$$S^\star = \arg\max_{S \subseteq \mathcal{M},\; |S| \le K} \mathrm{Cov}(S, C_m),$$

solved via a greedy set-cover approximation (see Algorithm 1 in (Wang et al., 16 Feb 2026)).
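The greedy approximation can be sketched as follows. This is a minimal illustration under the assumption that each memory's visible-pixel set over the chunk has already been precomputed; names are illustrative, not from the paper's code:

```python
def greedy_coverage_retrieval(visible, K):
    """Greedy set-cover approximation for anchor selection.

    visible : dict mapping memory id -> set of target-view pixels it covers
              (assumed precomputed by projecting each local point cloud
              into the cameras of the query chunk)
    K       : maximum number of anchors to select
    Returns the selected memory ids in pick order.
    """
    selected, covered = [], set()
    for _ in range(K):
        best, best_gain = None, 0
        for mid, pix in visible.items():
            if mid in selected:
                continue
            gain = len(pix - covered)   # marginal coverage gain
            if gain > best_gain:
                best, best_gain = mid, gain
        if best is None:                # no remaining memory adds coverage
            break
        selected.append(best)
        covered |= visible[best]
    return selected
```

The greedy rule repeatedly picks the memory with the largest marginal gain, the standard $(1 - 1/e)$-approximation for monotone submodular coverage.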
Each retrieved memory is rendered as an anchor video over the query chunk; when fewer than $K$ memories are retrieved, the remaining slots are padded with empty anchors for architectural consistency. The framework also computes the set of relative camera poses between each anchor and the query trajectory to encode spatial relations.
3. Multi-Anchor Weaving Controller
AnchorWeave’s generation pipeline injects the retrieved anchor videos, together with their relative pose trajectories, into a frozen DiT-based diffusion model using a stack of ControlNet blocks. The controller architecture comprises two primary mechanisms:
Joint Multi-Anchor Attention
All anchor videos are encoded via a shared 3D VAE into latent representations $\{z^{(1)}, \dots, z^{(K)}\}$, concatenated along the token dimension, and processed with a joint self-attention layer:

$$[h^{(1)}, \dots, h^{(K)}] = \mathrm{SelfAttn}\big([z^{(1)}, \dots, z^{(K)}]\big).$$

This joint attention stage enables global context exchange, allowing the network to amplify spatially consistent cues and suppress spurious or misaligned evidence.
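A minimal single-head sketch of attention over the concatenated anchor tokens, assuming toy projection matrices in place of the controller's learned weights (the real model operates on 3D VAE latents with multi-head attention):

```python
import numpy as np

def joint_anchor_attention(anchor_latents, Wq, Wk, Wv):
    """Single-head self-attention over the concatenation of all anchor tokens.

    anchor_latents : list of (T, d) arrays, one per anchor
    Wq, Wk, Wv     : (d, d) projection matrices (illustrative stand-ins)
    Returns attended features split back per anchor.
    """
    z = np.concatenate(anchor_latents, axis=0)       # joint token sequence
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(z.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over ALL tokens
    out = attn @ v                                   # cross-anchor exchange
    lengths = [a.shape[0] for a in anchor_latents]
    return np.split(out, np.cumsum(lengths)[:-1], axis=0)
```

Because the softmax runs over the union of all anchors' tokens, each token can attend across anchor boundaries, which is what lets consistent evidence reinforce itself.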
Pose-Aware Fusion
For each anchor $k$, the corresponding relative pose $(R_k, t_k)$ is embedded by flattening its rotation and translation components and passing them through a small MLP, followed by a softmax over anchors:

$$w_k = \operatorname{softmax}_k\!\big(\mathrm{MLP}([\operatorname{vec}(R_k);\, t_k])\big).$$

These importance weights combine the attended anchor features $h^{(k)}$ via a weighted sum:

$$h = \sum_{k=1}^{K} w_k\, h^{(k)}.$$

The fused feature $h$ is then injected into the corresponding diffusion backbone residual block, providing world-consistent geometric conditioning. A parallel branch encodes the explicit camera pose for direct control.
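The pose-embedding and fusion step can be sketched as below; the two-layer MLP with explicit weight arguments is a toy stand-in for the learned embedder, and all names are assumptions for illustration:

```python
import numpy as np

def pose_aware_fusion(anchor_feats, rel_poses, W1, b1, W2, b2):
    """Softmax-weighted fusion of anchor features by relative-pose embedding.

    anchor_feats : (K, d) per-anchor conditioning features
    rel_poses    : list of (R, t) pairs; R is (3, 3) rotation, t is (3,)
    W1, b1, W2, b2 : toy two-layer MLP parameters producing a scalar logit
    """
    logits = []
    for R, t in rel_poses:
        x = np.concatenate([R.ravel(), t])   # flatten pose to a 12-dim vector
        h = np.maximum(x @ W1 + b1, 0.0)     # hidden layer with ReLU
        logits.append(h @ W2 + b2)           # scalar importance logit
    logits = np.array(logits, dtype=float)
    w = np.exp(logits - logits.max())
    w /= w.sum()                             # softmax over the K anchors
    return w @ anchor_feats                  # weighted sum -> fused feature
```

Anchors whose relative pose indicates poor alignment with the target view can thus be down-weighted before injection into the backbone.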
4. Training Paradigm
AnchorWeave employs a standard denoising diffusion probabilistic model (DDPM) training procedure. Let $x_0$ be a training video and $z_0$ its latent encoding; Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ is sampled and added at a random timestep $t$, and the network is trained to minimize the MSE

$$\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t}\big[\, \| \epsilon - \epsilon_\theta(z_t, t, \mathcal{C}) \|_2^2 \,\big].$$

The conditioning tuple $\mathcal{C}$ includes a text or image prompt, the anchor latents, and pose embeddings. The diffusion backbone remains frozen; only the newly introduced weaving modules are trained. Classifier-free guidance (CFG) is applied during sampling. No adversarial or auxiliary losses are used.
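A single simplified training step of this objective, in numpy; the noise-prediction callable stands in for the conditioned DiT plus weaving controller, and the schedule handling is deliberately minimal:

```python
import numpy as np

def ddpm_training_loss(z0, eps_pred_fn, alpha_bar, t, rng):
    """One simplified DDPM training step on a latent z0.

    z0          : clean latent array
    eps_pred_fn : callable (z_t, t) -> predicted noise (stand-in for the
                  frozen backbone + trainable weaving modules)
    alpha_bar   : (T,) cumulative noise schedule
    t           : sampled timestep index
    """
    eps = rng.normal(size=z0.shape)                              # sample noise
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    residual = eps_pred_fn(z_t, t) - eps
    return np.mean(residual ** 2)                                # MSE objective
```

In the actual framework only the weaving-controller parameters inside `eps_pred_fn` would receive gradients; the backbone stays frozen.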
5. Inference Workflow
At inference, with initial historical frames and user-specified camera trajectory, AnchorWeave applies an update–retrieve–generate loop:
```python
# Retrieve:
A_m = GreedyCoverageRetrieval(M, C_m, K)

# Render:
anchor_videos = Render(A_m, C_m)
rel_poses = ComputeRelativePoses(A_m, C_m)

# Generate:
z0 = DiffusionSample(
    cond={anchor_videos, rel_poses, C_m},
    backbone=FrozenDiT,
)
x_gen = Decode(z0)

# Update: add local point clouds from x_gen frames to M
```
The process maintains a sliding-window memory bank updated with newly generated frames, enabling dynamic and consistent long-horizon generation as scenes evolve.
6. Empirical Evaluation and Ablation
Performance is assessed on multiple benchmarks under a partial-revisit protocol, with metrics including PSNR, SSIM for reconstruction consistency, and VBench subjective and perceptual quality scores (Subject Consistency, Background Consistency, Motion Smoothness, Temporal Flicker, Aesthetic Quality, Imaging Quality).
A selection of results:
| Method | Total Quality↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|
| Context-as-Memory (reimpl.) | 78.07 | 17.91 | 0.5884 |
| SPMem (reimpl.) | 76.85 | 17.25 | 0.5710 |
| SEVA (Zhou et al. 2025) | 79.66 | 21.13 | 0.6711 |
| AnchorWeave ($K$=1) | 80.07 | 19.01 | 0.6145 |
| AnchorWeave ($K$=4) | 80.98 | 21.04 | 0.6739 |
AnchorWeave achieves state-of-the-art performance, exceeding prior approaches in both visual realism and long-range consistency, with further gains observed as the anchor count increases. Ablation studies reveal:
- Using a global (fused) memory instead of local memories reduces PSNR from 20.96 to 16.31 and SSIM from 0.6727 to 0.5345.
- Pose-conditioned fusion suppresses misaligned anchors, avoiding ghosting artifacts.
- Joint multi-anchor attention outperforms separate, per-anchor attention for producing sharper geometry.
- Increasing $K$ from 1 to 4 further boosts PSNR and SSIM, confirming the benefit of aggregating complementary evidence from multiple anchors.
Long-horizon explorations demonstrate near-lossless object consistency and structure maintenance across more than 200 generated frames, even when originating from a single open-domain image.
7. Current Limitations and Directions for Further Development
Several limitations are identified:
- Depth estimation failures in textureless or reflective scenes result in incomplete local memories and holes in anchor renderings; large untextured regions may lead to plausible but unconstrained hallucinations.
- Rapid camera motions can leave target regions uncovered when $K$ or the chunk length is insufficient relative to the memory history, exposing a trade-off between computational budget and spatial coverage.
- Static scene assumption for each memory frame limits handling of deformable objects or dynamic scenes, necessitating future integration with dynamic surfel or Gaussian splatting schemes, or temporal object tracking.
- The current retrieval policy is greedy and non-differentiable; introducing differentiable, end-to-end retrieval selection may enhance adaptivity and downstream video generation quality.
Advancements such as learned retrieval policies, improved depth completion or explicit inpainting for local memories, and extension to dynamic scene representations constitute plausible directions. In this light, AnchorWeave's architecture provides a flexible scaffold for ongoing research in long-horizon, world-consistent video generation (Wang et al., 16 Feb 2026).