MosaicMem: Hybrid Spatial Memory
- MosaicMem is a hybrid spatial memory system that combines explicit 3D patch storage with implicit attention-based conditioning for controllable video generation.
- It lifts latent video patches into 3D space and leverages camera-aware positional encodings (PRoPE) to achieve robust inter-frame and inter-view consistency.
- The system demonstrates low rotational and translational errors, outperforming baselines in long-horizon video navigation and scene editing tasks.
MosaicMem is a hybrid spatial memory system for controllable video world models. It targets a persistent challenge: maintaining inter-frame and inter-view consistency across long video horizons in the presence of dynamic objects and user interventions. It combines explicit geometric memory (for spatial accuracy) with implicit attention-based conditioning (for flexibility and dynamics) by lifting latent video patches into 3D space, enabling targeted memory retrieval, and using camera-aware positional encodings in the core transformer architecture (Yu et al., 17 Mar 2026).
1. Architectural Foundations
MosaicMem extends a Transformer-based latent video diffusion architecture (Wan 2.2) trained with continuous flow matching. The core video denoising process is expressed by the ODE:
$$\frac{dz_t}{dt} = v_\theta\left(z_t, t \mid I, y, \mathcal{C}, \mathcal{M}\right)$$
where $v_\theta$ is the learned velocity field, $z_t$ are video latents, $I$ is the input image, $y$ is an optional text prompt, $\mathcal{C}$ denotes camera pose trajectories, and $\mathcal{M}$ is the hybrid spatial memory. At each rollout step:
- The input frame is encoded into latent space by a VAE.
- Latents are subdivided into patches $\{p_i\}$.
- Each patch $p_i$ is lifted to a 3D world centroid $X_i$ via depth and camera pose.
- During video generation, patches from the memory $\mathcal{M}$ that overlap the target camera frustum are retrieved and reprojected to the current view.
- These reprojected patches are added as conditioning tokens at precise rotary positional encoding (RoPE) coordinates in the Transformer (DiT), with camera-consistent attention via the PRoPE module.
- The standard latent diffusion denoising proceeds, now conditioned on view-aligned memory patches and prompt inputs.
This construction achieves both view-consistent static memory (via explicit 3D patch storage and retrieval) and dynamic, prompt-responsive evolution (via the Transformer’s flexibility in granting or overriding memory influence).
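The denoising step in this pipeline amounts to integrating a conditional velocity field. The following is a minimal sketch under stated assumptions, not the paper's sampler: it uses a plain Euler solver, assumes the convention that t=0 is noise and t=1 is the clean sample, and replaces the DiT with a hypothetical `toy_velocity` function whose conditioning dictionary stands in for the image, text, camera, and memory inputs.

```python
import numpy as np

def euler_sample(velocity, z0, cond, num_steps=50):
    """Integrate dz/dt = velocity(z, t, cond) from t=0 to t=1 with Euler steps."""
    z = z0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity(z, t, cond)
    return z

def toy_velocity(z, t, cond):
    # Hypothetical stand-in for the denoiser: the velocity of a straight-line
    # path that ends at cond["target"] when t reaches 1.
    return (cond["target"] - z) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8))      # toy "latents" at t=0
target = np.ones((4, 8))                 # toy clean latents
sample = euler_sample(toy_velocity, noise, {"target": target}, num_steps=20)
```

In a real model the velocity would come from the Transformer evaluated on the latents concatenated with the retrieved memory tokens; only the integration loop is illustrated here.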
2. Hybrid Memory and 3D Patch Lifting
Each memory entry in MosaicMem consists of:
- A patch’s explicit 3D position $X_i$ (computed for the patch center or a local grid), associated source camera parameters, and its latent feature vector $f_i$.
- Original 2D RoPE coordinates.
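As a rough illustration of what one memory entry holds, the sketch below models it as a small record. The class and field names (`MemoryEntry`, `SpatialMemory`, `camera_id`, `rope_uv`) are hypothetical and chosen only for this example; the source does not specify a storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MemoryEntry:
    position: Tuple[float, float, float]  # 3D world position of the patch centre
    camera_id: int                        # index of the source camera parameters
    feature: List[float]                  # latent feature vector of the patch
    rope_uv: Tuple[float, float]          # original 2D RoPE coordinates

@dataclass
class SpatialMemory:
    entries: List[MemoryEntry] = field(default_factory=list)

    def append(self, entry: MemoryEntry) -> int:
        """Store an entry and return its index in the memory."""
        self.entries.append(entry)
        return len(self.entries) - 1
```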
The forward and inverse mappings are:
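- Lifting to 3D: For patch pixel center $(u, v)$, depth $d$, intrinsics $K$, and world-to-camera extrinsics $[R \mid t]$:
$$X = R^\top \left( d\, K^{-1} [u, v, 1]^\top - t \right)$$
- Projecting to a new view: For target camera $(K', R', t')$,
$$s\, [u', v', 1]^\top = K' \left( R' X + t' \right)$$
where $s$ is the depth of $X$ in the target view.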
Memory retrieval is geometric: all stored patches whose projections overlap the target frame are candidates. Retrieved tokens are supplied for conditioning; the architecture can exploit these for consistency, or disregard them for dynamic, evolving content.
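The lifting, projection, and retrieval steps can be sketched with a standard pinhole camera model. This is an illustrative example under assumptions that the source does not confirm: extrinsics are taken to be world-to-camera (`x_cam = R @ x_world + t`), visibility is tested on patch centers only rather than on full patch footprints, and the function names are hypothetical.

```python
import numpy as np

def lift_to_world(uv, depth, K, R, t):
    """Back-project pixel centres uv (N, 2) with depths (N,) to world points (N, 3).
    Assumes world-to-camera extrinsics: x_cam = R @ x_world + t."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T   # (3, N), third row is 1
    x_cam = rays * depth                                # scale each ray by its depth
    return (R.T @ (x_cam - t[:, None])).T

def project_to_view(X, K, R, t):
    """Project world points X (N, 3); returns pixel coords (N, 2) and depths (N,)."""
    x_cam = R @ X.T + t[:, None]
    uvw = K @ x_cam
    return (uvw[:2] / uvw[2]).T, x_cam[2]

def retrieve_visible(X, K, R, t, width, height):
    """Indices of points in front of the camera that land inside the image."""
    uv, z = project_to_view(X, K, R, t)
    inside = ((z > 0)
              & (uv[:, 0] >= 0) & (uv[:, 0] < width)
              & (uv[:, 1] >= 0) & (uv[:, 1] < height))
    idx = np.nonzero(inside)[0]
    return idx, uv[idx]
```

The returned coordinates are fractional in general, which is what makes the sub-pixel placement strategies of the next section necessary.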
3. Patch-and-Compose Retrieval and Warping
MosaicMem's patch-and-compose interface enables precise placement of memory patches in the current view’s latent space using two principal alignment strategies:
Warped RoPE
- The rotary positional encodings for patch tokens are sampled at the exact fractional projected coordinates instead of snapping to the integer grid.
- This provides sub-pixel positional fidelity, improving geometric alignment during attention.
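To make the idea of fractional positional encodings concrete, here is a minimal sketch of a rotary encoding evaluated at non-integer coordinates. It assumes a simplified 2D layout in which the first half of the channels encodes the horizontal coordinate and the second half the vertical one; the actual channel layout in the model (which also has a temporal axis) may differ, and `rope_1d` and `rope_2d` are hypothetical names.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (..., d) by angles pos * freq.
    pos may be fractional, so no snapping to the integer grid is needed."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)               # (d/2,)
    ang = np.asarray(pos, dtype=float)[..., None] * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, uv, base=10000.0):
    """2D RoPE: first half of the channels encodes u, second half encodes v."""
    h = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :h], uv[..., 0], base),
         rope_1d(x[..., h:], uv[..., 1], base)],
        axis=-1,
    )
```

Because the rotation angle is a continuous function of position, the query-key dot product still depends only on the positional offset, whether or not that offset is an integer.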
Warped Latent
- Source latent features are resampled, using differentiable bilinear interpolation, at the fractional projected coordinates.
- Warped features are concatenated to the Transformer input.
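Bilinear resampling at fractional coordinates can be sketched as follows. This NumPy version only illustrates the interpolation arithmetic and is not differentiable; in a training framework an autodiff equivalent would be used. The border clamping and the function name `bilinear_sample` are assumptions made for this example.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Sample feat (H, W, C) at fractional coordinates uv (N, 2) given as (x, y).
    Coordinates outside the grid are clamped to the border."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    wu = (u - u0)[:, None]            # horizontal interpolation weight
    wv = (v - v0)[:, None]            # vertical interpolation weight
    top = (1 - wu) * feat[v0, u0] + wu * feat[v0, u1]
    bot = (1 - wu) * feat[v1, u0] + wu * feat[v1, u1]
    return (1 - wv) * top + wv * bot
```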
Both methods operate without explicit alignment or reprojection losses. Instead, the fusion of feature-space and positional alignment is learned during fine-tuning by exposing the network to these warped patch tokens as part of the conditioning sequence, enabling robust and precise composition of observable scene elements despite the low-frequency nature of latent representations.
4. Camera Pose Conditioning via PRoPE
PRoPE (Projective Positional Encoding) injects accurate inter-frame and inter-view geometry into every self-attention block:
- Each frame $i$ is associated with a 4×4 projective matrix $\tilde{P}_i$.
- For each query/key interaction in self-attention, the relative transform $\tilde{P}_i \tilde{P}_j^{-1}$ is included in a block-diagonal matrix $D$, enriching the attention with both projective geometry and 2D rotary positional encoding.
- This ensures that learned attention heads are constrained to camera-consistent correspondences, which significantly reduces egomotion drift across long video horizons.
Compared with simpler MLP-based camera conditioning, PRoPE achieves lower rotational error (RotErr) and translational error (TransErr), especially in long rollouts and under complex camera trajectories.
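The role of the relative projective transform can be illustrated on a single 4-dimensional feature block. The sketch below is a simplification under assumptions: it omits the RoPE part of the block-diagonal matrix and the replication across channel groups, and the helper names are hypothetical. It shows two properties: the relative transform is unchanged by a global change of world coordinates, and the pairwise score can be obtained from per-token transforms alone.

```python
import numpy as np

def projective_matrix(K, R, t):
    """Lift the 3x4 projection K [R | t] to an invertible 4x4 matrix."""
    P = np.eye(4)
    P[:3, :3] = K @ R
    P[:3, 3] = K @ t
    return P

def relative_transform(P_i, P_j):
    """Relative projective transform between the cameras of frames i and j."""
    return P_i @ np.linalg.inv(P_j)

def prope_score(q, k, P_q, P_k):
    """Score q^T (P_q P_k^{-1}) k, computed by transforming q and k separately."""
    q_t = P_q.T @ q                  # query side: apply P_q^T
    k_t = np.linalg.inv(P_k) @ k     # key side: apply P_k^{-1}
    return float(q_t @ k_t)
```

Transforming queries and keys separately is what allows such an encoding to fit into a standard attention kernel, since no per-pair matrix has to be materialized.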
5. Training, Evaluation Metrics, and Ablations
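The MosaicMem model is trained end-to-end with the standard continuous flow matching loss:
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\, z_0,\, z_1} \left\| v_\theta\left(z_t, t \mid I, y, \mathcal{C}, \mathcal{M}\right) - (z_1 - z_0) \right\|_2^2, \qquad z_t = (1 - t)\, z_0 + t\, z_1$$
where $z_0$ is Gaussian noise, $z_1$ are the clean video latents, and $v_\theta$ is the velocity predicted under the image, text, camera, and memory conditions $(I, y, \mathcal{C}, \mathcal{M})$.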
No auxiliary alignment or reprojection losses are used; alignment is driven implicitly via memory warping.
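A minimal sketch of the flow matching objective is given below, assuming the common straight-path formulation in which noise sits at t=0 and clean latents at t=1, so that the regression target is their difference. The `velocity` callable is a hypothetical stand-in for the conditioned Transformer, and `cond` bundles whatever conditioning it needs.

```python
import numpy as np

def flow_matching_loss(velocity, z0, z1, t, cond):
    """Mean squared error between the predicted velocity and the target z1 - z0.
    z0: noise (B, ...), z1: clean latents (B, ...), t: times in [0, 1] of shape (B,)."""
    t_b = t.reshape(-1, *([1] * (z1.ndim - 1)))   # broadcast t over latent dims
    z_t = (1.0 - t_b) * z0 + t_b * z1             # point on the straight path
    target = z1 - z0                              # constant velocity of that path
    pred = velocity(z_t, t, cond)
    return float(np.mean((pred - target) ** 2))
```

Memory tokens enter only through `cond`, which is consistent with the statement that no separate alignment term is added to the objective.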
Evaluation benchmarks include:
- Camera Control: RotErr (degrees), TransErr (meters)
- Visual Quality: FID, FVD
- Consistency: SSIM, PSNR, LPIPS
- Dynamics: Optical flow magnitude
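For orientation, the camera-control and one of the consistency metrics can be computed as in the sketch below. These are the commonly used definitions (geodesic rotation angle, Euclidean translation distance, and PSNR for images scaled to a known maximum); the exact evaluation protocol of the paper, such as pose alignment or scale normalization, is not specified here and is not reproduced.

```python
import numpy as np

def rot_err_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def trans_err(t_pred, t_gt):
    """Euclidean distance between predicted and reference camera translations."""
    return float(np.linalg.norm(t_pred - t_gt))

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```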
Results (Table 1 in Yu et al., 17 Mar 2026) indicate:
| Method | RotErr ↓ | TransErr ↓ | FID ↓ | FVD ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | Dyn ↑ |
|---|---|---|---|---|---|---|---|---|
| SEVA (explicit best) | 1.42 | 0.12 | 74.7 | 302 | 0.66 | 22.0 | 0.15 | 1.22 |
| CaM (implicit best) | 4.65 | 0.43 | 85.3 | 392 | 0.49 | 15.8 | 0.42 | 1.72 |
| MosaicMem (full) | 0.51 | 0.06 | 65.7 | 233 | 0.75 | 23.6 | 0.11 | 2.58 |
Ablations reveal:
- PRoPE is essential for low drift.
- Combined Warped RoPE and Warped Latent produce the most robust results across consistency and dynamics.
- Memory sparsity and patch size affect trade-offs between overhead and detail/sharpness.
- Below ≈200 patches, geometry may be lost; above ≈400 patches, returns diminish.
6. Applications, Qualitative Outcomes, and Comparative Context
Navigation and Long-horizon Generation
MosaicMem supports minute-level video navigation by continually appending scene memory during autoregressive rollouts. Scene revisiting with accurate geometry persists over multi-minute intervals, surpassing memory-as-context baselines that collapse after ~20 seconds.
Scene Editing and Compositionality
Owing to explicit centroids for each patch, MosaicMem enables direct manipulation of spatial memory: cut, paste, or transform (e.g., world-space block flipping) entire scene segments. This property is leveraged for creative tasks—stitched world navigation, inventive scene layouts—that are infeasible for previous implicit or explicit-only memories.
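A world-space edit of this kind can be sketched as an operation on the stored centroids. The example below mirrors all centroids inside an axis-aligned box about the box's mid-plane; the function name and the box-based selection are hypothetical. It only moves positions: a complete edit would presumably also need to update the associated source camera parameters and patch content, which this sketch does not attempt.

```python
import numpy as np

def flip_block(centroids, box_min, box_max, axis=0):
    """Mirror centroids (N, 3) that lie inside an axis-aligned box about the
    box's mid-plane along the given axis. Returns the edited copy and the mask."""
    out = centroids.copy()
    inside = np.all((centroids >= box_min) & (centroids <= box_max), axis=1)
    mid = 0.5 * (box_min[axis] + box_max[axis])
    out[inside, axis] = 2.0 * mid - out[inside, axis]
    return out, inside

centroids = np.array([[0.2, 0.0, 0.0], [0.8, 0.0, 0.0], [5.0, 0.0, 0.0]])
edited, mask = flip_block(centroids,
                          np.array([0.0, -1.0, -1.0]),
                          np.array([1.0, 1.0, 1.0]))
```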
Autoregressive Rollout
When causal and rolling forcing are applied to the bidirectional diffusion model, the resulting real-time video output improves both subjectively and objectively on prior models (e.g., RELIC, Matrix-Game 2.0), as measured by VBench quality and pose metrics.
Relation to Alternative Hybrid and Biological Models
Previous hybrid spatial memory systems—such as those with explicit 3D TSDF voxel grids combined with episodic and working memory slots (Wu et al., 5 Jun 2025)—emphasize geometric grounding but typically require dense voxel storage, separate handling for dynamic objects, and operate at a coarser scene level. MosaicMem’s patch-based, RoPE-augmented explicit-implicit blend enables finer spatial selectivity, lower overhead per patch, and seamless compositionality.
Neural models for spatial memory, such as those employing local random connectivity and global inhibition to form persistent, tiled attractor states (so-called "mosaic" codes) (Natale et al., 2019), provide theoretical grounding for discretized, yet spanning, spatial memory. However, they operate at the neuronal/dynamical systems level, whereas MosaicMem leverages learned geometric-token correspondences within modern large-scale vision-LLMs.
7. Limitations and Prospective Enhancements
Known limitations include:
- Dependence on pre-trained depth and pose estimators, which limit fidelity in low-texture or featureless regions.
- Memory overhead: storage, token count, and geometric index computations presently do not scale sublinearly with scene size, necessitating future hierarchical or compressed representations.
- Handling highly dynamic or non-rigid entities (e.g., articulated motion, fluids) remains difficult; approaches coupling dedicated trackers or dense flow modules are suggested as potential supplements.
- Absence of explicit alignment or reprojection losses; adding such loss terms could further regularize the position-feature mapping.
MosaicMem demonstrates, by compositing view-realigned latent patches under precise projective attention, that world models can maintain both spatial consistency and dynamic flexibility at the scale necessary for immersive, controllable video generation (Yu et al., 17 Mar 2026). This hybrid approach informs both near-term improvements to autoregressive video synthesis and longer-term strategies for memory architectures in world modeling and generative embodied agents.