
MemFlow: Adaptive Memory for Video & Flow

Updated 18 December 2025
  • MemFlow is a dual-model framework in computer vision that employs adaptive memory modules to achieve high temporal coherence and efficiency in both streaming video generation and optical flow estimation.
  • Its streaming video generation model uses a diffusion transformer with Narrative Adaptive Memory and Sparse Memory Activation to deliver narrative consistency at >18 FPS with less than 10% computational overhead.
  • The optical flow variant uses a GRU-based recurrent update with temporal memory aggregation, outperforming prior methods on Sintel, KITTI-15, and high-resolution datasets.

MemFlow refers to two high-impact research models in computer vision, each addressing memory-based challenges in streaming settings: (1) long-context consistency and efficiency in streaming video generation using adaptive memory in diffusion transformers (Ji et al., 16 Dec 2025), and (2) optical flow estimation and prediction leveraging temporal memory aggregation for real-time applications (Dong et al., 7 Apr 2024). Both exploit memory modules to achieve high temporal coherence, efficient inference, and adaptation to changing content or scene dynamics, setting new baselines for consistency and computational performance in their respective domains.

1. Adaptive Memory in Streaming Video Generation

MemFlow for streaming video generation introduces an autoregressive diffusion transformer (DiT) architecture designed for multi-prompt, context-rich narrative video composition (Ji et al., 16 Dec 2025). The model operates by chunk-wise generation: at each iteration, $T$ new frames are synthesized, conditioned on the previous $n$ frames. Key architectural elements include:

  • Narrative Adaptive Memory (NAM): Dynamically retrieves a subset of historical key–value (KV) cache entries most semantically relevant to the text prompt for the upcoming chunk, allowing the model to flexibly reference distinct temporal cues appropriate to evolving narratives or scene changes.
  • Sparse Memory Activation (SMA): Restricts attention within the transformer layers to only the top-$k$ most relevant memory frames and their associated KV tensor slices, minimizing computational cost without sacrificing the temporal coherence necessary for narrative consistency.

These mechanisms are designed for sub-10% computational overhead compared to a memory-free baseline, while delivering real-time generation throughput (>18 FPS at 832×480 resolution).
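
For illustration, here is a minimal, runnable sketch of the chunk-wise autoregressive loop in PyTorch. The helpers (denoise_chunk, retrieve_memory, update_memory), shapes, and return values are toy stand-ins that only mirror the control flow described above; they are assumptions for exposition, not the authors' implementation.

```python
import torch

# Toy stand-ins so the skeleton runs end to end; the real model is a
# diffusion transformer (DiT) and these helpers are purely illustrative.
def denoise_chunk(prompt_emb, context, memory, num_frames, dim):
    # Placeholder for chunk-wise denoising conditioned on context + memory.
    return torch.randn(num_frames, dim)

def retrieve_memory(bank, prompt_emb, top_k=3):
    # NAM sketch: score stored prototypes against the prompt embedding.
    if not bank:
        return []
    scores = torch.stack([(prompt_emb * k).sum() for k, _ in bank])
    idx = scores.topk(min(top_k, len(bank))).indices
    return [bank[int(i)] for i in idx]

def update_memory(active_memory, chunk):
    # Append a condensed "prototype": here, the chunk's first frame as KV.
    return active_memory + [(chunk[0], chunk[0])]

def generate_stream(prompt_embs, chunk_T=3, cond_n=3, dim=64):
    bank, frames = [], [torch.zeros(cond_n, dim)]   # seed context frames
    for p in prompt_embs:                           # multi-prompt narrative
        memory = retrieve_memory(bank, p)           # prompt-relevant history
        context = torch.cat(frames)[-cond_n:]       # last n frames
        chunk = denoise_chunk(p, context, memory, chunk_T, dim)
        bank = update_memory(memory, chunk)         # bounded memory bank
        frames.append(chunk)
    return torch.cat(frames[1:])

video = generate_stream([torch.randn(64) for _ in range(4)])
print(video.shape)   # torch.Size([12, 64]): 4 prompts x 3 toy frame features
```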

2. Dynamic Memory Bank and Update Algorithms

The dynamic memory bank architecture underpins MemFlow's efficient handling of long-range context:

  • Memory Storage: After generating a chunk $m$ (consisting of $T$ frames), the model stores at each transformer layer $l$ the pairs $\{K^l_{m,i}, V^l_{m,i}\}_{i=1\dots T}$, maintaining a global bank of $b$ frames indexed by $(m, i)$.
  • Prompt-driven Retrieval: For the next chunk’s prompt, the model computes text-query matrices $Q^l_\text{text} \in \mathbb{R}^{q \times d}$ and performs cross-attention with stored keys, yielding scalar relevance scores via aggregated softmax attention:

$$\mathcal{S}^l_{m',i} = \mathrm{Aggregate}\left(\mathrm{Softmax}\left(\frac{Q^l_\text{text}\,(K^l_{m',i})^\top}{\sqrt{d}}\right)\right)$$

  • Update Rule: After identifying the top-$k$ relevant memory frames, a condensed “prototype” (the KV cache of the first latent frame from the just-generated chunk) is appended. The updated bank for layer $l$ in chunk $m+1$ is:

$$M^{l}_{m+1} = \left\{(K^l_{m',i}, V^l_{m',i})\right\}_{(m',i) \in I_k} \cup \left\{(K^l_{m,1}, V^l_{m,1})\right\}$$

where $I_k$ are the top-$k$ scoring indices.

This strategy ensures both responsiveness to prompt-specific history and tractable memory size, supporting multi-prompt interactive story generation and subject trackability despite event or scene switches.
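
The following is a minimal PyTorch sketch of the retrieval-and-update step for a single transformer layer. The per-frame KV tensor shapes, the mean aggregation of attention scores, and the function name are assumptions for illustration, not the paper's API.

```python
import torch
import torch.nn.functional as F

def nam_retrieve_and_update(q_text, bank_K, bank_V, new_K1, new_V1, k=3):
    """Prompt-driven memory retrieval and update for one layer (sketch).

    q_text : (q, d)    text-query matrix Q^l_text for the upcoming chunk
    bank_K : (b, n, d) keys of the b stored memory frames (n tokens each)
    bank_V : (b, n, d) matching values
    new_K1, new_V1 : (n, d) KV "prototype" of the just-generated chunk's
                     first latent frame
    Shapes and the mean aggregation (Aggregate = mean) are assumptions.
    """
    d = q_text.shape[-1]
    # Relevance score per stored frame: softmax attention of the text
    # queries over that frame's keys, then aggregated to a scalar.
    attn = F.softmax(torch.einsum('qd,bnd->bqn', q_text, bank_K) / d**0.5,
                     dim=-1)
    scores = attn.mean(dim=(1, 2))                     # (b,)

    # Keep the top-k most relevant frames, then append the new prototype.
    idx = scores.topk(min(k, scores.numel())).indices
    K_next = torch.cat([bank_K[idx], new_K1.unsqueeze(0)], dim=0)
    V_next = torch.cat([bank_V[idx], new_V1.unsqueeze(0)], dim=0)
    return K_next, V_next

# Toy usage: 8 stored frames of 16 tokens with dim 32, 4 text query tokens.
K_next, V_next = nam_retrieve_and_update(
    torch.randn(4, 32), torch.randn(8, 16, 32), torch.randn(8, 16, 32),
    torch.randn(16, 32), torch.randn(16, 32))
print(K_next.shape)   # (4, 16, 32): top-3 retained frames + 1 new prototype
```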

3. Sparse Memory Activation and Attention Efficiency

MemFlow restricts in-chunk cross-memory attention using an efficient selection and masking procedure at every transformer layer:

  • Framewise Scoring: The mean-pooled query vector $\bar{q}^l_\text{vis}$ from active tokens is scored against mean-pooled keys $\bar{k}^l_j$ from each candidate memory frame:

$$s_j = \bar{q}^l_\text{vis} \cdot \left(\bar{k}^l_j\right)^\top$$

  • Top-$k$ Frame Selection: Only the $k$ frames with the highest $s_j$ values participate in the cross-attention operation, with all others masked out:

$$\mathrm{Attn}\left(Q^l_\text{vis}, K^l_\text{mem}, V^l_\text{mem}\right) \approx \mathrm{Attn}\left(Q^l_\text{vis}, K^l_{\text{mem}, \mathcal{I}_k}, V^l_{\text{mem}, \mathcal{I}_k}\right)$$

where $\mathcal{I}_k$ is the set of selected indices.
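
A compact PyTorch sketch of this selection step follows, assuming single-head attention, mean pooling over tokens, and illustrative shapes; hard selection of the top-$k$ frames stands in for explicit masking here.

```python
import torch
import torch.nn.functional as F

def sparse_memory_attention(Q_vis, K_mem, V_mem, k=3):
    """Cross-attention restricted to the top-k memory frames (SMA sketch).

    Q_vis : (t, d)    queries from the active (in-chunk) visual tokens
    K_mem : (b, n, d) keys of b candidate memory frames, n tokens each
    V_mem : (b, n, d) matching values
    Single-head attention and mean pooling are simplifying assumptions.
    """
    d = Q_vis.shape[-1]
    # Framewise score: mean-pooled query dotted with each frame's mean key.
    q_bar = Q_vis.mean(dim=0)                          # (d,)
    k_bar = K_mem.mean(dim=1)                          # (b, d)
    s = k_bar @ q_bar                                  # (b,)

    # Select the k highest-scoring frames; all others are ignored entirely.
    idx = s.topk(min(k, s.numel())).indices
    K_sel = K_mem[idx].reshape(-1, d)                  # (k*n, d)
    V_sel = V_mem[idx].reshape(-1, d)

    # Standard scaled dot-product attention over the selected frames only.
    attn = F.softmax(Q_vis @ K_sel.T / d**0.5, dim=-1) # (t, k*n)
    return attn @ V_sel                                # (t, d)

out = sparse_memory_attention(torch.randn(20, 32),
                              torch.randn(8, 16, 32), torch.randn(8, 16, 32))
print(out.shape)   # (20, 32)
```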

This approach achieves a favorable speed–quality trade-off: only 7.9% throughput reduction (18.7 FPS on NVIDIA H100) compared to the memory-free baseline, while maintaining or exceeding state-of-the-art long-range consistency and text–video alignment scores (VBench-Long quality of 85.02 vs. 84.28 for LongLive) (Ji et al., 16 Dec 2025).

4. Memory in Real-Time Optical Flow Estimation and Prediction

MemFlow has also been developed as a memory-augmented architecture for optical flow estimation and future flow prediction (Dong et al., 7 Apr 2024). The model’s core is a GRU-based recurrent flow update module that interfaces with a memory buffer:

  • Feature and Context Extraction: CNN encoders process frame pairs to produce feature maps; a 4D all-pairs correlation volume supports motion encoding.
  • Temporal Aggregation: The memory buffer $M_{t-1} = \{k_m, v_m\}$ (with key and value tensors from prior time steps) yields an aggregated motion feature $f_{am}$ via attention:

$$f_{am} = f_m + \alpha \cdot \mathrm{Softmax}\left(S \cdot q k^\top\right) v$$

where the scaling $S$ is resolution-adaptive and $\alpha$ is a learnable scalar.

  • Memory Update: Each new motion feature (from the GRU’s final iteration) is projected and inserted into the memory buffer, maintaining a sliding window of past context.

The memory mechanism allows MemFlow to aggregate long-range motion without access to future frames, outperforming VideoFlow and other baselines on Sintel, KITTI-15, and the high-resolution Spring dataset in both accuracy and inference speed.
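
The aggregation step can be sketched in PyTorch as follows, assuming per-pixel attention over a buffer of m memory slots; the tensor layout, the function name aggregate_motion_feature, and the toy scale value are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def aggregate_motion_feature(f_m, q, mem_k, mem_v, alpha, scale):
    """Memory read-out for the optical-flow MemFlow variant (sketch).

    f_m   : (hw, c)    current motion feature (flattened spatial grid)
    q     : (hw, c)    query projected from the motion feature
    mem_k : (m, hw, c) keys stored from earlier time steps
    mem_v : (m, hw, c) matching values
    alpha : learnable scalar; scale : resolution-adaptive scaling S
    The per-pixel attention layout here is an illustrative assumption.
    """
    k = mem_k.permute(1, 0, 2)                        # (hw, m, c)
    v = mem_v.permute(1, 0, 2)                        # (hw, m, c)
    # f_am = f_m + alpha * Softmax(S * q k^T) v, attending over the m
    # memory slots independently at each pixel position.
    logits = scale * torch.einsum('pc,pmc->pm', q, k) # (hw, m)
    attn = F.softmax(logits, dim=-1)
    read = torch.einsum('pm,pmc->pc', attn, v)        # (hw, c)
    return f_m + alpha * read

# Toy usage: a 32x32 grid with 128 channels and a 4-step memory buffer.
hw, c = 32 * 32, 128
f_am = aggregate_motion_feature(torch.randn(hw, c), torch.randn(hw, c),
                                torch.randn(4, hw, c), torch.randn(4, hw, c),
                                alpha=torch.tensor(0.1), scale=c ** -0.5)
print(f_am.shape)   # torch.Size([1024, 128])
```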

5. Computational Characteristics and Benchmarking

Both MemFlow variants emphasize real-time suitability and cross-resolution generalization:

  • Streaming Video Generation (DiT): Throughput of 18.7 FPS at 832×480 with only 7.9% added compute. VBench-Long quality, consistency, and aesthetic metrics outperform competitive models at interactive durations up to 60 seconds. Ablation studies show that both the memory module (NAM) and the efficiency module (SMA) contribute to subject coherence (>98% subject consistency with NAM+SMA vs. 94.4% without memory) (Ji et al., 16 Dec 2025).
  • Optical Flow (CNN backbone): At 15 iterations, the model achieves 5.6 FPS (9.5M params), and 14.5 FPS at 5 iterations with minor accuracy loss. Zero-shot generalization and fine-tuning results on public benchmarks show the model delivering top performance on both classical and high-resolution (1080p Spring) benchmarks with fewer parameters than previous approaches (Dong et al., 7 Apr 2024).

Summary of key results for MemFlow optical flow:

Model Variant        | Sintel (clean) EPE | KITTI-15 EPE | Spring 1px Err (%)
MemFlow (gen)        | 0.93               | 3.88         | 5.76
MemFlow-T            | 0.85               | 3.38         | -
MemFlow (fine-tuned) | 1.05               | 4.10         | 4.48

6. Limitations and Development Directions

Current limitations, as described in the foundational papers:

  • In video generation, excessive memory bank size ($b \gg 3$) leads to degraded short-term coherence due to imbalanced local/global context weighting. The “prototype” (first frame) used for chunk summarization may not capture all relevant context; more expressive or learned prototypes represent a plausible improvement area. Retrieval is currently based on cross-attention with single-modality scores; richer or multi-modal indices may offer improvements (Ji et al., 16 Dec 2025).
  • For optical flow, the temporal memory is limited in both depth and modeling capacity; longer-range future-flow prediction is hampered by error drift. Hierarchical or distilled long-term memory and self-supervised or unsupervised flow learning are suggested as promising future directions (Dong et al., 7 Apr 2024).

7. Significance within the Research Landscape

MemFlow’s dual instantiations—narrative-consistent video generation and memory-based optical flow—demonstrate the critical role of adaptive, prompt- or query-driven memory in overcoming scaling bottlenecks, information redundancy, and loss of temporal coherence in video models. Both models are compatible with real-time deployment scenarios and integrate seamlessly with existing backbone architectures, positioning MemFlow as a template for memory-efficient, context-aware temporal modeling across generative and perceptual tasks. By achieving state-of-the-art performance with modest parameter overhead and practical latency, MemFlow advances memory system design for streaming vision systems (Ji et al., 16 Dec 2025, Dong et al., 7 Apr 2024).
