StoryMem: Memory-Augmented Narrative Modeling
- StoryMem is a family of architectures, benchmarks, and methodologies for memory-augmented narrative modeling, integrating explicit memory into video and text generation.
- It employs techniques like semantic keyframe selection, aesthetic filtering, and dynamic memory updates to maintain cross-shot consistency and narrative coherence in storytelling.
- Benchmark results and ablation studies demonstrate improved prompt following, narrative recall, and memory retrieval performance across vision-language and LLM tasks.
StoryMem refers to a family of architectures, benchmarks, and methodologies for modeling, generating, or evaluating stories—primarily in vision-language and LLM settings—where persistent memory plays a foundational role. The StoryMem framework encompasses explicit memory-augmented video generation pipelines (Zhang et al., 22 Dec 2025), autoregressive story generation with visual memory and visio-lingual co-reference (Rahman et al., 2022), LLM benchmarks for long-term narrative memory (Wan et al., 16 Jun 2025), and early work on semantically supervised chained memory for narrative modeling (Liu et al., 2018).
1. Memory-Augmented Video Storytelling Architectures
StoryMem, as introduced in "StoryMem: Multi-shot Long Video Storytelling with Memory" (Zhang et al., 22 Dec 2025), reformulates long-form video storytelling as iterative shot synthesis conditioned on an explicit and dynamically updated memory bank. The core mechanism is the Memory-to-Video (M2V) module, which augments a pre-trained single-shot video diffusion model (Wan2.2-I2V, equipped with a 3D-VAE encoder and a Diffusion Transformer backbone) with an external, compact memory of informative keyframes selected from generated shots.
Inputs to a shot synthesis step include:
- Noisy video latent
- Memory latent (from stored keyframes)
- Binary mask
- Text embedding of the shot’s prompt
Injection of the memory is accomplished by concatenating the noisy video latent, the memory latent, and the binary mask into one joint sequence and feeding it to the DiT blocks, which are equipped with low-rank LoRA adapters for efficient fine-tuning.
Memory frames are treated as temporally preceding the shot (negative RoPE temporal shifts applied with a fixed stride), allowing cross-attention and spatio-temporal self-attention to span both the memory and the current shot. Keyframes are selected via semantic diversity (CLIP-based cosine similarity thresholding, dynamically adjusted per sequence) and aesthetic filtering (an HPSv3 score threshold).
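The following is a minimal sketch of how such memory injection can be wired up: memory latents are prepended to the noisy shot latents, a binary mask distinguishes conditioning tokens from noisy tokens, and memory frames receive negative RoPE temporal indices. The tensor shapes, the `temporal_stride` default, and the helper name are illustrative assumptions, not the released implementation.

```python
import torch

def build_m2v_inputs(noisy_shot_latent: torch.Tensor,
                     memory_latent: torch.Tensor,
                     temporal_stride: int = 4):
    """Concatenate memory and shot latents along the frame axis and assign
    RoPE temporal indices so that memory frames appear *before* the shot.

    Shapes (illustrative): latents are (frames, channels, height, width).
    """
    n_mem = memory_latent.shape[0]
    n_shot = noisy_shot_latent.shape[0]

    # Binary mask: 1 marks clean memory (conditioning) frames, 0 marks noisy shot frames.
    mask = torch.cat([torch.ones(n_mem), torch.zeros(n_shot)])

    # Memory frames get negative temporal positions (..., -2s, -s) so that
    # spatio-temporal self-attention treats them as preceding context.
    mem_pos = -temporal_stride * torch.arange(n_mem, 0, -1)
    shot_pos = torch.arange(n_shot)
    temporal_positions = torch.cat([mem_pos, shot_pos])

    # The DiT sees one joint sequence; LoRA-adapted blocks attend across it.
    joint_latent = torch.cat([memory_latent, noisy_shot_latent], dim=0)
    return joint_latent, mask, temporal_positions


# Example with dummy latents: 16 shot frames, 4 memory frames, 16 latent channels.
latents, mask, pos = build_m2v_inputs(torch.randn(16, 16, 32, 32),
                                      torch.randn(4, 16, 32, 32))
print(latents.shape, mask.shape, pos[:6])
```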
2. Memory Bank Design and Update Mechanisms
The StoryMem memory bank is a set of VAE-encoded latent frames representing semantically salient and aesthetically preferred moments across previous shots (Zhang et al., 22 Dec 2025).
Key components:
- Semantic Keyframe Selection uses CLIP features to ensure memory frames have low within-memory redundancy, applying a dynamically adjusted cosine-similarity threshold.
- Aesthetic Preference Filtering excludes frames whose HPSv3 scores fall below a threshold.
- Hybrid Sink+Sliding Window Pruning caps the memory size by anchoring the earliest ("sink") and most recently selected frames and discarding intermediate frames when over capacity.
- The selected keyframes are VAE-encoded into a compact memory latent that is injected into the next shot's generation (see the sketch below).
This mechanism achieves high cross-shot character and background consistency and supports both smooth transitions (by reusing the last generated frame as the first frame of the next shot) and explicit "hard cut" scene changes (by omitting that reuse).
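A schematic of the selection-and-pruning loop described above, assuming precomputed CLIP features and an external aesthetic scorer; the thresholds, capacity, sink size, and the `clip_embed`/`aesthetic_score` callables are placeholders rather than the paper's exact settings.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def update_memory(memory, new_frames, clip_embed, aesthetic_score,
                  sim_threshold=0.85, aes_threshold=0.5,
                  capacity=8, n_sink=2):
    """Add semantically novel, aesthetically acceptable keyframes to the memory.

    memory: list of (frame, clip_feature) pairs from earlier shots.
    new_frames: candidate frames from the freshly generated shot.
    """
    for frame in new_frames:
        feat = clip_embed(frame)
        # Semantic diversity: skip frames too similar to anything already stored.
        if any(cosine(feat, f) > sim_threshold for _, f in memory):
            continue
        # Aesthetic preference: skip low-quality frames.
        if aesthetic_score(frame) < aes_threshold:
            continue
        memory.append((frame, feat))

    # Hybrid sink + sliding-window pruning: keep the earliest "sink" frames
    # and the most recent frames, dropping intermediates when over capacity.
    if len(memory) > capacity:
        memory[:] = memory[:n_sink] + memory[-(capacity - n_sink):]
    return memory


# Toy usage with random vectors standing in for frames and their CLIP features.
rng = np.random.default_rng(0)
mem = update_memory([], [rng.normal(size=64) for _ in range(12)],
                    clip_embed=lambda x: x / np.linalg.norm(x),
                    aesthetic_score=lambda x: 0.9)
print(len(mem))  # capped at capacity
```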
3. Autoregressive Story Generation with Visual Memory
"Make-A-Story" (Rahman et al., 2022) presents a StoryMem variant where frame-by-frame story visualization is conditioned on both current sentence context and an evolving ordered memory of prior frames, each slot recording:
- Key: embedding of the $i$-th sentence
- Value: pooled latent feature of the $i$-th generated frame
Memory retrieval is performed via sentence-conditioned soft attention: for the $t$-th sentence, a query is computed from its embedding and attended over the memory keys, and the resulting context vector is injected into each U-Net block alongside the sentence embedding. Memory grows with each generated frame (unless explicitly pruned). Reference and co-reference are handled implicitly; attention to antecedents is learned as part of the unified diffusion objective, with no auxiliary consistency loss required.
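The retrieval step can be pictured as a standard soft-attention read, sketched below; the query projection and the dimensionalities are illustrative rather than Make-A-Story's exact parameterization.

```python
import torch
import torch.nn.functional as F

def retrieve_memory_context(sentence_emb: torch.Tensor,
                            mem_keys: torch.Tensor,
                            mem_values: torch.Tensor,
                            w_query: torch.Tensor) -> torch.Tensor:
    """Soft-attention read over the visual memory.

    sentence_emb: (d_text,)  embedding of the current sentence.
    mem_keys:     (n, d_text) embeddings of previously seen sentences.
    mem_values:   (n, d_vis)  pooled latent features of previously generated frames.
    w_query:      (d_text, d_text) learned query projection (illustrative).
    """
    query = sentence_emb @ w_query                          # (d_text,)
    scores = mem_keys @ query / mem_keys.shape[-1] ** 0.5   # (n,) scaled dot products
    weights = F.softmax(scores, dim=0)                      # attention over memory slots
    context = weights @ mem_values                          # (d_vis,) context for the U-Net
    return context


# Toy example: 5 memory slots, 512-dim text embeddings, 256-dim visual features.
ctx = retrieve_memory_context(torch.randn(512),
                              torch.randn(5, 512),
                              torch.randn(5, 256),
                              torch.eye(512))
print(ctx.shape)  # torch.Size([256])
```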
4. Benchmarking Long-Term Narrative Memory in LLMs
StoryMem is also used to refer to a rigorous benchmark for evaluating LLMs’ long-term memory in dynamic, branching narrative environments (Wan et al., 16 Jun 2025). Here, an interactive fiction game is formalized as a branching decision graph (DAG) of scene and choice nodes with deep branching and causal dependencies.
Crucial features include:
- Immediate Feedback and Self-Recovery Evaluation modes, capturing both local correction and global narrative recall.
- A hand-constructed dataset derived from "The Invisible Guardian," with 311 scene nodes, 86 choice nodes, and 80 branching paths of mean length approximately $20$.
- Metrics span overall accuracy, first-try accuracy, knowledge retention (longest correct streak), sequential reasoning (easy/hard split), and efficiency (runtime, token usage).
- Failure analysis reveals knowledge-loss errors (contradicting distant events), shallow backtracking, and retry loops; performance degrades without explicit feedback or structured memory access.
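For concreteness, the sketch below shows one way an immediate-feedback traversal could track first-try accuracy, overall accuracy with a bounded retry budget, and the longest correct streak; the node schema, retry limit, and agent interface are assumptions for illustration, not the benchmark's released harness.

```python
from dataclasses import dataclass, field

@dataclass
class ChoiceNode:
    prompt: str
    options: list                                  # candidate actions shown to the model
    correct: int                                   # index of the story-consistent choice
    history: list = field(default_factory=list)    # earlier scene text this choice depends on

def evaluate_immediate_feedback(agent, path, max_retries=2):
    """Walk one branching path in immediate-feedback mode: after a wrong answer
    the agent is told it was wrong and may retry a bounded number of times.
    Returns first-try accuracy, overall accuracy, and the longest correct streak."""
    first_try = overall = streak = best_streak = 0
    for node in path:
        for attempt in range(max_retries + 1):
            answer = agent(node.prompt, node.options, node.history, attempt)
            if answer == node.correct:
                overall += 1
                if attempt == 0:           # correct without feedback
                    first_try += 1
                    streak += 1
                    best_streak = max(best_streak, streak)
                else:                      # recovered only after feedback
                    streak = 0
                break
        else:                              # never recovered within the retry budget
            streak = 0
    n = len(path)
    return {"first_try_acc": first_try / n,
            "overall_acc": overall / n,
            "longest_correct_streak": best_streak}


# Toy run: an "agent" that always picks option 0 on a path with alternating answers.
nodes = [ChoiceNode("What next?", ["go left", "go right"], correct=i % 2) for i in range(6)]
print(evaluate_immediate_feedback(lambda p, o, h, a: 0, nodes))
```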
5. Semantic-Supervised Memory Chains for Narrative Modeling
Prior to large-scale video-centric or LLM memory benchmarks, StoryMem-style memory chaining was developed for text-based narrative modeling in "Narrative Modeling with Memory Chains and Semantic Supervision" (Liu et al., 2018). Key ideas include:
- Four external memory chains (as in EntNet): event, sentiment, topic (semantically supervised), and a "free" chain.
- A BiGRU encodes the context; each memory chain updates at every time step via a candidate memory vector, an attention gate, and semantically supervised binary aspect triggers (see the sketch after this list).
- Gate supervision uses FrameNet, POS, and sentiment lexicons for event/sentiment/topic detection, causing distinct clustering of chain representations.
- The model achieves state-of-the-art Story Cloze Test performance (78.5% accuracy), outperforming non-semantic baselines.
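A sketch of one EntNet-style chain update consistent with this description; the matrix names, nonlinearity, and dimensions are illustrative, and the semantic supervision of the gate is indicated only in a comment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_chain(h, w, s, U, V, W):
    """One EntNet-style step for a single memory chain.

    h: (d,) current chain memory    w: (d,) chain key
    s: (d,) BiGRU encoding of the current time step
    U, V, W: (d, d) learned projections (illustrative names).
    """
    gate = sigmoid(s @ h + s @ w)             # scalar attention gate; in the paper this
                                              # value is additionally supervised by
                                              # event/sentiment/topic trigger labels
    candidate = np.tanh(U @ h + V @ w + W @ s)
    h_new = h + gate * candidate
    return h_new / (np.linalg.norm(h_new) + 1e-8), gate


# Toy step for the four chains (event, sentiment, topic, free).
rng = np.random.default_rng(0)
d = 32
chains = {name: rng.normal(size=d) for name in ["event", "sentiment", "topic", "free"]}
keys = {name: rng.normal(size=d) for name in chains}
s_t = rng.normal(size=d)
U, V, W = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
gates = {}
for name in chains:
    chains[name], gates[name] = update_chain(chains[name], keys[name], s_t, U, V, W)
print({k: round(float(v), 3) for k, v in gates.items()})
```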
6. Comparative Results and Ablations
Quantitative evaluation across vision-language and LLM domains demonstrates the impact of memory-augmented StoryMem architectures:
Multi-shot Visual Storytelling (Zhang et al., 22 Dec 2025):
| Method | Aesthetic | Prompt Following (Global) | Cross-shot Consistency |
|---|---|---|---|
| Wan2.2-T2V | 0.6452 | 0.2174 | 0.3937 |
| StoryDiffusion+Wan2.2-I2V | 0.6085 | 0.2288 | 0.4600 |
| IC-LoRA+Wan2.2-I2V | 0.5704 | 0.2131 | 0.4110 |
| HoloCine | 0.5653 | 0.2199 | 0.4628 |
| StoryMem | 0.6133 | 0.2289 | 0.5065 |
LLM Long-Term Memory Benchmark (Wan et al., 16 Jun 2025):
| Model | Overall Acc (IF) | First-Try Acc (IF) | LongestCorr (IF) |
|---|---|---|---|
| Doubao 1.5-pro | 80.98% | 79.14% | 10.0 |
| GPT-4o | 71.88% | 63.49% | 8.0 |
| Claude 3.5 | 74.86% | 68.21% | 8.5 |
| DeepSeek-R1 | 70.45% | 65.16% | 10.2 |
Ablation results demonstrate loss of cross-shot consistency when semantic selection or aesthetic filtering is omitted (Zhang et al., 22 Dec 2025).
7. Extensions, Limitations, and Directions
Advantages of StoryMem approaches include:
- Dynamic, explicit, and context-sensitive memory updates tuned to narrative or visual structure.
- Robust cross-shot or cross-turn consistency, both for global (character/background) and local (scene-cut) continuity.
- Unified optimization objectives in most cases—memory integration does not require complex auxiliary losses.
Limitations and open avenues:
- Memory bank size grows with story length in naive implementations; research on pruning/summarization is ongoing.
- There is no guarantee that memory strictly enforces reference resolution unless it is guided by additional supervision or losses.
- LLM benchmarks show distinct tradeoffs between immediate correction and deep causal mapping; memory decay and shallow error localization remain obstacles.
Research suggests further improvement is possible through hierarchical memory structures, explicit backtracking, external key–value stores, and combined multi-modal prompting for causal narrative tracking (Wan et al., 16 Jun 2025).