Autoregressive Story Generation with Visual Memory

Updated 5 January 2026
  • Surveyed papers introduce memory mechanisms that condition each generated panel on prior outputs and captions, ensuring referential and visual coherence.
  • Methodologies leverage cross-attention, sliding-window, and state-space models to maintain long-range narrative consistency in illustrated stories.
  • Empirical evaluations demonstrate improved FID scores and character accuracy compared to independent frame generation approaches.

Autoregressive story generation with visual memory refers to the sequence-wise synthesis of coherent illustrated narratives, where each panel or frame is conditioned not only on its associated text prompt but also on a structured representation of previously generated panels and their captions, often incorporating architectural mechanisms to preserve long-range consistency in character, background, and storyline context. This paradigm addresses the limitations of frame-wise independent diffusion models and basic causal attention by introducing explicit or implicit memory systems capable of carrying over high-level semantic and visual details, resolving references, and managing compounding generation errors.

1. Key Architectural Principles in Autoregressive Visual Story Models

Autoregressive visual story generation departs from simple text-to-image models by introducing mechanisms that condition the generation of each panel $x_j$ on previous outputs $\hat{x}_{<j}$ and captions $c_{\leq j}$. Formally, models such as AR-LDM implement

$$P(X \mid C) = \prod_{j=1}^{L} P\big(x_j \mid \hat{x}_{<j},\, c_{\leq j}\big)$$

where generation proceeds step-wise, with a multimodal “visual memory” $\varphi_j$ encoding the historical context utilized at every inference and denoising step (Pan et al., 2022).
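
As a schematic illustration of this factorization, the sketch below generates panels one at a time, re-encoding the accumulated history before each step; `encode_history` and `diffusion_sample` are hypothetical placeholders for a history encoder and latent-diffusion sampler, not the actual AR-LDM interfaces.

```python
# Illustrative sketch of the autoregressive factorization
#   P(X | C) = prod_j P(x_j | x_hat_{<j}, c_{<=j}),
# assuming hypothetical `encode_history` and `diffusion_sample` helpers.
from typing import Any, Callable, List

def generate_story(
    captions: List[str],
    encode_history: Callable[[List[Any], List[str]], Any],
    diffusion_sample: Callable[[Any, str], Any],
) -> List[Any]:
    """Generate panels one at a time, conditioning each on all prior panels/captions."""
    panels: List[Any] = []
    for j, caption in enumerate(captions):
        # phi_j: multimodal "visual memory" summarizing x_hat_{<j} and c_{<=j}
        phi_j = encode_history(panels, captions[: j + 1])
        # Sample panel x_j conditioned on the memory and the current caption
        x_j = diffusion_sample(phi_j, caption)
        panels.append(x_j)
    return panels
```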

Other frameworks, such as Make-A-Story (Story-LDM), employ a key–value memory bank $M^{<m} = \{(K^k, V^k)\}_{k=0,\dots,m-1}$ at each step $m$, where keys $K^k$ are derived from sentence embeddings and values $V^k$ encode the latent representations of previous panels. Attention queries generated from the current sentence $S^m$ soft-select memory slots for referential resolution and visual consistency (Rahman et al., 2022).
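
The soft selection over memory slots amounts to scaled dot-product attention between a query derived from the current sentence and keys/values stored for earlier panels. A minimal sketch follows, with assumed tensor shapes and no claim to match the exact Story-LDM implementation.

```python
import torch

def attend_memory(query: torch.Tensor,
                  keys: torch.Tensor,
                  values: torch.Tensor) -> torch.Tensor:
    """Soft-select memory slots for the current panel.

    query:  (B, d)      embedding of the current sentence S^m
    keys:   (B, m, d)   K^k from previous sentence embeddings, k < m
    values: (B, m, d)   V^k from previous panel latents, k < m
    Returns a (B, d) context vector mixing past visual/textual state.
    """
    d = query.shape[-1]
    # Scaled dot-product attention scores over the m memory slots
    scores = torch.einsum("bd,bmd->bm", query, keys) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)          # (B, m)
    return torch.einsum("bm,bmd->bd", weights, values)
```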

State-of-the-art approaches further refine the memory system. VideoSSM frames long-horizon video (and by extension story) synthesis as a recurrent dynamical process, leveraging a hybrid of local (sliding-window) and global (state-space model) memory for scalable minute-scale coherence (Yu et al., 4 Dec 2025). Memorize-and-Generate (MAG) decouples memory compression and generation, using transformer-based KV compression and retrieval-style cross-attention for efficient retention and utilization (Zhu et al., 21 Dec 2025). ContextualStory introduces temporal and spatially-enhanced attention mechanisms ("SETA"), as well as storyline context and explicit scene-change adapters to robustly maintain continuity under substantial character or background shifts (Zheng et al., 2024).

2. Memory Representations and Conditioning Mechanisms

Visual memory modules range from simple history stacks to sophisticated compression and state-tracking systems. Common representations include:

  • Key–Value Memory Banks: Each previously generated panel and its caption are encoded as keys ($K^k$) and values ($V^k$), enabling targeted retrieval via attention for both textual and visual context. These facilitate implicit coreference and background recall (Rahman et al., 2022).
  • Multimodal Fusion: BLIP-based encoders fuse image and text for each historical frame, stacked with type and time embeddings to form the memory vector $\varphi_j$ injected into UNet denoisers (Pan et al., 2022).
  • Sliding-Window Attention: Local windows maintain lossless memory over the last $L$ frames, capturing motion cues and detail, and are complemented by global compressed memory for efficiency (Yu et al., 4 Dec 2025).
  • State-Space Models (SSMs): Global memory is modeled as a recurrent hidden state $x_t$ with continuous/discrete update dynamics (a minimal recurrence sketch follows this list):

$$x_t = \bar{A}\, x_{t-1} + \bar{B}\, u_t$$

where $u_t$ summarizes evicted tokens or latent vectors. The output $m_t$ is injected into the transformer backbone, supporting coordinated long-term scene and character consistency (Yu et al., 4 Dec 2025).
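
Below is the minimal recurrence referenced in the state-space bullet above, written as a toy PyTorch module; the fixed decay-style initialization of $\bar{A}$ and the output projection producing $m_t$ are illustrative assumptions rather than VideoSSM's actual parameterization.

```python
import torch

class GlobalStateMemory(torch.nn.Module):
    """Toy linear state-space memory: x_t = A_bar @ x_{t-1} + B_bar @ u_t.

    `u_t` is assumed to be a summary of tokens evicted from the local window;
    the learned output projection producing m_t is an illustrative stand-in.
    """
    def __init__(self, d_state: int, d_input: int, d_out: int):
        super().__init__()
        self.A_bar = torch.nn.Parameter(torch.eye(d_state) * 0.99)        # decay-like dynamics
        self.B_bar = torch.nn.Parameter(torch.randn(d_state, d_input) * 0.01)
        self.out = torch.nn.Linear(d_state, d_out)                        # maps state to m_t

    def forward(self, x_prev: torch.Tensor, u_t: torch.Tensor):
        # x_prev: (B, d_state), u_t: (B, d_input)
        x_t = x_prev @ self.A_bar.T + u_t @ self.B_bar.T
        m_t = self.out(x_t)   # injected into the transformer backbone
        return x_t, m_t
```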

Conditioning mechanisms typically involve:

  • Cross-attention to memory banks (keys/values)
  • Fusion of current panel and storyline embeddings
  • Router gates and position-aware adapters to align local and global context (an illustrative gated-fusion sketch follows this list)
  • Prompt-adaptive control allowing for smooth integration of new narrative cues without historical context loss (Yu et al., 4 Dec 2025)
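
The sketch below illustrates one simple way a router-style gate could blend local-window context with global memory output; it is an assumed convex interpolation, not the specific position-aware router described in VideoSSM.

```python
import torch

class LocalGlobalGate(torch.nn.Module):
    """Illustrative gate blending local-window context with global SSM memory."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = torch.nn.Linear(2 * d_model, d_model)

    def forward(self, local_ctx: torch.Tensor, global_ctx: torch.Tensor) -> torch.Tensor:
        # local_ctx, global_ctx: (B, T, d_model)
        g = torch.sigmoid(self.gate(torch.cat([local_ctx, global_ctx], dim=-1)))
        # Per-channel convex combination of short-range detail and long-range state
        return g * local_ctx + (1.0 - g) * global_ctx
```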

3. Training Objectives and Loss Formulations

Training regimes integrate denoising losses with objectives designed for context integrity and consistency:

  • Masked Denoising (StoryImager, AR-LDM): Loss computed only on target frames, ignoring others via binary masks $m$ (see the sketch after this list):

$$\mathcal{L}_{\rm DM} = \mathbb{E}_{\epsilon, t}\big\| m \odot \epsilon - m \odot \epsilon_\theta(z_t, t, s) \big\|_2^2$$

(Tao et al., 2024)

  • Contextual/Memory Attention Losses: Full story loss is a sum over per-frame conditional denoising,

$$\mathcal{L}_{\mathrm{story}} = \sum_{m=0}^{M} \mathcal{L}_m$$

with each $\mathcal{L}_m$ incorporating memory-attended features (Rahman et al., 2022).

  • Distribution-Matching Distillation (DMD): VideoSSM and MAG promote long-horizon consistency by matching the autoregressive student’s distribution to a bidirectional teacher or latent teacher over sampled windows, forcing memory modules to self-correct over extended sequences (Yu et al., 4 Dec 2025, Zhu et al., 21 Dec 2025).
  • Contrastive and State-Alignment Losses (VRAG adaptation): Encourage generated panels to preserve object identity and match retrieved memory/context; penalize deviations in global state/plot embedding across panels (Chen et al., 28 May 2025).
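
As referenced in the masked-denoising bullet above, the sketch below computes that loss under assumed tensor shapes, normalizing by the number of supervised elements; it is an illustration of the objective, not the StoryImager or AR-LDM training code.

```python
import torch

def masked_denoising_loss(eps: torch.Tensor,
                          eps_pred: torch.Tensor,
                          frame_mask: torch.Tensor) -> torch.Tensor:
    """L_DM = || m * eps - m * eps_theta(z_t, t, s) ||^2, averaged over masked entries.

    eps, eps_pred: (B, F, C, H, W)  true vs. predicted noise per story frame
    frame_mask:    (B, F)           1 for target frames, 0 for context frames
    """
    m = frame_mask[:, :, None, None, None].to(eps.dtype)   # broadcast over C, H, W
    sq_err = (m * eps - m * eps_pred) ** 2
    # Normalize by the number of supervised elements so the scale is mask-independent
    return sq_err.sum() / m.expand_as(eps).sum().clamp(min=1.0)
```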

4. Advances in Scalability, Efficiency, and Reference Resolution

Modern frameworks address scalability and efficiency via:

  • Compressed Memory Representations: MAG demonstrates near-lossless, $N$-to-1 KV compression per block, mitigating memory growth and supporting longer narratives (Zhu et al., 21 Dec 2025).
  • Sliding Window and Global State Partitioning: Systems such as VideoSSM decouple a local lossless motion cache from summary-based global SSM state, ensuring $O(Ld + d^2)$ complexity per frame; sampling remains feasible even for minute-scale or multi-panel sequences (a toy cache illustrating this partition follows the list) (Yu et al., 4 Dec 2025).
  • Spatially-enhanced Attention: ContextualStory’s SETA restricts attention to local spatial neighborhoods across frames, efficiently capturing temporal dependencies under large movements and reducing redundancy (Zheng et al., 2024).
  • Unified Masking and Bidirectional Generation: StoryImager enables flexible inclusion/exclusion of panels, supporting inpainting, backtracking, and continuation in a single framework via its frame masking strategy (Tao et al., 2024).
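
The toy cache below illustrates the local/global partition mentioned in the list: the newest $L$ frame latents are kept losslessly, while evicted ones are folded into a running-mean summary, which stands in (as a deliberately simple assumption) for MAG's learned $N$-to-1 KV compression or VideoSSM's state-space update.

```python
from collections import deque
import torch

class HybridMemoryCache:
    """Toy hybrid memory: lossless sliding window + compressed global summary.

    The running-mean summary is an illustrative stand-in for learned N-to-1 KV
    compression (MAG) or a state-space update (VideoSSM).
    """
    def __init__(self, window: int, d_model: int):
        self.window = window
        self.local: deque = deque(maxlen=window)    # last L frame latents, lossless
        self.summary = torch.zeros(d_model)         # compressed global memory
        self.n_compressed = 0

    def push(self, frame_latent: torch.Tensor) -> None:
        # frame_latent: (d_model,) pooled latent of the newest frame
        if len(self.local) == self.window:
            evicted = self.local[0]                 # deque drops it on append
            self.n_compressed += 1
            self.summary += (evicted - self.summary) / self.n_compressed
        self.local.append(frame_latent)

    def context(self) -> torch.Tensor:
        # Concatenate the global summary with the local window for conditioning
        return torch.stack([self.summary, *self.local], dim=0)   # (1 + |window|, d_model)
```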

Implicit reference resolution is addressed primarily through memory-attended cross-modal fusion. Models such as Story-LDM demonstrate high actor and background consistency on referential datasets (e.g., achieving char-acc 69.2% and FID 69.5 versus GAN/vanilla LDM baselines) without explicit coreference resolution, relying on learned dot-product attention over semantic keys and latent features (Rahman et al., 2022).

5. Quantitative Results and Empirical Evaluations

State-of-the-art autoregressive models with visual memory deliver substantial improvements over naive autoregressive or independent frame generation. Selected results include:

| Model | Dataset / Task | FID (or other fidelity metric) | Char F1 / Acc (%) | Notable Consistency Results |
|---|---|---|---|---|
| AR-LDM (Pan et al., 2022) | PororoSV visualization | 16.59 | - | Human quality win: 90.6% |
| ContextualStory (Zheng et al., 2024) | PororoSV visualization | 13.61 | 77.24 / 51.59 | Storyline context, SETA gains |
| StoryImager (Tao et al., 2024) | PororoSV visualization | 15.63 | - | Unified masking, efficient training/inference |
| MAG (Zhu et al., 21 Dec 2025) | MAG-Bench | SSIM 0.66 | - | Best-match LPIPS: 0.17 |
| VideoSSM (Yu et al., 4 Dec 2025) | Video (60 s) | - | Subj. >92%, Bg >93% | Sustained long-term consistency |
| VRAG (Chen et al., 28 May 2025) | Minecraft video | SSIM 0.506 | - | +8.6%/8.8% SSIM vs. baseline |

Additional ablation studies confirm the significance of memory modules: removal of SETA, storyline contextualizer, or story-flow adapters in ContextualStory results in marked degradation of FID and character accuracy; restricting context windows or ignoring explicitly retrievable memory in VRAG yields lower consistency and higher compounding error (Zheng et al., 2024, Chen et al., 28 May 2025).

6. Limitations, Open Problems, and Prospective Directions

Common challenges persist across models:

  • Linear Memory Growth: Several models accumulate key–value pairs without compression or summarization, leading to increased computational cost for extended narratives (Rahman et al., 2022, Pan et al., 2022).
  • Implicit vs. Explicit Reference Resolution: Purely memory-attended mechanisms may fail on ambiguous referents or backgrounds with weak textual/visual cues; future directions point to hierarchical or disentangled memory systems for robust scaling (Rahman et al., 2022).
  • Long-range Temporal Coherence: Minute-scale or multi-panel narratives require recurrent or hybrid memory architectures; state-space modeling and distillation mitigate, but do not eliminate, compounding errors (Yu et al., 4 Dec 2025, Chen et al., 28 May 2025).
  • Bidirectional and Unified Modeling: StoryImager addresses the need for unified, bidirectional modeling, but many pipelines remain uni-directional or require separate training schedules (Tao et al., 2024).

A plausible implication is that scalable hybrid memory systems (e.g., combining compressive state-space models and retrieval-augmented key–value caches) will be integral to sustaining high-fidelity, referentially coherent story generation as applications demand longer, more structurally complex visual narratives.

7. Cross-modal Interaction and Future Research Directions

Emerging architectures increasingly focus on cross-modal attention—BLIP-style multimodal encoders, storyline contextualizers, and frame-story cross-attention modules—as essential for fusing text, context, and visual features across narrative sequences (Pan et al., 2022, Zheng et al., 2024, Tao et al., 2024). Future research will likely explore:

  • Hierarchical and Disentangled Memory: Explicit separation of actor, background, and object-level memory slots, potentially using scene graphs or external symbolic embeddings.
  • Adaptive Compression and Routing: Integration of dynamic memory gating and routing (as in VideoSSM’s position-aware router and gated $\Delta$-rule) to balance immediate detail with long-term structure (Yu et al., 4 Dec 2025).
  • Unified Multitask Training: Combining story visualization, continuation, infilling, and editing in a single unified framework using masking and bidirectional objectives (Tao et al., 2024).

Continued benchmarking on referential, long-horizon datasets (MUGEN, PororoSV, FlintstonesSV) and specialized metrics (Char F1, FrameAcc, FSD, best-match LPIPS) will remain vital for rigorous quantitative comparison and for guiding the evolution of autoregressive story generation with visual memory.
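
As one concrete example of such metrics, character F1 is often computed as a micro-averaged F1 over per-frame character sets; the sketch below follows that convention, with the caveat that the exact matching protocol used by PororoSV/FlintstonesSV evaluations may differ.

```python
from typing import List, Set

def char_f1(pred: List[Set[str]], gold: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-frame character sets (illustrative protocol)."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    fp = sum(len(p - g) for p, g in zip(pred, gold))
    fn = sum(len(g - p) for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: two generated frames vs. ground-truth character annotations
print(char_f1([{"Pororo", "Crong"}, {"Loopy"}], [{"Pororo"}, {"Loopy", "Eddy"}]))
```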
