StoryDiffusion: A Stable Diffusion XL Variant
- StoryDiffusion is a plug-in variant of Stable Diffusion XL that uses Consistent Self-Attention (CSA) and a Semantic Motion Predictor (SMP) to maintain subject consistency across image sequences and videos.
- CSA operates zero-shot: it replaces the self-attention layers and reuses the pretrained diffusion backbone weights without fine-tuning.
- Reported benchmarks show higher CLIP similarity and user preference than prior methods, indicating better character coherence and smoother motion transitions.
StoryDiffusion is a plug-in variant of Stable Diffusion XL (SD XL) designed to produce subject-consistent image sequences and smooth transition videos from text stories, addressing the long-standing issue of maintaining visual consistency for complex subjects across a generated sequence. The framework introduces Consistent Self-Attention (CSA) for training-free cross-frame content binding within image batches and a Semantic Motion Predictor (SMP) for generating temporally coherent intermediate video frames. StoryDiffusion operates as a zero-shot augmentation requiring no finetuning of the underlying diffusion backbone and remains fully compatible with pretrained SD XL or SD 1.5 weights (Zhou et al., 2024).
1. Architectural Foundations and Baseline Limitations
Stable Diffusion XL (SD XL) is a latent diffusion model. Images are mapped to and from a compact latent space by a pretrained Variational Autoencoder (VAE). Generation is handled by a U-Net denoiser $\epsilon_\theta(z_t, t, c)$, where $z_t$ denotes the noisy latent, $t$ is the timestep, and $c$ is a text embedding from CLIP. Each U-Net block comprises (i) spatial self-attention, (ii) cross-attention to CLIP text tokens, and (iii) feed-forward layers. Samplers such as DDIM or DDPM iteratively denoise random Gaussian latents into final image latents, which the VAE decoder transforms into pixel-space images.
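As a rough illustration of the iterative denoising described above, the following minimal sketch shows one deterministic DDIM update (the $\eta = 0$ case) on a flat list of latent values. The function name `ddim_step` and the `alpha_bar_*` arguments (cumulative noise-schedule products) are illustrative assumptions and are not taken from the StoryDiffusion code; a real sampler works on tensors and obtains `eps_pred` from the U-Net.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of latent values.

    alpha_bar_t and alpha_bar_prev are cumulative noise-schedule products
    in (0, 1]; alpha_bar_prev belongs to the less noisy timestep.
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean latent implied by the predicted noise.
    x0_pred = [(x - sqrt_one_minus_a_t * e) / sqrt_a_t
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the previous (less noisy) timestep.
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * x0 + sqrt_one_minus_a_prev * e
            for x0, e in zip(x0_pred, eps_pred)]
```

Repeating this step over a decreasing sequence of timesteps (50 in the configuration reported later in this article) takes a Gaussian latent to the final image latent.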
SD XL generates the image for each prompt independently. Consequently, when asked to generate a narrative or comic, i.e., a sequence of images involving recurring subjects, SD XL has no mechanism for cross-frame information flow. As a result, character identity, attire, and object persistence are not maintained, leading to subject drift across images (Zhou et al., 2024).
2. Consistent Self-Attention Mechanism
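Consistent Self-Attention (CSA) replaces every self-attention layer in the SD XL U-Net to facilitate subject coherence across a batch of related images. Given a batch of spatial features $\mathcal{I} \in \mathbb{R}^{B \times N \times C}$ ($B$ images, $N$ tokens per image, $C$ channels), with $I_i$ the tokens of image $i$, standard self-attention computes each output from that image's own tokens only:
$$O_i = \mathrm{Attention}(Q_i, K_i, V_i), \qquad Q_i = I_i W_Q,\quad K_i = I_i W_K,\quad V_i = I_i W_V.$$
CSA modifies this by sampling a subset of tokens $S_i$ from all images in the batch except $I_i$, using a random sampling rate $r$:
$$S_i = \mathrm{RandSample}_r(I_1, \dots, I_{i-1}, I_{i+1}, \dots, I_B).$$
The key/value pool for each image is augmented as
$$P_i = \mathrm{Concat}(I_i, S_i),$$
which shares the original SD XL projection weights:
$$K_{P_i} = P_i W_K, \qquad V_{P_i} = P_i W_V.$$
The modified self-attention becomes
$$O_i = \mathrm{Attention}(Q_i, K_{P_i}, V_{P_i}).$$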
This structure enables each image within a batch to aggregate context from sampled spatial tokens of the other images, enforcing consistency in recurring visual attributes (e.g., facial features, clothing, or props) across the generated sequence. Implementation ablations show that subject consistency degrades when the sampling rate is set too low. CSA is implemented in a tile-wise manner to control memory overhead (Zhou et al., 2024).
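The mechanism described in this section can be sketched in a few lines of dependency-free Python. This is a minimal, single-head illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `softmax`, `attention`, `consistent_self_attention`), the per-image uniform sampling, and the use of nested lists instead of tensors are all hypothetical choices made for readability, and tiling and the output projection are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def consistent_self_attention(batch, w_q, w_k, w_v, rate, rng):
    """batch: list of B images, each a list of N token vectors of length C."""
    outputs = []
    for i, tokens in enumerate(batch):
        # Collect the tokens of every other image in the batch.
        others = [tok for j, img in enumerate(batch) if j != i for tok in img]
        # Randomly keep a fraction `rate` of them (assumed uniform sampling).
        sampled = rng.sample(others, int(rate * len(others)))
        # Key/value pool: the image's own tokens plus the sampled tokens.
        pool = tokens + sampled
        q = matmul(tokens, w_q)   # queries come from the image itself
        k = matmul(pool, w_k)     # keys and values reuse the same projections
        v = matmul(pool, w_v)
        outputs.append(attention(q, k, v))
    return outputs
```

With `rate = 0.0` the pool contains only the image's own tokens, so the function reduces to ordinary self-attention; raising `rate` lets each image attend to more tokens from its neighbours at a higher memory cost.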
3. Zero-Shot Plug-In and Hyperparameterization
CSA functions as a hot-plug replacement for the self-attention layers in the SD XL U-Net. No weight finetuning or retraining is required; the original pretrained SD XL Q/K/V weights are reused in CSA. Inference consists of:
- Loading SD XL backbone weights;
- Replacing self-attention with CSA in all U-Net blocks;
- Running the DDIM sampler (typically 50 steps) with classifier-free guidance (guidance scale 5.0);
- Setting the spatial token sampling rate and limiting memory via the tile size.
A direct implication is rapid adaptation to arbitrary stories or image batch sizes without retraining (Zhou et al., 2024).
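For readers unfamiliar with classifier-free guidance, the sampler step listed above combines an unconditional and a text-conditional noise prediction at every denoising step. The sketch below shows the standard combination rule on flat lists; the function name is hypothetical, and the default of 5.0 is the guidance value given in the experiment configuration in Section 7.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=5.0):
    """Combine unconditional and text-conditional noise predictions.

    A scale of 1.0 returns the conditional prediction unchanged; larger
    values push the result further in the direction of the text condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

For example, with an unconditional value of 1.0 and a conditional value of 2.0, a scale of 5.0 yields 1.0 + 5.0 × (2.0 − 1.0) = 6.0.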
4. Semantic Motion Predictor for Video Synthesis
To extend subject-consistent generation to videos, StoryDiffusion introduces the Semantic Motion Predictor (SMP), which interpolates and refines semantic trajectories between keyframes in CLIP-embedding space to produce motion-consistent intermediates. The SMP pipeline comprises:
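(a) Semantic Encoding: A pretrained CLIP image encoder ($\mathcal{E}$) computes semantic embeddings for the source and target frames $F_s$ and $F_e$: $E_s = \mathcal{E}(F_s)$, $E_e = \mathcal{E}(F_e)$.
(b) Trajectory Initialization: Linear interpolation in semantic space generates a crude intermediate embedding sequence:
$$E_l = (1 - \alpha_l)\,E_s + \alpha_l\,E_e, \qquad l = 1, \dots, L,$$
with weights $\alpha_l \in (0, 1)$ increasing in $l$, for $L$ in-between frames.
(c) Trajectory Refinement: A multi-layer transformer ($\mathcal{T}$) further processes these embeddings:
$$(P_1, \dots, P_L) = \mathcal{T}(E_1, \dots, E_L),$$
training against real ground-truth frames $G_1, \dots, G_L$ via:
$$\mathcal{L} = \mathrm{MSE}\big((G_1, \dots, G_L),\, (\hat{F}_1, \dots, \hat{F}_L)\big),$$
where $\hat{F}_l$ represents the diffusion decoder’s frame from conditioning $P_l$.
(d) Diffusion Conditioning: For each intermediate frame, the refined motion embedding $P_l$ is concatenated to the text embedding ($c$) at the U-Net cross-attention layers:
$$\mathrm{CrossAttention}\big(Q_l,\ \mathrm{Concat}(K^{c}, K^{P}_l),\ \mathrm{Concat}(V^{c}, V^{P}_l)\big),$$
where $K^{c}, V^{c}$ are the keys and values projected from $c$, and $K^{P}_l, V^{P}_l$ are those projected from $P_l$.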
This construction ensures that subject appearance and layout remain coherent throughout potentially large pose, viewpoint, or scene transitions (Zhou et al., 2024).
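The trajectory-initialization step, linear interpolation between the start and end embeddings, can be illustrated as follows. This is a sketch only: the function name is hypothetical, embeddings are plain lists of floats, and the evenly spaced weights are an assumption, since the exact weighting is not specified in this text. The learned transformer refinement is not shown.

```python
def interpolate_embeddings(e_start, e_end, num_frames):
    """Return `num_frames` embeddings strictly between e_start and e_end."""
    frames = []
    for idx in range(1, num_frames + 1):
        # Assumed evenly spaced weights: 1/(L+1), 2/(L+1), ..., L/(L+1).
        alpha = idx / (num_frames + 1)
        frames.append([(1.0 - alpha) * s + alpha * e
                       for s, e in zip(e_start, e_end)])
    return frames
```

With three in-between frames the weights are 0.25, 0.5, and 0.75, so the sequence moves from the start embedding toward the end embedding in equal steps.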
5. End-to-End Inference Workflow
The complete image and video generation pipeline consists of two sequential stages:
Stage 1 (Comics Generation):
- Split a multi-sentence story into $B$ prompts $p_1, \dots, p_B$.
- Form a batch of size $B$ and run the DDIM sampler (50 steps) on SD XL with CSA.
- Decode the resulting latents into consistent images.
Stage 2 (Video Synthesis):
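- For each adjacent image pair $(F_k, F_{k+1})$, encode both images with the CLIP image encoder to obtain the start and end embeddings $(E_s, E_e)$.
- Interpolate and refine the embeddings via the transformer to yield $P_1, \dots, P_L$.
- For each $P_l$, generate the video frame $\hat{F}_l$ conditioned on the text embedding concatenated with $P_l$.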
- Concatenate all segments to construct a complete, temporally smooth video (Zhou et al., 2024).
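The final concatenation step of Stage 2 can be sketched as below. The names `stitch_video` and `make_transition` are hypothetical: `make_transition(a, b)` stands in for the whole SMP and diffusion step that produces the in-between frames for one keyframe pair. Emitting each keyframe exactly once is an assumption about how segment boundaries are handled.

```python
def stitch_video(keyframes, make_transition):
    """Build one frame list from keyframes and per-pair transition frames.

    keyframes: list of Stage 1 images.
    make_transition(a, b): returns the in-between frames from a to b.
    """
    if not keyframes:
        return []
    video = [keyframes[0]]
    # Walk over adjacent pairs (F_1, F_2), (F_2, F_3), ...
    for start, end in zip(keyframes, keyframes[1:]):
        video.extend(make_transition(start, end))
        video.append(end)  # each keyframe appears exactly once
    return video
```

Because every transition is computed from one adjacent pair only, the segments are independent of each other and could be generated in any order before stitching.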
6. Empirical Performance and Benchmarking
The reported benchmarks compare StoryDiffusion with IP-Adapter and PhotoMaker on subject-consistent image generation, and with SEINE and SparseCtrl on transition video generation, using both automatic metrics and user studies.
Subject consistency in images (compared on the SD XL backbone with 50 DDIM steps):
- Text–image CLIP similarity: IP-Adapter 0.613, PhotoMaker 0.654, StoryDiffusion 0.659
- Character–character CLIP similarity: IP-Adapter 0.880, PhotoMaker 0.892, StoryDiffusion 0.895
- User preference: StoryDiffusion 72.8%, IP-Adapter 10.4%, PhotoMaker 16.8%
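CLIP similarity scores of this kind are commonly computed as cosine similarities between CLIP embeddings. The evaluation protocol is not detailed in this text, so the following is only a sketch under that assumption: the function names are hypothetical, and the embeddings are assumed to have been extracted already as lists of floats.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Under this reading, a character-to-character score would average the similarity between image embeddings of the generated frames, while a text-to-image score would compare each image embedding with the embedding of its prompt.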
Transition video quality (compared on the SD 1.5 backbone with 50 DDIM steps):
| Metric | SEINE | SparseCtrl | StoryDiffusion |
|---|---|---|---|
| LPIPS-first (↓) | 0.433 | 0.491 | 0.379 |
| LPIPS-frames (↓) | 0.222 | 0.177 | 0.164 |
| CLIPSIM-first (↑) | 0.926 | 0.903 | 0.961 |
| CLIPSIM-frames (↑) | 0.974 | 0.976 | 0.987 |
| User preference (↑) | 11.6% | 6.4% | 82.0% |
Qualitative observations show better preservation of attire, facial appearance, and props, with minimal artifacts and physically plausible motion, even under large pose or viewpoint changes (Zhou et al., 2024).
7. Implementation Details and Reproducibility
- Code and pretrained weights are available at https://github.com/StoryDiffusion/StoryDiffusion.
- Experiment configuration: SD XL, 50 DDIM steps, guidance scale 5.0, with a fixed CSA sampling rate and tile size.
- SMP training uses WebVid10M transitions and integrates the AnimateDiff V2 temporal module and a CLIP ViT-H/14 encoder, with an 8-layer transformer (hidden dim 1024, 12 heads), optimized using AdamW (100k iterations on 8×A100 GPUs).
- For story sequences longer than batch limits, CSA is applied in a sliding window, and SMP is applied pairwise.
- All components are zero-shot compatible with standard SD XL/1.5 backbones; reproducing subject-consistent image sequences and transition videos requires only minimal code changes (Zhou et al., 2024).
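The sliding-window handling of long stories mentioned in the list above can be sketched as an index computation. The function name, the `window_size` and `stride` parameters, and the choice to align the last window with the end of the story are all assumptions made for illustration; the source does not specify them here.

```python
def sliding_windows(num_prompts, window_size, stride):
    """Return lists of prompt indices that cover 0..num_prompts-1 in order."""
    if window_size <= 0 or not 0 < stride <= window_size:
        raise ValueError("need window_size > 0 and 0 < stride <= window_size")
    if num_prompts <= 0:
        return []
    if num_prompts <= window_size:
        return [list(range(num_prompts))]
    windows = []
    start = 0
    while start + window_size < num_prompts:
        windows.append(list(range(start, start + window_size)))
        start += stride
    # Align the last window with the end so that no prompt is left out.
    windows.append(list(range(num_prompts - window_size, num_prompts)))
    return windows
```

Each window would then be processed as one CSA batch; choosing a stride smaller than the window size makes neighbouring windows share prompts, which is one plausible way to carry the subject's appearance from one window to the next.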