SEED-Story: Multimodal Narrative Generation
- SEED-Story is a multimodal narrative generation framework that integrates large language models and computer vision to create coherent, extended stories with interleaved text and images.
- The framework innovates with a multimodal attention sink mechanism that preserves essential token contexts, ensuring consistent character, style, and visual coherence over long sequences.
- Utilizing the StoryStream dataset and a diffusion de-tokenizer, SEED-Story outperforms baselines with lower FID scores and higher text–image alignment, confirming its efficient long story generation capabilities.
SEED-Story refers to a multimodal long story generation framework leveraging LLMs and computer vision techniques to create extended, interleaved sequences of text and images. The system is designed to generate coherent multimodal narratives with consistent character and visual style over long story arcs, overcoming the challenges of maintaining semantic and visual consistency across high-resolution, multi-step narrative content (Yang et al., 2024).
1. System Architecture
SEED-Story is built upon a pretrained multimodal LLM (MLLM) (e.g., LLaMA2-7B equipped with a Q-Former), which is capable of both ingesting image features and autoregressively generating the next text and visual tokens. The overall system consists of three primary components:
- Vision Tokenizer: A frozen ViT-G (e.g., Qwen-VL ViT-G) encodes each incoming image into a sequence of 256 patch features, providing a high-capacity visual representation.
- MLLM Core with Q-Former: A set of learnable queries attends over the vision features to interface with the transformer LLM (LLaMA2-7B). The transformer stack processes the concatenated queries and preceding text tokens to output both text continuations and predicted visual feature embeddings for the next image.
- Diffusion De-tokenizer: An SD-XL U-Net, pretrained for image denoising and adapted via finetuning, receives the MLLM’s output visual embeddings, conditioning the diffusion process to reconstruct high-fidelity images that are consistent with the story and prior outputs.
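The query-based interface between the vision tokenizer and the LLM can be pictured as a single cross-attention step. The following is a minimal NumPy sketch, not the authors' implementation: it omits learned projections, multiple heads, and the LLaMA2 stack. The query count (`N_QUERIES`) and feature width (`DIM`) are placeholder assumptions; only the figure of 256 patch features comes from the text.

```python
import numpy as np

def cross_attend(queries, patch_feats):
    """Single-head scaled dot-product cross-attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ patch_feats.T / np.sqrt(d)       # (n_queries, n_patches)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over patches
    return weights @ patch_feats, weights                # (n_queries, d), (n_queries, n_patches)

rng = np.random.default_rng(0)
N_PATCHES = 256  # from the text: 256 patch features per image
N_QUERIES = 64   # assumption: the query count is not stated in this section
DIM = 32         # assumption: toy feature width for illustration

patch_feats = rng.standard_normal((N_PATCHES, DIM))  # stand-in for frozen ViT output
queries = rng.standard_normal((N_QUERIES, DIM))      # stand-in for learnable queries
visual_tokens, weights = cross_attend(queries, patch_feats)
print(visual_tokens.shape)
```

Each query yields one fixed-size visual token regardless of how many patches the image has, which is what lets the LLM consume images as a short token sequence.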
Training proceeds in three sequential stages: (1) de-tokenizer pretraining; (2) instruction tuning of the MLLM/Q-Former module with interleaved text-image supervision; (3) de-tokenizer adaptation, where only the SD-XL U-Net is updated for precise image reconstruction.
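One way to read the three stages is as a schedule of which modules receive gradient updates. The sketch below encodes that reading. The module labels are illustrative rather than identifiers from the released code, and the freeze settings that the text does not state explicitly (for example, that the U-Net stays fixed during instruction tuning) are assumptions.

```python
# True means the module is updated in that stage.
# Labels are hypothetical; the ViT is frozen throughout, as stated in the text.
STAGES = {
    "1_detokenizer_pretraining": {"vit": False, "mllm_qformer": False, "sdxl_unet": True},
    "2_instruction_tuning":      {"vit": False, "mllm_qformer": True,  "sdxl_unet": False},
    "3_detokenizer_adaptation":  {"vit": False, "mllm_qformer": False, "sdxl_unet": True},
}

def trainable_modules(stage):
    """Return the names of the modules that are updated in the given stage."""
    return sorted(name for name, updated in STAGES[stage].items() if updated)

for stage in STAGES:
    print(stage, "->", trainable_modules(stage))
```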
2. Multimodal Attention Sink Mechanism
A core innovation in SEED-Story is the multimodal attention sink design. Standard dense attention is infeasible for long multimodal stories owing to quadratic memory scaling and eventual loss of global consistency, while simple sliding window approaches erase critical context, resulting in character or style collapse. SEED-Story circumvents these issues by continually preserving a small, carefully selected subset of token positions in the transformer’s KV-cache. These anchor points are:
- Start-of-Sequence (BOS) token.
- Begin-of-Image (BoI) and its few adjacent image tokens.
- End-of-Image (EoI) and adjacent image tokens.
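- Recent tokens within a sliding window of $W$ tokens, where $W$ matches or exceeds the training sequence length.
At each generation step $t$, only the indexed set of anchor KV pairs is retained:
$$\mathcal{S}_t = \{\mathrm{BOS}\} \cup \mathcal{N}_k(\mathrm{BoI}) \cup \mathcal{N}_k(\mathrm{EoI}) \cup \{t - W + 1, \dots, t\},$$
where $\mathcal{N}_k(\cdot)$ denotes each marker token together with its $k$ adjacent image tokens, and $k$ is typically 1–2. Attention for new tokens is computed only over $\mathcal{S}_t$, enabling both efficient scaling and the preservation of long-range visual and narrative context. Empirical analysis confirms that nearly all attention heads concentrate on these positions, validating the sufficiency of the mechanism for maintaining long-term coherence.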
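The retention rule can be sketched as a function that selects which cache positions survive. This is an illustrative reading of the description, not the released implementation: the token-type labels, the function name, and the interpretation of "adjacent" as the image tokens immediately after BoI and immediately before EoI are assumptions.

```python
def retained_positions(token_types, k=1, window=4):
    """Return the sorted KV-cache positions kept by a multimodal attention sink.

    token_types: list of 'bos' | 'text' | 'boi' | 'img' | 'eoi' (hypothetical labels).
    k: number of adjacent image tokens kept next to each BoI/EoI marker.
    window: number of most recent tokens that are always kept.
    """
    n = len(token_types)
    keep = set()
    for pos, kind in enumerate(token_types):
        if kind == "bos":
            keep.add(pos)
        elif kind == "boi":
            # BoI marker plus up to k image tokens that follow it (assumed reading)
            keep.update(p for p in range(pos, min(pos + k + 1, n))
                        if token_types[p] in ("boi", "img"))
        elif kind == "eoi":
            # up to k image tokens that precede EoI, plus the marker itself
            keep.update(p for p in range(max(pos - k, 0), pos + 1)
                        if token_types[p] in ("img", "eoi"))
    keep.update(range(max(n - window, 0), n))  # recent-token window
    return sorted(keep)

image = ["boi"] + ["img"] * 4 + ["eoi"]
seq = ["bos"] + ["text"] * 3 + image + ["text"] * 3 + image + ["text"] * 2
kept = retained_positions(seq, k=1, window=4)
print(kept)  # [0, 4, 5, 8, 9, 13, 14, 17, 18, 19, 20]
print(f"{len(kept)} of {len(seq)} positions retained")
```

In this toy sequence the text tokens of the first paragraph are evicted while the boundary tokens of both images remain, which is the behavior the mechanism relies on for visual consistency.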
3. Dataset: StoryStream
The StoryStream dataset underpins SEED-Story training and evaluation, providing the largest available high-resolution resource for animated multimodal story generation:
- 257,850 images grouped into sequences of up to 30 images per story.
- Average text length: 146 tokens per description.
- Image resolution: substantially higher than that of standard benchmarks such as Flintstones or Pororo.
- Data source pipeline: keyframes and subtitles from "Curious George," "Rabbids," and "The Land Before Time" are paired with GPT-4V/Qwen-VL-generated descriptions, then further grouped and synthesized into narrative paragraphs by GPT-4.
This scale and data richness facilitate instruction tuning for long, coherent multimodal generation.
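As a rough illustration of the grouping step, the sketch below chunks an ordered list of captioned keyframes into stories of at most 30 images. It is a simplification under stated assumptions: the actual pipeline groups frames by content and synthesizes narrative text with GPT-4, whereas this example only splits by count, and the record fields (`image`, `caption`) are hypothetical.

```python
import math

MAX_IMAGES_PER_STORY = 30  # from the text: up to 30 images per story

def group_into_stories(frames, max_len=MAX_IMAGES_PER_STORY):
    """Split an ordered list of frame records into consecutive chunks of at most max_len."""
    return [frames[i:i + max_len] for i in range(0, len(frames), max_len)]

# Toy records standing in for keyframe/description pairs.
frames = [{"image": f"frame_{i:04d}.png", "caption": f"caption {i}"} for i in range(65)]
stories = group_into_stories(frames)
print([len(s) for s in stories])  # [30, 30, 5]

# Lower bound on the number of stories implied by the reported dataset size.
min_stories = math.ceil(257_850 / MAX_IMAGES_PER_STORY)
print(min_stories)  # 8595
```

Because 30 is an upper bound per story, the reported 257,850 images imply at least 8,595 sequences; the true count depends on how many stories are shorter.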
4. Training Objectives and Methodology
SEED-Story is optimized via a composite objective during instruction tuning:
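$$\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda \, \mathcal{L}_{\text{img}}$$
where $\mathcal{L}_{\text{text}}$ is the cross-entropy loss for next-token (text) prediction, and $\mathcal{L}_{\text{img}}$ is a cosine distance loss enforcing agreement between MLLM-predicted visual embeddings and the ViT features for the target image. $\lambda$ is set to 1 in experiments.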
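A minimal NumPy sketch of such a composite objective follows, combining a text cross-entropy term with a weighted cosine distance between predicted and target visual features. The function names, tensor shapes, and averaging conventions are assumptions for illustration; the weight defaults to 1 as reported.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy; logits is (n_tokens, vocab), targets is (n_tokens,)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def cosine_distance(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over feature vectors of shape (n, dim)."""
    num = (pred * target).sum(axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return (1.0 - num / den).mean()

def composite_loss(logits, targets, pred_feats, vit_feats, lam=1.0):
    """Text cross-entropy plus lam times the visual cosine distance."""
    return cross_entropy(logits, targets) + lam * cosine_distance(pred_feats, vit_feats)

rng = np.random.default_rng(0)
logits = np.zeros((3, 4))            # uniform prediction over a toy vocabulary of 4
targets = np.array([0, 1, 2])
feats = rng.standard_normal((3, 5))  # identical prediction and target features
loss = composite_loss(logits, targets, feats, feats)
print(loss, np.log(4))
```

With uniform logits and perfectly matched visual features, the visual term vanishes and the loss reduces to the cross-entropy of a uniform guess, which is a quick sanity check on both terms.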
De-tokenizer adaptation is accomplished with an SD-XL U-Net trained using the standard diffusion denoising objective, augmented by an additional loss between the predicted and target images.
Hardware configuration: instruction tuning on 8× NVIDIA A800 GPUs (bf16, 6 epochs); de-tokenizer finetuning on 4× A800 GPUs (bf16, 3 epochs).
5. Experimental Results and Quantitative Analysis
SEED-Story is evaluated both quantitatively and qualitatively against strong baselines (MM-interleaved, StoryGen, LDM):
- Visual Quality (FID): Lower FID scores indicate higher quality; SEED-Story achieves 79.67, outperforming dense attention (119.72), sliding-window (334.90), and vanilla sink (221.53).
- Text–Image Alignment: GPT-4V preference ratings over 180 stories find SEED-Story preferred in image quality (74% win rate), style consistency (68%), story engagement (61%), and coherence (75%).
- User Study: 55–86% user preference over five evaluation criteria.
- Long Story Generation: SEED-Story stably generates 25–50 sequential image–text pairs, well beyond its training horizon.
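For readers unfamiliar with the FID metric cited above, it is the Fréchet distance between two Gaussians fitted to image features. The sketch below computes that distance with NumPy on toy features. It is illustrative only: standard FID uses Inception-v3 activations, whereas the random features here are stand-ins, and the function names are hypothetical.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}) for two Gaussians."""
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues of S1 S2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def fid_from_features(feats_a, feats_b):
    """Fit a Gaussian to each feature set (rows are samples) and compare them."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    return frechet_distance(mu_a, cov_a, mu_b, cov_b)

rng = np.random.default_rng(0)
feats_a = rng.standard_normal((500, 8))  # toy stand-in for real-image features
feats_b = feats_a + 1.0                  # same spread, mean shifted by 1 in each of 8 dims
fid_same = fid_from_features(feats_a, feats_a)
fid_shifted = fid_from_features(feats_a, feats_b)
print(fid_same, fid_shifted)
```

Identical feature sets give a distance near zero, and a pure mean shift contributes only the squared norm of the shift, which is why lower values indicate generated images whose feature statistics are closer to the reference set.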
Ablation confirms that only the multimodal sink mechanism consistently maintains both narrative and visual consistency at long lengths, preserving high CLIP scores and modest memory footprints.
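The memory argument can be made concrete with a back-of-the-envelope estimate of KV-cache size. The sketch below uses the published LLaMA2-7B shape (32 layers, hidden size 4096) and 2 bytes per bf16 value; the token counts for the dense and sink cases are hypothetical round numbers chosen for illustration, not measurements from the paper.

```python
def kv_cache_bytes(n_tokens, n_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens in a standard multi-head decoder."""
    # Factor 2 covers keys and values; each is one hidden_size vector per layer per token.
    return 2 * n_layers * hidden_size * bytes_per_value * n_tokens

MIB = 2 ** 20
per_token = kv_cache_bytes(1)

# Hypothetical scenario: a long story totals 5,000 tokens, while a sink-style
# cache retains a fixed budget of 1,000 positions regardless of story length.
dense_mib = kv_cache_bytes(5_000) / MIB
sink_mib = kv_cache_bytes(1_000) / MIB
print(per_token, dense_mib, sink_mib)
```

Under these assumptions a dense cache grows linearly with story length, while a cache with a fixed retention budget stays constant, which is consistent with the modest footprint reported for the sink mechanism.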
6. Limitations and Prospective Directions
SEED-Story’s current limitations include:
- Domain Constraint: StoryStream contains exclusively animated content. Application to real-world photographic content has not yet been validated. Expanding to web-scale interleaved datasets (e.g., Obelics) is a suggested direction.
- Character Set Restriction: Character consistency relies on a fixed set of protagonists; open-set or few-shot personalized story generation (e.g., with DreamBooth integration) remains an open problem.
Additional future work could involve more dynamic attention mechanisms, live-action story generalization, and personalized multimodal generation through advanced prompt conditioning or parameter-efficient finetuning.
7. Significance and Impact
SEED-Story establishes a robust, scalable paradigm for long-form multimodal story generation by integrating efficient attention mechanisms, high-capacity diffusion de-tokenization, and large-scale animated training data. Its innovations permit the generation of character-consistent, richly visualized narratives that maintain coherence beyond the conventional limits seen in prior models. The public release of the model and the StoryStream dataset is positioned to catalyze further research in open-domain personalized multimodal content creation and to contribute significantly to the advancement of narrative AI systems (Yang et al., 2024).