StoryStream: Multimodal Story Dataset
- StoryStream is a large-scale multimodal dataset with a fixed 30 image–text pairs per story, designed for coherent narrative generation.
- It includes 8,595 stories and 257,850 sequences drawn from popular animated franchises like Curious George and The Land Before Time.
- An automated three-stage pipeline—video-frame extraction, pre-annotation via GPT-4V, and narrative synthesis—ensures high-quality, aligned storytelling.
StoryStream is a large-scale, high-resolution, animated-style multimodal dataset specifically constructed to benchmark and advance long-sequence image–text generation in the context of narrative story creation. Designed for probing the interplay between textual and visual modalities in extended, coherent storytelling, StoryStream provides purpose-built corpora derived from well-known animated franchises and is structured to enable rigorous evaluation of multimodal long-form generation and consistency (Yang et al., 2024).
1. Dataset Composition
StoryStream consists of sequences drawn from three popular cartoon series: Curious George, Rabbids Invasion, and The Land Before Time. Each independent story comprises 30 interleaved image–text pairs, forming a multimodal narrative arc. The dataset contains 8,595 independent stories, yielding a total of 257,850 multimodal sequences (30 sequences per story). Each image corresponds to an individual narrative segment, and each segment averages 146 tokens, resulting in an average story length of approximately 4,380 tokens:
$$\bar{T}_{\text{story}} \approx 30 \times 146 = 4{,}380 \ \text{tokens}$$
Similarly, the number of images per story is fixed:
$$N_{\text{img}} = 30$$
This degenerate (zero-variance) structure in the number of segments per story standardizes downstream benchmarking and model evaluation.
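The counts above follow from simple arithmetic, which a short Python sketch can confirm. The constant and variable names are illustrative and are not part of any released StoryStream tooling. The corpus-wide token total is an estimate derived here from the reported per-segment average, not a figure stated in the source.

```python
# Figures reported in the text above.
NUM_STORIES = 8_595
PAIRS_PER_STORY = 30          # fixed for every story
AVG_TOKENS_PER_SEGMENT = 146  # reported average, not a fixed length

# Total image-text pairs across the dataset.
total_pairs = NUM_STORIES * PAIRS_PER_STORY

# Average story length in tokens.
avg_tokens_per_story = PAIRS_PER_STORY * AVG_TOKENS_PER_SEGMENT

# Derived estimate (assumption: the per-segment average holds corpus-wide).
est_total_tokens = NUM_STORIES * avg_tokens_per_story

print(total_pairs)           # 257850
print(avg_tokens_per_story)  # 4380
print(est_total_tokens)      # 37646100
```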
2. Data Collection and Annotation Pipeline
StoryStream is assembled through an automated three-stage process:
- Video-frame extraction: Using the video2dataset tool, keyframes and subtitles are extracted from publicly accessible streams of the three cartoon series.
- Vision–language pre-annotation: Each keyframe is annotated via GPT-4V, or alternatively Qwen-VL, to generate detailed scene descriptions, which are aligned with the extracted subtitles.
- Narrative synthesis: Batches of 30 tuples (keyframe, subtitle, scene description) are provided to GPT-4 with series-specific background prompts. Chain-of-thought instructions structure the generation of fluid, storybook-style narrative text.
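The batching step in the third stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `AnnotatedFrame` schema, the `batch_frames` and `build_prompt` helpers, the prompt wording, and the choice to drop a trailing partial batch are all hypothetical. The sketch makes no calls to video2dataset, GPT-4V, or GPT-4; it only shows how 30 tuples might be grouped and turned into one text prompt.

```python
from dataclasses import dataclass
from typing import Iterator, List

STORY_LENGTH = 30  # tuples per story, as stated in the text


@dataclass
class AnnotatedFrame:
    """One (keyframe, subtitle, scene description) tuple; hypothetical schema."""
    keyframe_path: str
    subtitle: str
    scene_description: str


def batch_frames(frames: List[AnnotatedFrame],
                 size: int = STORY_LENGTH) -> Iterator[List[AnnotatedFrame]]:
    """Yield consecutive full batches; a trailing partial batch is dropped
    (an assumption made here, not a documented rule)."""
    for start in range(0, len(frames) - size + 1, size):
        yield frames[start:start + size]


def build_prompt(batch: List[AnnotatedFrame], series_background: str) -> str:
    """Assemble one narrative-synthesis prompt; the wording is hypothetical."""
    lines = [
        series_background,
        "Think step by step about how the scenes connect, "
        "then write one storybook-style passage per scene.",
        "",
    ]
    for i, frame in enumerate(batch, start=1):
        lines.append(
            f"Scene {i}: subtitle={frame.subtitle!r}; "
            f"description={frame.scene_description!r}"
        )
    return "\n".join(lines)


# Usage with placeholder data: 65 frames give two full batches of 30.
demo_frames = [
    AnnotatedFrame(f"frame_{i:04d}.jpg", f"subtitle {i}", f"description {i}")
    for i in range(65)
]
demo_batches = list(batch_frames(demo_frames))
demo_prompt = build_prompt(demo_batches[0], "Background: placeholder series notes.")
print(len(demo_batches))
```

Keeping the batch size as a single constant mirrors the fixed story length described above, so every generated story has the same number of segments by construction.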
Prompt engineering is central to quality control, with detailed image descriptions reducing ambiguity and chain-of-thought instructions ensuring logical cohesion. Spot-checks and early-stopping rules are employed to constrain off-topic generations. The process does not require manual annotation of characters, but periodic human review verifies the narrative integrity and visual–textual alignment.
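The source does not say how stories are selected for spot-checks or human review. One plausible approach, shown here purely as an assumption, is a seeded random sample so that the reviewed subset is reproducible. The function name `sample_for_review`, the 1% default rate, the seed, and the story ID format are all hypothetical.

```python
import random
from typing import List, Sequence


def sample_for_review(story_ids: Sequence[str],
                      rate: float = 0.01,
                      seed: int = 0) -> List[str]:
    """Return a reproducible random subset of story IDs for human review.

    The rate and seed are illustrative defaults, not values from the source.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    if not story_ids:
        return []
    k = max(1, round(len(story_ids) * rate))
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return sorted(rng.sample(list(story_ids), k))


# Usage with placeholder IDs for the 8,595 stories.
all_ids = [f"story_{i:04d}" for i in range(8_595)]
review_ids = sample_for_review(all_ids)
print(len(review_ids))
```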
3. Modality Format and Statistics
Text Modality:
Each story consists of a fixed number of narrative segments (30), each averaging 146 tokens. The distribution over the number of segments per story is degenerate, with mean, median, and standard