SIND: Sequential Image Narrative Dataset
- The dataset is the first large-scale resource explicitly designed for visual storytelling, offering rich image sequences paired with narrative annotations.
- It features three annotation tiers (DII, DIS, SIS) that provide isolated captions, context-informed descriptions, and cohesive story narratives.
- Baseline models using seq2seq architectures and advanced decoding heuristics demonstrate improved METEOR scores, highlighting SIND's research impact.
The Sequential Image Narrative Dataset (SIND) is the first large-scale dataset explicitly designed for sequential vision-to-language tasks, specifically targeting visual storytelling. Released as SIND v.1, it comprises richly annotated sequences of images paired with both structured captions and narrative stories, and forms a foundational resource for research in machine-based narrative understanding and generation (Huang et al., 2016).
1. Dataset Structure and Annotation Tiers
SIND v.1 consists of 20,211 image sequences (“stories-in-sequence,” SIS) comprising a total of 81,743 unique images. Each sequence contains five images, selected from photo albums of 10–50 images sourced from the Flickr YFCC100M collection. The dataset features three distinct annotation tiers:
- DII (Descriptions of Images-in-Isolation): Single-image captions in the style of MS COCO, annotated for each image independently.
- DIS (Descriptions of Images-in-Sequence): MS COCO-style captions for each image within its chosen narrative sequence, conditioning the description on the sequence context.
- SIS (Stories for Images-in-Sequence): A narrative constructed as a series of sentences, one per image, that together form a cohesive story. Each “story card” within a SIS provides a sentence or phrase corresponding to a specific image and its position in the sequence.
Summary of annotation statistics (SIS is measured per story; DII and DIS per single caption):

| Annotation Tier | Sentences per Item (mean, σ) | Tokens per Item (mean, σ) |
|---|---|---|
| SIS (story) | 5.00, 0.00 | 63.18, 30.90 |
| DIS (caption) | 1.00, — | 11.12, 6.43 |
| DII (caption) | 1.00, — | 10.92, 6.13 |
JSON entries include fields for album ID, sequence ID, image ID, image URL, DII caption, DIS caption, SIS story sentence, and story position.
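As an illustration, one flattened record could look like the following Python dict. All key names, identifiers, and values here are hypothetical approximations of the fields listed above, not the release's exact schema:

```python
# One flattened annotation record. Keys mirror the fields named in the
# text, but all identifiers, values, and exact key names here are
# hypothetical; consult the released JSON schema for the real ones.
record = {
    "album_id": "album_001",
    "sequence_id": "seq_001",
    "image_id": "img_001",
    "image_url": "https://...",      # Flickr photo URL, elided
    "dii_caption": "A group of people stands on a beach.",
    "dis_caption": "The family gathers on the beach before the ceremony.",
    "sis_sentence": "Everyone was excited for the big day.",
    "story_position": 1,             # 1..5 within the sequence
}
```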
2. Data Collection and Quality Control
The image source is Flickr’s Creative Commons–licensed YFCC100M dataset. Albums were required to contain 10–50 images taken within 48 hours. “Storyable” event albums were identified using possessive 5-grams (such as “John’s graduation”), filtered in accordance with WordNet event classes. Frequent album themes included “beach,” “amusement park,” and “birthday.”
Data annotation proceeded in multiple crowdsourcing stages:
- Stage 1 (Storytelling): Workers viewed the full album, selected and ordered five images, and composed a narrative one sentence at a time (“story cards”) for SIS. Albums lacking consensus “storyability” (skipped by ≥2 workers) were excluded.
- Stage 2 (Re-telling): Additional workers selected one of two initial sequences per album and wrote an entirely new narrative.
- DII and DIS Captions: For each image, three workers wrote DII captions and, for images in a chosen sequence, three workers wrote DIS captions.
Quality assurance involved filters such as minimum story lengths (15 words), post-processing with Stanford CoreNLP for tokenization and entity anonymization, and removal of albums with low annotator agreement on “storyable” status. Robustness was bolstered via multi-stage retelling, though no explicit inter-annotator statistic (e.g., κ) is provided.
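The length filter and anonymization step can be sketched as follows. The regex-based anonymizer is a toy stand-in for the CoreNLP NER pipeline, and the placeholder token format is an assumption:

```python
import re

MIN_STORY_TOKENS = 15  # minimum story length used as a quality filter

def passes_length_filter(story):
    """True if the story meets the 15-token minimum."""
    return len(story.split()) >= MIN_STORY_TOKENS

def anonymize(text, person_names):
    """Toy stand-in for CoreNLP-based entity anonymization: replace
    known person names with a placeholder token. The release uses
    Stanford CoreNLP NER; the placeholder format here is assumed."""
    for name in person_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[name]", text)
    return text
```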
3. Dataset Access, Formatting, and Splits
SIND v.1 is organized into 80% training, 10% validation, and 10% test splits at the album level (approximately 16k/2k/2k albums, respectively). Release artifacts are provided as JSON annotation files (stories.json, captions_dii.json, captions_dis.json) accompanied by corresponding image URLs. The standard directory structure separates images and annotation files by split. Full documentation and download access are available at http://sind.ai; users are directed to cite Huang et al., “Visual Storytelling,” NAACL 2016.
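A minimal loader for the split layout described above might look like this. The per-split directory structure is assumed from the description; the released archive's exact layout may differ:

```python
import json
from pathlib import Path

def load_annotations(root, split):
    """Load the three annotation tiers for one split.

    Assumes a per-split layout, i.e. <root>/<split>/ containing
    stories.json, captions_dii.json, and captions_dis.json; the
    actual layout of the release may differ.
    """
    base = Path(root) / split
    tiers = {}
    for name in ("stories", "captions_dii", "captions_dis"):
        with open(base / f"{name}.json") as f:
            tiers[name] = json.load(f)
    return tiers
```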
4. Baseline Tasks, Models, and Training Protocols
The primary modeling target on SIND is visual storytelling: the generation of a coherent narrative conditioned on a sequence of images. Baseline models employ a sequence-to-sequence (seq2seq) architecture:
- Image Encoder: An RNN (GRU) processes fc7 visual features in reverse sequence order.
- Story Decoder: A GRU generates the narrative one word per timestep, with hidden-state update $h_t = \mathrm{GRU}(x_t, h_{t-1})$, where $x_t$ embeds the previously generated word and the initial state is the encoder's final state.
- Training Loss: The standard negative log-likelihood of the reference story, $\mathcal{L} = -\sum_t \log p(w_t \mid w_{<t}, I_{1:N})$, where $I_{1:N}$ denotes the image sequence.
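The decoder's hidden-state update can be made concrete with a minimal NumPy sketch of a single GRU step. Bias terms are omitted and parameter naming is illustrative, not taken from the baseline's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU hidden-state update h_t = GRU(x_t, h_{t-1}).

    Bias terms are omitted for brevity; Wz/Wr/Wh act on the input,
    Uz/Ur/Uh on the previous hidden state.
    """
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde        # interpolated update
```

In the baseline, x would be the embedding of the previously generated word and the initial hidden state the image encoder's final state.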
Decoding variants explored include beam search (beam=10), greedy decoding (beam=1), a –Dups heuristic preventing intra-story content word duplication, and +Grounded decoding (permitting “visual” words only if licensed by DII captions). Greedy decoding outperforms beam search for SIS (raising METEOR by ~4.6 points), while –Dups and +Grounded heuristics yield further METEOR improvements (+2.3 and +1.3, respectively). Model hyperparameters include embedding/hidden size 512, dropout 0.5, a vocabulary of ~20k, Adam optimizer (lr ≈ 4e-4), and 25–50 training epochs.
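The –Dups heuristic can be sketched as a constraint layered on greedy decoding. The candidate-list representation and stopword handling below are illustrative, not the paper's implementation:

```python
def greedy_no_dups(step_candidates, stopwords):
    """Greedy decoding with a -Dups-style heuristic: at each timestep,
    take the highest-scoring candidate word, skipping content words
    already emitted earlier in the story.

    step_candidates: list of [(word, log_prob), ...] per timestep.
    If every candidate at a step is a duplicate, the step emits nothing
    (a real decoder would fall back to a lower-ranked hypothesis).
    """
    story, used = [], set()
    for candidates in step_candidates:
        for word, _ in sorted(candidates, key=lambda wp: -wp[1]):
            if word in stopwords or word not in used:
                story.append(word)
                if word not in stopwords:
                    used.add(word)
                break
    return story
```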
5. Evaluation Methodologies
SIND supports both automatic and manual evaluation protocols.
- Automatic Metrics:
BLEU-N (modified n-gram precision combined with a brevity penalty), METEOR (alignment-based unigram matching with a fragmentation penalty), and Skip-Thoughts cosine similarity over sentence embeddings. Correlation with human judgments: METEOR (0.20) > Skip-Thoughts (0.16) > BLEU (0.08). ROUGE and CIDEr are also applicable.
- Human Evaluation:
Five Amazon Mechanical Turk judges per story rate each output (1–5 scale) for suitability as a personal story-sharing narrative (“If these were my photos, I would like using a story like this to share my experience with my friends”), with final scores averaged across judges.
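Of the automatic metrics above, BLEU-N is the simplest to sketch. The following is a simplified, unsmoothed sentence-level version; the reported scores come from standard corpus-level tooling:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-N with a brevity penalty.

    An unsmoothed sketch over token lists with a single reference;
    a zero n-gram match yields a score of 0.
    """
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0              # no smoothing in this sketch
        log_p += math.log(clipped / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_p)
```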
Illustrative examples for both human-authored and baseline-generated SIS stories are provided, with the latter demonstrating the impact of –Dups and content realism constraints.
6. Applications and Limitations
Documented applications of SIND include:
- Multimodal narrative generation for photo-sharing platforms
- Event understanding for robotics and virtual agents
- Memory retrieval and summarization for personal photo albums
- Enhancing social media accessibility, for example, for visually impaired users
Documented limitations comprise:
- Domain Bias: Overrepresentation of social and celebratory events from Flickr
- Structural Constraints: Sequence length fixed at five images per story, limiting narrative variability
- Annotator Bias: Dominance of informal narrative style and English-speaking contributors
- Limited Diversity: Limited cultural diversity in narratives, and no explicit inter-annotator agreement statistic is reported
A plausible implication is that SIND facilitates research into bridging concrete visual description (DII, DIS) with figurative and subjective storytelling (SIS), potentially advancing artificial intelligence toward nuanced, human-like narrative understanding grounded in visual event structure and expression (Huang et al., 2016).