Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling (2408.03695v1)

Published 7 Aug 2024 in cs.CV

Abstract: Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-LLMs to generate captions that are then polished by an LLM for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when a long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multi-modal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/

Summary

  • The paper presents a novel large-scale dataset featuring 100 million annotated unique samples and 1 million fully annotated sequences for visual storytelling.
  • The paper employs a specialized methodology that integrates keyframe extraction, BLIP2 captioning, and instance masking with SAM and YOLO-World to ensure narrative coherence.
  • The paper benchmarks its approach using Cohere-Bench, demonstrating improved semantic alignment, style consistency, and instance integrity compared to existing datasets.

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Overview

The paper "Openstory++ : A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling" presents a significant advancement in the field of multimodal AI, specifically addressing the challenges in maintaining consistency of multiple instances across frames in open-domain visual storytelling. The authors introduce Openstory++, a large-scale dataset enriched with instance-level annotations and tailored to improve the training and evaluation of multi-modal generative models. Furthermore, the paper elaborates on a specialized training methodology and introduces a pioneering benchmark framework called Cohere-Bench, both designed to enhance and assess the capabilities of visual storytelling models comprehensively.

Dataset and Methodology

The Openstory++ dataset is notable for its large scale and comprehensive annotations. The dataset incorporates:

  • 100 million high-quality, annotated unique samples
  • 1 million fully annotated sequence samples

A critical feature of Openstory++ is its emphasis on instance-level visual segmentation annotations, enabling the creation of coherent visual narratives that maintain subject consistency across frames. The dataset construction pipeline employs the following steps:

  1. Keyframe Extraction: From open-domain videos, keyframes are extracted and aesthetically evaluated to ensure high-quality inputs.
  2. Caption Generation: BLIP2 is used to generate descriptive captions, which are then refined by an LLM to ensure narrative coherence.
  3. Instance Masking: The Segment Anything Model (SAM) and YOLO-World are utilized to annotate and mask instances within images, providing detailed instance-level annotations.

This methodology addresses the primary shortcoming of existing datasets: the lack of granular instance feature labeling, which is essential for training models to maintain consistency of multiple instances across frames. A minimal sketch of how the captioning and masking stages could be assembled is given below.
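As a rough, hedged illustration only, the sketch below wires the captioning and masking stages together from publicly available components (BLIP2 via Hugging Face transformers, YOLO-World via ultralytics, and the official segment_anything package). It is not the authors' released pipeline: the checkpoint names and prompt vocabulary are assumptions for illustration, and the keyframe-extraction and LLM-refinement stages are omitted.

```python
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from ultralytics import YOLOWorld

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Step 2 (sketch): caption each keyframe with BLIP2.
# The subsequent LLM-based narrative refinement is not shown here.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def caption_keyframe(image: Image.Image) -> str:
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out = blip2.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

# Step 3 (sketch): open-vocabulary boxes from YOLO-World, refined into
# segmentation masks by SAM. Checkpoint file names are illustrative.
detector = YOLOWorld("yolov8l-worldv2.pt")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

def mask_instances(image: Image.Image, vocabulary: list[str]):
    """Return (boxes, masks) for instances matching the vocabulary terms."""
    detector.set_classes(vocabulary)
    boxes = detector.predict(image, verbose=False)[0].boxes.xyxy.cpu().numpy()
    predictor.set_image(np.array(image))
    masks = []
    for box in boxes:
        mask, _, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(mask[0])  # binary mask aligned with the input image
    return boxes, masks
```

One natural choice, consistent with the instance-aware goal of the dataset, is to derive the vocabulary passed to mask_instances from the refined captions, so that masks correspond to the entities the narrative actually mentions.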

Benchmark Framework

The authors also introduce Cohere-Bench, a benchmark designed to evaluate the following dimensions of visual storytelling models (a rough scoring sketch follows the list):

  • Semantic Alignment: Ensuring the thematic alignment between visual and textual content.
  • Background Consistency: Maintaining visual coherence in the background across frames.
  • Style Consistency: Ensuring stylistic uniformity in generated frames.
  • Instance Consistency and Integrity: Maintaining the visual and semantic coherence of instances across frames.
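Cohere-Bench's exact scoring formulas are defined in the paper; as a non-authoritative approximation, dimensions such as semantic alignment and instance consistency are commonly computed from CLIP embeddings. The sketch below shows that generic recipe; the CLIP checkpoint and the score definitions are assumptions, not the benchmark's implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def semantic_alignment(frames: list[Image.Image], captions: list[str]) -> float:
    """Mean image-text cosine similarity between each frame and its caption."""
    inputs = proc(text=captions, images=frames, return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).mean().item()

@torch.no_grad()
def instance_consistency(crops: list[Image.Image]) -> float:
    """Mean pairwise similarity of crops of the same instance across frames.

    Requires at least two crops of the instance.
    """
    inputs = proc(images=crops, return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = feats @ feats.T
    n = len(crops)
    return ((sims.sum() - n) / (n * (n - 1))).item()  # exclude the diagonal
```

Style and background consistency could be approximated analogously by comparing whole-frame embeddings across the generated sequence.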

Experiments conducted using Cohere-Bench demonstrate the superiority of Openstory++ over existing datasets in fostering high-quality visual storytelling models. The results indicate improvements in semantic alignment, style consistency, and instance integrity.

Experimental Results

Quantitative evaluation using metrics such as semantic alignment, style consistency, and instance consistency shows that models trained on Openstory++ outperform those trained on other datasets. For instance:

  • Semantic Alignment: 0.262 for Openstory++ compared to 0.183 for VIST.
  • Style Consistency: 0.783 for Openstory++ compared to 0.742 for VIST.
  • Instance Consistency: 0.765 for Openstory++ compared to 0.598 for VIST.

Human evaluations further validate the quality and consistency of visual stories generated by models trained on Openstory++.

Implications and Future Directions

The introduction of Openstory++ and Cohere-Bench has several implications:

  • Enhanced Model Training: With granular instance-level annotations, models can be trained to maintain continuity in multi-frame visual narratives.
  • Comprehensive Evaluation: Cohere-Bench provides a robust framework for assessing multi-modal generation capabilities, with a focus on long-context entity consistency and multi-turn generation.

Future developments could explore expanding the dataset to cover even more diverse scenarios and refining the annotation pipeline for improved accuracy. Additionally, integrating these advancements into LLMs could further bridge the gap between visual and textual narrative coherence.

Conclusion

Openstory++ and the accompanying Cohere-Bench framework represent a substantial contribution to the field of AI and multi-modal storytelling. By addressing the limitations of existing datasets and providing a thorough evaluation methodology, the authors pave the way for the development of sophisticated models capable of generating coherent and contextually rich visual stories. This work not only enhances current generative capabilities but also sets a benchmark for future innovations in the domain.

Overall, the Openstory++ dataset and Cohere-Bench framework will likely play a pivotal role in advancing the effectiveness and robustness of instance-aware visual storytelling models in open-domain settings.