
StoryGAN: A Sequential Conditional GAN for Story Visualization (1812.02784v2)

Published 6 Dec 2018 in cs.CV

Abstract: We propose a new task, called Story Visualization. Given a multi-sentence paragraph, the story is visualized by generating a sequence of images, one for each sentence. In contrast to video generation, story visualization focuses less on the continuity in generated images (frames), but more on the global consistency across dynamic scenes and characters -- a challenge that has not been addressed by any single-image or video generation methods. We therefore propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework. Our model is unique in that it consists of a deep Context Encoder that dynamically tracks the story flow, and two discriminators at the story and image levels, to enhance the image quality and the consistency of the generated sequences. To evaluate the model, we modified existing datasets to create the CLEVR-SV and Pororo-SV datasets. Empirically, StoryGAN outperforms state-of-the-art models in image quality, contextual consistency metrics, and human evaluation.

Authors (9)
  1. Yitong Li (95 papers)
  2. Zhe Gan (135 papers)
  3. Yelong Shen (83 papers)
  4. Jingjing Liu (139 papers)
  5. Yu Cheng (354 papers)
  6. Yuexin Wu (23 papers)
  7. Lawrence Carin (203 papers)
  8. David Carlson (36 papers)
  9. Jianfeng Gao (344 papers)
Citations (200)

Summary

StoryGAN: A Sequential Conditional GAN for Story Visualization

The paper introduces StoryGAN, a model for a newly proposed task termed "Story Visualization." The task is to transform a multi-sentence paragraph into a coherent sequence of images, one image per sentence, with an emphasis on global consistency across dynamic scenes and characters rather than the frame-to-frame continuity central to video generation. The paper highlights the challenges this poses, such as keeping the depicted characters and scenes consistent across images, and addresses them with the StoryGAN framework.

StoryGAN Framework

StoryGAN is built on a sequential conditional GAN framework and introduces components not found in prior image or video generation methods. It features a deep Context Encoder that combines a GRU cell with a newly developed Text2Gist cell, designed to capture and dynamically track the flow of the story; the Context Encoder incorporates prior contextual information to improve semantic coherence across the generated images. The model also employs two discriminators operating at different levels: an image-level discriminator that enforces sentence-image relevance, and a story-level discriminator that enforces global consistency between the generated image sequence and the overarching story.
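
To make the structure of the Context Encoder concrete, the following is a minimal PyTorch sketch of the idea, assuming simplified dimensions and a gating-style stand-in for the Text2Gist cell; the class names and fusion mechanism here are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Text2GistCell(nn.Module):
    """Simplified stand-in for the Text2Gist cell: fuses the current sentence
    encoding with a story-level hidden state into a 'gist' vector that
    conditions the image generator for that step."""
    def __init__(self, sent_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(sent_dim, hidden_dim)            # tracks the story flow
        self.to_filter = nn.Linear(sent_dim, hidden_dim)        # sentence-conditioned gate

    def forward(self, sent_vec, h_prev):
        h_t = self.gru(sent_vec, h_prev)                        # update global story state
        gist = torch.sigmoid(self.to_filter(sent_vec)) * h_t    # mix local and global info
        return gist, h_t

class ContextEncoder(nn.Module):
    def __init__(self, sent_dim=128, hidden_dim=128):
        super().__init__()
        self.sent_rnn = nn.GRUCell(sent_dim, sent_dim)          # low-level GRU over sentences
        self.text2gist = Text2GistCell(sent_dim, hidden_dim)

    def forward(self, sentences, h0):
        # sentences: (T, B, sent_dim) pre-embedded sentence vectors
        # h0: (B, hidden_dim) story encoding used to initialise the hidden state
        T, B, D = sentences.shape
        g = sentences.new_zeros(B, D)
        h = h0
        gists = []
        for t in range(T):
            g = self.sent_rnn(sentences[t], g)                  # local sentence context
            gist, h = self.text2gist(g, h)                      # story-aware gist for image t
            gists.append(gist)
        return torch.stack(gists)                               # (T, B, hidden_dim)
```

In this sketch, each gist vector would be fed to a per-step image generator, and the resulting sequence of frames would then be scored by the image-level and story-level discriminators described above.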

Evaluation and Empirical Results

Two datasets, CLEVR-SV and Pororo-SV, were constructed from existing datasets for comparative evaluation of StoryGAN. Results indicate that StoryGAN surpasses state-of-the-art models in image quality, contextual consistency, and human evaluation. These gains are attributed primarily to the Context Encoder and the dual-discriminator framework, which together produce high-fidelity image sequences that preserve story coherence.
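
As an illustration of how the two discriminators might enter the generator's objective, the sketch below combines an image-level and a story-level adversarial term; the discriminator interfaces, tensor shapes, and equal weighting are assumptions for illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def generator_adversarial_loss(image_disc, story_disc, fake_frames, sent_embs, story_emb):
    """fake_frames: (B, T, C, H, W) generated sequence
    sent_embs:   (B, T, D) per-sentence embeddings
    story_emb:   (B, D) embedding of the whole paragraph"""
    # Image-level term: each frame should look real and match its conditioning sentence.
    img_logits = image_disc(fake_frames.flatten(0, 1), sent_embs.flatten(0, 1))
    img_loss = F.binary_cross_entropy_with_logits(img_logits, torch.ones_like(img_logits))

    # Story-level term: the whole sequence should be consistent with the full story.
    story_logits = story_disc(fake_frames, story_emb)
    story_loss = F.binary_cross_entropy_with_logits(story_logits, torch.ones_like(story_logits))

    return img_loss + story_loss
```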

Numerical Results and Contributions

The paper reports quantitative results using the Structural Similarity Index (SSIM) on the CLEVR-SV dataset, showing closer similarity to ground-truth images for StoryGAN than for competing methods. Human evaluation on the Pororo-SV dataset supplements these numbers with subjective quality assessments, in which StoryGAN consistently scored higher than the baselines. Together, these findings support the paper's claims about StoryGAN's superior performance on the story visualization task.
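
For reference, SSIM between generated and ground-truth frames can be computed with scikit-image as sketched below (the channel_axis argument requires scikit-image 0.19 or later); the paper's exact evaluation protocol may differ.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def sequence_ssim(generated, references):
    """Mean SSIM over a generated image sequence.
    generated, references: equal-length lists of HxWx3 uint8 arrays."""
    scores = [
        ssim(gen, ref, channel_axis=-1, data_range=255)
        for gen, ref in zip(generated, references)
    ]
    return float(np.mean(scores))
```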

Implications and Future Directions

The implications of this research extend beyond the field of story visualization. The novel Text2Gist and dual discriminator approaches introduce promising avenues for further research in sequential data generation, potentially influencing areas such as automated comic strip generation and narrative-based video creation. Furthermore, this work sets the stage for advancements in understanding complex multi-modal sequences, contributing to the larger field of AI-driven content creation.

Looking ahead, refining these methods on more intricate stories and more diverse datasets could push the boundaries of AI-assisted storytelling. Improved computational techniques and the incorporation of richer datasets might further expand the capabilities of StoryGAN and similar models.

In summary, this paper presents significant advancements in the visualization of textual stories through images, suggesting substantial theoretical and practical contributions to the future of AI-generated content.