- The paper introduces VIST-GPT, a framework leveraging large multimodal models and transformer architectures for enhanced visual storytelling.
- VIST-GPT uses a dual encoder setup (CLIP, InternVideo-v2) with Vision-Language Adapters and Phi-3-mini LLM to integrate visual features for coherent narrative generation.
- Evaluation uses novel metrics like RoViST and GROOVIST, demonstrating VIST-GPT's superior performance in visual grounding and coherence compared to previous models.
Visual Storytelling with VIST-GPT: An Analytical Perspective
Visual storytelling is an interdisciplinary task that combines computer vision and natural language processing to transform sequences of images into coherent narratives. The paper "VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?" advances this field by leveraging large multimodal models, with a focus on transformer-based architectures for story generation.
Methodology and Model Architecture
The VIST-GPT framework combines image and video encoders in a dual-encoder setup: CLIP ViT-L/14 captures spatial features of individual images, while InternVideo-v2 models temporal dynamics across the sequence, together providing a comprehensive view of the visuals. Vision-Language (V-L) Adapters project the resulting visual features into a shared embedding space so they can be consumed by the LLM that generates the story, Phi-3-mini-4k-instruct. Token pooling reduces the number of visual tokens fed into the LLM's context, keeping the generated stories visually grounded without redundant input.
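The pipeline can be pictured, very roughly, as in the sketch below. This is not the authors' implementation: the adapter design, pooling stride, token counts, and the InternVideo-v2 feature dimension are illustrative assumptions (CLIP ViT-L/14's 1024-d patch features and Phi-3-mini's 3072-d hidden size are the only real figures used).

```python
import torch
import torch.nn as nn

class VLAdapter(nn.Module):
    """Hypothetical V-L adapter: projects encoder features into the LLM's embedding space."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, tokens, in_dim)
        return self.proj(x)

def pool_tokens(tokens: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Token pooling: average groups of visual tokens to shrink the LLM context."""
    b, n, d = tokens.shape
    n_trim = (n // stride) * stride
    return tokens[:, :n_trim].reshape(b, n_trim // stride, stride, d).mean(dim=2)

# Placeholder feature shapes: 5 images x 256 patch tokens from CLIP ViT-L/14 (1024-d)
# and 5 clip-level tokens from InternVideo-v2 (1408-d assumed); Phi-3-mini hidden size is 3072.
image_feats = torch.randn(1, 5 * 256, 1024)
video_feats = torch.randn(1, 5, 1408)

img_adapter, vid_adapter = VLAdapter(1024, 3072), VLAdapter(1408, 3072)
visual_tokens = torch.cat(
    [pool_tokens(img_adapter(image_feats)), vid_adapter(video_feats)], dim=1
)
# These pooled visual tokens would be prepended to the text embeddings of the story prompt.
print(visual_tokens.shape)  # torch.Size([1, 325, 3072])
```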
Evaluation Metrics and Comparative Analysis
The paper addresses the inadequacies of traditional pattern-matching metrics (BLEU, ROUGE, METEOR, CIDEr) for evaluating storytelling quality. Instead, it adopts reference-free metrics such as RoViST and GROOVIST, which emphasize visual grounding, coherence, and non-redundancy. These metrics align more closely with human judgment and therefore provide a more robust framework for assessing narrative quality.
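To make the idea of grounding-based, reference-free evaluation concrete, the sketch below approximates a visual-grounding score by averaging CLIP image-text similarity over (image, sentence) pairs. This is a stand-in for illustration only, not the RoViST or GROOVIST implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def grounding_score(images: list[Image.Image], sentences: list[str]) -> float:
    """Mean cosine similarity between each story sentence and its paired image."""
    inputs = processor(text=sentences, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    # Compare image i with sentence i (pairwise), then average over the story.
    return (img * txt).sum(dim=-1).mean().item()
```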
VIST-GPT demonstrates superior visual grounding and coherence, as indicated by its GROOVIST and RoViST scores: it achieves a visual grounding score of 0.9962 and a coherence score of 0.7837, surpassing models such as AREL, GLACNet, and MCSM+BART. The Human-to-Machine Distance metric further corroborates VIST-GPT's close alignment with human-written narratives.
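The paper's exact Human-to-Machine Distance definition is not reproduced here; as a loose, hypothetical stand-in, one could compare human and generated stories in a shared sentence-embedding space, as sketched below (the encoder choice is an assumption).

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight sentence encoder

def human_machine_distance(human_stories: list[str], machine_stories: list[str]) -> float:
    """1 minus the mean cosine similarity between aligned human/machine story pairs."""
    h = encoder.encode(human_stories, convert_to_tensor=True, normalize_embeddings=True)
    m = encoder.encode(machine_stories, convert_to_tensor=True, normalize_embeddings=True)
    return 1.0 - util.cos_sim(h, m).diagonal().mean().item()
```

A smaller distance would indicate machine stories that sit closer to human narratives in embedding space, which is the intuition behind this family of metrics.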
Implications and Future Directions
The implications of VIST-GPT’s advancements are extensive, positioning this model as a versatile tool in various domains such as media and content creation, education, and interactive technologies. The ability to generate visually aligned and coherent narratives could enhance user engagement and improve educational tools by offering more immersive storytelling experiences.
However, challenges remain in handling intricate visual scenes and nuanced storytelling dynamics. Future iterations of VIST-GPT may benefit from incorporating more sophisticated LLM architectures or expanded datasets that encompass diverse storytelling contexts. This could enhance narrative richness and adapt the model to a broader range of visuals, ensuring it maintains high fidelity to both imagery and storytelling essence.
Concluding Remarks
Overall, VIST-GPT represents a significant step forward in visual storytelling, bridging the gap between literal image description and deeper narrative creation. By integrating advanced multimodal models with robust evaluation metrics, the approach not only strengthens storytelling capabilities but also opens new avenues for research and application at the intersection of computer vision and language processing.