
Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal Storyteller (2403.07301v1)

Published 12 Mar 2024 in cs.CV

Abstract: Storytelling aims to generate reasonable and vivid narratives based on an ordered image stream. Fidelity to the image story's theme and the divergence of its plots are what keep readers reading. Previous works iteratively improved the alignment of multiple modalities but ultimately generated simplistic storylines for image streams. In this work, we propose a new pipeline, termed LLaMS, to generate multimodal human-level stories that are expressive and consistent. Specifically, by fully exploiting the commonsense knowledge within the LLM, we first employ a sequence data auto-enhancement strategy to enhance factual content expression and leverage a textual reasoning architecture for expressive story generation and prediction. Second, we propose the SQ-Adapter module for story illustration generation, which maintains sequence consistency. Results are obtained through human evaluation to verify the superiority of the proposed LLaMS. Evaluations show that LLaMS achieves state-of-the-art storytelling performance, with an 86% correlation win rate and a 100% consistency win rate compared with previous SOTA methods. Furthermore, ablation experiments verify the effectiveness of the proposed sequence data enhancement and SQ-Adapter.

Authors (7)
  1. Chuanqi Zang (1 paper)
  2. Jiji Tang (7 papers)
  3. Rongsheng Zhang (36 papers)
  4. Zeng Zhao (16 papers)
  5. Tangjie Lv (35 papers)
  6. Mingtao Pei (6 papers)
  7. Wei Liang (76 papers)
Citations (2)
