TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling (2403.11550v1)
Abstract: As a cross-modal task, visual storytelling aims to automatically generate a story for an ordered image sequence. Unlike image captioning, visual storytelling requires not only modeling the relationships between objects within an image but also mining the connections between adjacent images. Recent approaches primarily use either end-to-end or multi-stage frameworks to generate relevant stories, but they usually overlook latent topic information. In this paper, to generate more coherent and relevant stories, we propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST). In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives. We then apply two topic-consistent reinforcement learning rewards that identify the discrepancy between the generated story and the human-labeled story, and use them to refine the whole generation process. Extensive experiments on the VIST dataset, together with human evaluation, demonstrate that our proposed model outperforms most competitive models across multiple evaluation metrics.
Authors: Weiran Chen, Xin Li, Jiaqi Su, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu