Movie101v2: Improved Movie Narration Benchmark (2404.13370v2)
Abstract: Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences. Unlike standard video captioning, it involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality, specifically designed for movie narration. Revisiting the task, we propose breaking down the ultimate goal of automatic movie narration into three progressive stages, offering a clear roadmap with corresponding evaluation metrics. Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation. Our findings highlight that achieving applicable movie narration generation is a fascinating goal that requires significant research.