Movie101v2: Improved Movie Narration Benchmark (2404.13370v2)

Published 20 Apr 2024 in cs.CV, cs.CL, and cs.MM

Abstract: Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences. Unlike standard video captioning, it involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration. Revisiting the task, we propose breaking down the ultimate goal of automatic movie narration into three progressive stages, offering a clear roadmap with corresponding evaluation metrics. Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation. Our findings highlight that achieving applicable movie narration generation is a fascinating goal that requires significant research.
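
As a rough illustration of the kind of reference-based scoring used to benchmark narration quality (this is a minimal sketch, not the paper's own staged evaluation protocol; the example sentences are invented placeholders), the snippet below computes corpus-level BLEU between generated and ground-truth narrations with the sacrebleu library:

```python
# Minimal sketch: reference-based scoring of generated movie narrations.
# NOT the Movie101v2 evaluation code; the sentences are hypothetical, and
# sacrebleu is one common choice for computing corpus-level BLEU.
import sacrebleu

# One generated narration per movie clip.
hypotheses = [
    "A man walks into the dimly lit room and pauses by the window.",
    "She hands him the letter and turns away.",
]

# references[k][i] is the k-th reference for the i-th clip (one set here).
references = [[
    "A man enters the dark room and stops near the window.",
    "She gives him the letter, then turns her back.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"Corpus BLEU: {bleu.score:.2f}")
```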
