
Zero-Shot Long-Form Video Understanding through Screenplay (2406.17309v1)

Published 25 Jun 2024 in cs.CV

Abstract: The Long-form Video Question-Answering task requires the comprehension and analysis of extended video content to respond accurately to questions by utilizing both temporal and contextual information. In this paper, we present MM-Screenplayer, an advanced video understanding system with multi-modal perception capabilities that can convert any video into textual screenplay representations. Unlike previous storytelling methods, we organize video content into scenes as the basic unit, rather than just visually continuous shots. Additionally, we developed a "Look Back" strategy to reassess and validate uncertain information, particularly targeting breakpoint mode. MM-Screenplayer achieved the highest score in the CVPR'2024 LOng-form VidEo Understanding (LOVEU) Track 1 Challenge, with a global accuracy of 87.5% and a breakpoint accuracy of 68.8%.

Authors (11)
  1. Yongliang Wu (10 papers)
  2. Bozheng Li (9 papers)
  3. Jiawang Cao (7 papers)
  4. Wenbo Zhu (17 papers)
  5. Yi Lu (145 papers)
  6. Weiheng Chi (3 papers)
  7. Chuyun Xie (1 paper)
  8. Haolin Zheng (2 papers)
  9. Ziyue Su (2 papers)
  10. Jay Wu (6 papers)
  11. Xu Yang (222 papers)
Citations (1)