
MM-VID: Advancing Video Understanding with GPT-4V(ision) (2310.19773v1)

Published 30 Oct 2023 in cs.CV

Abstract: We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for LLMs to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphical user interfaces.

Authors (12)
  1. Kevin Lin (98 papers)
  2. Faisal Ahmed (16 papers)
  3. Linjie Li (89 papers)
  4. Chung-Ching Lin (36 papers)
  5. Ehsan Azarnasab (2 papers)
  6. Zhengyuan Yang (86 papers)
  7. Jianfeng Wang (149 papers)
  8. Lin Liang (11 papers)
  9. Zicheng Liu (153 papers)
  10. Yumao Lu (8 papers)
  11. Ce Liu (51 papers)
  12. Lijuan Wang (133 papers)
Citations (45)