M-LLM Based Video Frame Selection for Efficient Video Understanding (2502.19680v2)

Published 27 Feb 2025 in cs.CV and cs.AI

Abstract: Recent advances in Multi-Modal LLMs (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the model, particularly for long-context videos. However, uniform sampling can discard crucial context from certain periods of a video, leaving the downstream M-LLM with insufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames more relevant to the user's query. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which a single frame's importance is scored by prompting an M-LLM; and (ii) a temporal signal, in which multiple frames are selected by prompting an LLM with the captions of all candidate frames. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video LLMs (video-LLMs) across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.
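To make the contrast between naive uniform sampling and query-adaptive selection concrete, here is a minimal sketch (not the authors' code): per-frame relevance scores are mocked as plain floats, whereas in the paper they would come from prompting an M-LLM per frame (the spatial signal) and an LLM over frame captions (the temporal signal).

```python
# Hypothetical sketch of query-adaptive frame selection vs. uniform sampling.
# Scores are mocked; the paper derives them from M-LLM / LLM prompting.

def uniform_sample(num_frames: int, k: int) -> list[int]:
    """Baseline: pick k evenly spaced frame indices."""
    step = num_frames / k
    return [int(i * step) for i in range(k)]

def select_top_k(scores: list[float], k: int) -> list[int]:
    """Adaptive selection: keep the k highest-scoring frames,
    returned in temporal order for the downstream video-LLM."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)

# Mocked query-relevance scores for 8 candidate frames.
scores = [0.1, 0.05, 0.9, 0.8, 0.2, 0.85, 0.1, 0.3]

print(uniform_sample(len(scores), 4))  # -> [0, 2, 4, 6]
print(select_top_k(scores, 4))         # -> [2, 3, 5, 7]
```

Uniform sampling picks indices regardless of content, while the adaptive selector concentrates the frame budget on the segment most relevant to the query; the selected frames would then be passed unchanged to the frozen downstream video M-LLM.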

Authors (11)
  1. Kai Hu (55 papers)
  2. Feng Gao (239 papers)
  3. Xiaohan Nie (6 papers)
  4. Peng Zhou (136 papers)
  5. Son Tran (22 papers)
  6. Tal Neiman (7 papers)
  7. Lingyun Wang (16 papers)
  8. Mubarak Shah (207 papers)
  9. Raffay Hamid (12 papers)
  10. Bing Yin (56 papers)
  11. Trishul Chilimbi (22 papers)