
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model (2407.07577v1)

Published 10 Jul 2024 in cs.CV and cs.AI

Abstract: The rapid advancement of Large Vision-Language Models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenario, while their ability to associate instances across different scenes has not yet been explored, which is essential for understanding complex visual content, such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unleash the potential of character identity memory and recognition across multiple visual scenarios. To achieve this goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark, MM-ID, to examine LVLMs on instance ID memory and recognition across four dimensions: matching, location, question-answering, and captioning. Our findings highlight the limitations of existing LVLMs in recognizing and associating instance identities with ID reference. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.
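
The abstract describes instruction tuning with ID references (named reference images of characters paired with queries about new scenes) and the four MM-ID task dimensions. The paper page does not expose the actual data schema, so the sketch below is only an illustration of the general idea: the field names, the `<Tony>`-style ID tags, and the prompt layout are assumptions for exposition, not the authors' format.

```python
# Illustrative sketch only: field names, ID tags, and prompt layout are assumptions,
# not the data format used by IDA-VLM. It shows the general shape of an ID-referenced
# instruction sample: named reference crops plus a query about a new scene.
from dataclasses import dataclass
from typing import Dict, List

# The four evaluation dimensions named in the abstract for the MM-ID benchmark.
MM_ID_TASKS = ["matching", "location", "question-answering", "captioning"]

@dataclass
class IDReferenceSample:
    """One ID-aware instruction example: reference identities plus a scene query."""
    ref_images: Dict[str, str]        # character ID tag -> reference crop path, e.g. {"<Tony>": "refs/tony.jpg"}
    scene_images: List[str]           # frames from a different scene/shot
    instruction: str                  # query that mentions the ID tags
    answer: str                       # target response grounded in the referenced identities
    task: str = "question-answering"  # one of MM_ID_TASKS

def build_prompt(sample: IDReferenceSample) -> str:
    """Flatten a sample into a single text prompt with image placeholders."""
    assert sample.task in MM_ID_TASKS, f"unknown MM-ID task: {sample.task}"
    ref_part = " ".join(f"{cid}: <image:{path}>" for cid, path in sample.ref_images.items())
    scene_part = " ".join(f"<image:{p}>" for p in sample.scene_images)
    return f"Reference identities: {ref_part}\nScene: {scene_part}\n{sample.instruction}"

if __name__ == "__main__":
    sample = IDReferenceSample(
        ref_images={"<Tony>": "refs/tony.jpg", "<Pepper>": "refs/pepper.jpg"},
        scene_images=["scenes/frame_001.jpg", "scenes/frame_002.jpg"],
        instruction="Which referenced character appears in the second frame, and where is <Tony>?",
        answer="<Tony> appears in the second frame, on the left side of the shot.",
    )
    print(build_prompt(sample))
```

The design point the sketch tries to capture is that identity grounding is carried by explicit ID tags shared between the reference images and the query, so the model must memorize and re-recognize the same instance across scenes rather than rely on single-image context.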

Authors (9)
  1. Yatai Ji (15 papers)
  2. Shilong Zhang (32 papers)
  3. Jie Wu (230 papers)
  4. Peize Sun (33 papers)
  5. Weifeng Chen (22 papers)
  6. Xuefeng Xiao (51 papers)
  7. Sidi Yang (6 papers)
  8. Yujiu Yang (155 papers)
  9. Ping Luo (340 papers)
Citations (2)
