Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning (2404.03657v2)

Published 4 Apr 2024 in cs.CV and cs.AI

Abstract: We propose the new task 'open-world video instance segmentation and captioning'. It requires to detect, segment, track and describe with rich captions never before seen objects. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a LLM by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially-diverse open-world object queries to discover never before seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never before seen objects, and by 10% on object-centric captions.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Anwesa Choudhuri (9 papers)
  2. Girish Chowdhary (69 papers)
  3. Alexander G. Schwing (62 papers)
Citations (2)