Towards Open-Vocabulary Video Instance Segmentation
This paper, "Towards Open-Vocabulary Video Instance Segmentation," addresses a significant limitation in current Video Instance Segmentation (VIS) techniques: their inability to handle objects from novel categories not seen during training. Traditional VIS methods operate within a closed set of predefined categories, which restricts their applicability in real-world scenarios where the models frequently encounter novel objects. The authors propose Open-Vocabulary Video Instance Segmentation (OV-VIS) as a solution to segment, track, and classify objects from open-set categories, encompassing both seen and unseen categories.
Contributions
This paper makes several critical contributions to the field of VIS:
- Task Definition: The authors define the novel task of OV-VIS, emphasizing the need for models to generalize beyond the fixed vocabulary used during training and addressing the practical challenges associated with this paradigm.
- Dataset Construction: To support benchmarking of OV-VIS, the authors construct the Large-Vocabulary Video Instance Segmentation dataset (LV-VIS). LV-VIS contains objects from 1,196 diverse categories, far exceeding the number of categories in existing datasets such as YouTube-VIS and BURST, and is crucial for evaluating how well models generalize to unseen categories.
- Model Architecture: The authors propose OV2Seg, an end-to-end model designed for OV-VIS. Built on a Memory-Induced Transformer architecture, it segments, tracks, and classifies objects in videos at near real-time inference speed. The model detects objects with class-agnostic object queries, maintains a memory mechanism that captures long-term dependencies across frames, and classifies objects against the text embeddings of a pre-trained vision-language model (a CLIP text encoder) for open-vocabulary classification; a minimal sketch of this classification step follows this list.
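To make the open-vocabulary classification step concrete, here is a minimal, self-contained sketch (not the authors' released code) of scoring class-agnostic query embeddings against category-name text embeddings. The function and tensor names are illustrative; in the actual model the text embeddings would come from a frozen CLIP text encoder, whereas here they are passed in as a pre-computed tensor.

```python
# Hedged sketch: open-vocabulary classification of class-agnostic object queries
# by cosine similarity against category-name text embeddings. In OV2Seg the text
# embeddings come from a frozen CLIP text encoder; here they are given as inputs.
import torch
import torch.nn.functional as F

def classify_queries(query_embeds: torch.Tensor,   # (num_queries, d) per-object embeddings
                     text_embeds: torch.Tensor,    # (num_categories, d) category embeddings
                     temperature: float = 0.01) -> torch.Tensor:
    """Return (num_queries, num_categories) classification logits."""
    q = F.normalize(query_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    # Cosine similarity scaled by a temperature, as in CLIP-style classifiers.
    return q @ t.T / temperature

# Toy usage: 100 object queries, 1,196 category embeddings of dimension 256.
logits = classify_queries(torch.randn(100, 256), torch.randn(1196, 256))
scores = logits.softmax(dim=-1)   # per-query distribution over the open vocabulary
```

Because the classifier is just a similarity against text embeddings, the vocabulary can be swapped or extended at inference time without retraining the visual backbone.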
Results and Findings
OV2Seg demonstrates strong performance on both base and novel categories of the LV-VIS dataset. Notably, it achieves a relative improvement of up to 120% on novel categories over two-stage baselines such as DetPro-XMem, highlighting its capacity for zero-shot generalization to unseen categories. The Memory-Induced Transformer supports robust tracking and classification through memory queries that aggregate object features across frames into a video-level representation of each instance's identity; a simplified sketch of this update is given below.
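The paper describes memory queries that are refreshed as the video progresses; the sketch below is one plausible reading of that mechanism, assuming a momentum (exponential moving average) update after matching current-frame object queries to memory queries via Hungarian matching. All names are hypothetical and the details may differ from the released implementation.

```python
# Hedged sketch of memory-induced tracking: memory queries carry each instance's
# identity across frames and are refreshed by a momentum (EMA) update after being
# matched to the current frame's object queries. Illustrative only.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def update_memory(memory: torch.Tensor,         # (num_queries, d) video-level memory queries
                  frame_queries: torch.Tensor,  # (num_queries, d) current-frame object queries
                  momentum: float = 0.8) -> torch.Tensor:
    """Match frame queries to memory queries, then blend them with momentum."""
    sim = F.normalize(memory, dim=-1) @ F.normalize(frame_queries, dim=-1).T
    row, col = linear_sum_assignment((-sim).numpy())   # maximize total similarity
    matched = frame_queries[torch.as_tensor(col)]
    # EMA keeps long-term appearance while absorbing the newest observation.
    return momentum * memory + (1.0 - momentum) * matched

memory = torch.randn(100, 256)   # e.g., initialized from the first frame's queries
for frame_queries in [torch.randn(100, 256) for _ in range(5)]:
    memory = update_memory(memory, frame_queries)
```

The momentum update is what lets the model keep a stable identity for each object even when individual frames are occluded or blurred.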
Moreover, OV2Seg reaches a competitive inference speed of 20.1 fps with a ResNet-50 backbone, which is notable given the complexity of the task. The model also performs well on existing VIS datasets such as YouTube-VIS 2019, YouTube-VIS 2021, BURST, and OVIS without additional video-specific training, illustrating its versatility across domains.
Implications and Future Work
This research paves the way for more adaptable and efficient VIS systems that can dynamically handle the complexity inherent in open-world settings. The large-scale LV-VIS dataset offers an indispensable tool for training and evaluating models to enhance their generalization capabilities. Practically, OV2Seg opens new possibilities in applications where real-time accurate segmentation and classification in dynamic environments are paramount, such as autonomous driving and video surveillance.
Theoretically, this work points to a promising direction for deeper integration of vision and language, suggesting that the architecture can be improved further as vision-language models and memory mechanisms advance to handle even broader vocabularies. Future work might explore semi-supervised learning to overcome the limitations of sparsely annotated datasets and to better discern subtle distinctions between visually similar categories. Additionally, refining the architecture for more efficient computation could improve scalability and deployment in resource-constrained settings.
In conclusion, the proposed methodology and dataset significantly advance the frontier of video instance segmentation by extending its operational scope to open-vocabulary settings, offering valuable insights for researchers focusing on multi-modal AI systems.