MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Published 3 Feb 2023 in cs.CV | (2302.01872v1)

Abstract: Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings on the proposed MOSE dataset and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot well perceive objects in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes and more efforts are desired to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE.

Summary

The paper introduces MOSE, a dataset that rigorously tests video object segmentation methods in realistic, complex scenarios with occlusions and dynamic changes.
The methodology benchmarks 18 state-of-the-art VOS algorithms across four settings, revealing significant performance declines under challenging conditions.
The results emphasize the need for improved object re-identification, occlusion handling, and long-term tracking techniques in video sequences.

The paper presents the development and evaluation of MOSE, a dataset designed for studying Video Object Segmentation (VOS) in complex, realistic scenarios. Contemporary VOS methods achieve high performance on datasets where target objects are isolated and prominent, so MOSE revisits the task with an emphasis on challenging environments. It comprises 2,149 video clips featuring 5,200 objects from 36 categories, supporting a thorough analysis of object segmentation under occlusion, crowding, and objects that disappear and reappear. Its focus on complex scenes makes it a rigorous testbed for evaluating the robustness of existing VOS algorithms.

Dataset Significance

MOSE endeavors to fill a gap in VOS research by providing a dataset that mirrors real-world challenges. Existing datasets often feature videos where target objects are easily discernible; MOSE intentionally incorporates crowded scenes, occlusions, and targets disappearing and reappearing throughout sequences. This design choice prompts a re-evaluation of current state-of-the-art VOS methods, demanding improvements in object association, recognition of subtle features, and long-term temporal tracking. The dataset's extensive collection of 431,725 segmentation masks further assists in uncovering these intricacies by offering a comprehensive ground truth for benchmarking.
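To give a rough sense of scale, the headline counts quoted above can be turned into per-video and per-object averages. The short sketch below does only that arithmetic; the derived averages are not figures reported in the summary, and the variable names are chosen here for illustration.

```python
# Headline counts as stated in the abstract.
num_videos = 2149
num_objects = 5200
num_masks = 431725

# Simple averages derived from those counts (not reported by the source).
objects_per_video = num_objects / num_videos
masks_per_video = num_masks / num_videos
masks_per_object = num_masks / num_objects

print(f"objects per video: {objects_per_video:.2f}")  # 2.42
print(f"masks per video:   {masks_per_video:.2f}")    # 200.90
print(f"masks per object:  {masks_per_object:.2f}")   # 83.02
```

These are averages only; they say nothing about how clip length or object count is distributed across the dataset.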

Benchmarking and Analysis

The study benchmarks 18 recent state-of-the-art VOS algorithms across four settings: semi-supervised with mask initialization, semi-supervised with box initialization, unsupervised, and interactive VOS. Performance declines significantly on MOSE: the highest $\mathcal{J}\&\mathcal{F}$ score in the semi-supervised setting is only 59.4%, compared with approximately 90% on established datasets such as DAVIS. This stark contrast highlights unresolved challenges in VOS, particularly in scenarios where objects blend into dense environments or change appearance drastically. The results support the hypothesis that present VOS methods are not yet able to handle complex real-world video sequences.
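For readers unfamiliar with the metric, $\mathcal{J}\&\mathcal{F}$ is conventionally the average of region similarity $\mathcal{J}$ (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and boundary accuracy $\mathcal{F}$ (a boundary F-measure). The sketch below computes $\mathcal{J}$ for a single pair of binary masks and combines already-computed means into $\mathcal{J}\&\mathcal{F}$. It is a minimal illustration, not the official evaluation code: the function names are hypothetical, the boundary measure is taken as an input rather than implemented, and scoring two empty masks as 1.0 is an assumption.

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index (IoU) between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j_mean: float, f_mean: float) -> float:
    """Average of mean region similarity and mean boundary accuracy."""
    return (j_mean + f_mean) / 2.0

# Toy example: the prediction covers 4 pixels, the ground truth 6,
# and the prediction lies entirely inside the ground truth.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:3] = True

j = region_similarity(pred, gt)  # intersection 4 / union 6
print(round(j, 4))               # 0.6667
```

In a full evaluation, $\mathcal{J}$ and $\mathcal{F}$ would be averaged over all annotated frames and objects before being combined.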

Implications and Future Directions

MOSE's development prompts the research community to pivot towards new mechanisms for handling video sequences with dynamic and obstructive elements. The dataset's focus stresses the necessity for:

Enhanced Object Re-identification: Refining association techniques to accurately track objects that temporarily disappear or change appearance across frames.
Occlusion Handling: Investigating methodologies to improve segmentation accuracy in scenarios containing heavy occlusions and indistinct object boundaries.
Focus on Small/Inconspicuous Objects: Addressing the challenges of detecting and tracking less salient objects within crowded scenes.
Crowd Analysis: Segmenting a target among many visually similar objects calls for more discriminative algorithms.
Long-term Video Adaptation: Developing algorithms capable of processing and analyzing extended video sequences without prohibitive computational expense.
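The disappearance and reappearance behavior that motivates several of these directions can be made concrete with a small sketch. Assuming each object comes with one binary mask per frame, the hypothetical helper below marks a frame as visible when its mask contains at least one foreground pixel and reports the frames at which visibility changes. It illustrates the phenomenon only and is not part of the MOSE toolkit.

```python
from typing import List, Tuple

import numpy as np

def visibility_events(masks: List[np.ndarray]) -> Tuple[List[int], List[int]]:
    """Return (disappear_frames, reappear_frames) for one object.

    A frame is 'visible' if its mask has any foreground pixel. If the
    object is absent in frame 0, its first appearance is also listed
    under reappear_frames.
    """
    visible = [bool(np.any(m)) for m in masks]
    disappear: List[int] = []
    reappear: List[int] = []
    for t in range(1, len(visible)):
        if visible[t - 1] and not visible[t]:
            disappear.append(t)
        elif not visible[t - 1] and visible[t]:
            reappear.append(t)
    return disappear, reappear

# Toy sequence: visible in frames 0-1, fully occluded in 2-3, back in 4.
present = np.ones((2, 2), dtype=bool)
absent = np.zeros((2, 2), dtype=bool)
sequence = [present, present, absent, absent, present]

events = visibility_events(sequence)
print(events)  # ([2], [4])
```

A method evaluated on such a sequence has to re-identify the object at frame 4 after two frames without any visible evidence, which is the kind of case the list above refers to.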

In conclusion, the MOSE dataset provides crucial insights into the limitations of current VOS methods and lays the groundwork for future advancements. By highlighting real-world conditions and their complexities, this work catalyzes a necessary evolution in AI-based video understanding, compelling the development of models that are both computationally efficient and versatile across varied environments.
