- The paper presents a novel task that integrates pixel-level segmentation with multi-object tracking to overcome the limitations of bounding box annotations.
- It introduces enhanced datasets with 65,213 pixel masks and new metrics like sMOTSA to comprehensively evaluate detection, tracking, and segmentation.
- The baseline TrackR-CNN model, which jointly addresses detection, tracking, and segmentation in a unified framework, achieves higher accuracy than methods that handle these tasks separately.
Multi-Object Tracking and Segmentation (MOTS)
The paper extends the standard multi-object tracking (MOT) task with pixel-level segmentation, resulting in a task termed Multi-Object Tracking and Segmentation (MOTS). The motivation stems from the limitations of bounding boxes in crowded or occluded scenes, where the boxes of nearby objects overlap heavily and only pixel-precise masks delineate objects unambiguously. MOTS therefore requires, for every frame, a set of non-overlapping segmentation masks, each carrying a track identity that stays consistent over time.
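As a concrete illustration (a minimal sketch, not the authors' code), a MOTS result for one frame can be represented as a mapping from persistent track IDs to binary masks, with the non-overlap constraint checked explicitly; the function name and data layout here are assumptions:

```python
import numpy as np

def is_valid_mots_frame(masks):
    """Check the MOTS non-overlap constraint for one frame.

    masks: dict mapping a persistent track ID (int) to a binary
    mask (np.ndarray of 0/1) over the image. Every pixel may
    belong to at most one object.
    """
    if not masks:
        return True
    coverage = np.zeros_like(next(iter(masks.values())), dtype=np.int32)
    for mask in masks.values():
        coverage += mask.astype(np.int32)
    return int(coverage.max()) <= 1

# Usage: two 4x4 masks for track IDs 1 and 2 that do not overlap.
frame = {
    1: np.array([[1, 1, 0, 0]] * 4, dtype=np.uint8),
    2: np.array([[0, 0, 1, 1]] * 4, dtype=np.uint8),
}
assert is_valid_mots_frame(frame)
```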
Dataset and Annotation Process
To facilitate MOTS, the authors augment existing MOT datasets, including KITTI and MOTChallenge, with pixel-level annotations, yielding 65,213 pixel masks for 977 distinct objects (cars and pedestrians) across 10,870 video frames. The annotations were produced with a semi-automatic pipeline that combines convolutional networks for initial mask generation with iterative human corrections, striking a balance between automation and manual accuracy.
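One way to picture this semi-automatic pipeline is the loop below, a hypothetical sketch under assumed interfaces rather than the authors' tooling: a network proposes a mask per frame from the existing bounding box, a human corrects the worst proposals, and the network is fine-tuned on those corrections before the next round.

```python
def annotate_object(model, frames_with_boxes, reviewer, per_round=3):
    """Semi-automatic mask annotation for one tracked object.

    Hypothetical sketch: `model.segment`, `model.fine_tune`,
    `reviewer.needs_correction`, and `reviewer.manual_fix` are
    assumed placeholder interfaces, not APIs from the paper.
    """
    corrections = {}  # frame index -> manually corrected mask
    while True:
        # Network proposes a mask for each frame from its box.
        masks = {i: model.segment(frame, box)
                 for i, (frame, box) in enumerate(frames_with_boxes)}
        masks.update(corrections)  # manual fixes always win
        bad = [i for i in masks if reviewer.needs_correction(i, masks[i])]
        if not bad:
            return masks  # every mask accepted: annotation done
        for i in bad[:per_round]:  # correct a few masks per round
            corrections[i] = reviewer.manual_fix(i)
        # Adapt the network to this object using the corrections.
        model.fine_tune(corrections)
```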
Evaluation Measures
The authors adapt the well-established CLEAR MOT metrics to masks, proposing MOTSA, MOTSP, and the soft Multi-Object Tracking and Segmentation Accuracy (sMOTSA), which credits each true positive with its mask IoU rather than counting it as one, so that detection, tracking, and segmentation quality are assessed jointly. Because neither ground-truth nor predicted masks may overlap within a frame, matching by mask IoU above 0.5 yields a unique assignment between hypotheses and ground truth.
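Concretely, with $TP$, $FP$, and $IDS$ the true positives, false positives, and ID switches, $M$ the set of ground-truth masks, and $c(h)$ the ground-truth mask matched to hypothesis $h$, the accuracy measures take the form:

$$\text{MOTSA} = \frac{|TP| - |FP| - |IDS|}{|M|}, \qquad \text{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|},$$

where $\widetilde{TP} = \sum_{h \in TP} \mathrm{IoU}\big(h, c(h)\big)$ is the soft true-positive count.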
Baseline Method and Experimental Results
The paper introduces TrackR-CNN, a baseline built upon Mask R-CNN and augmented with temporal context via 3D convolutions and an association head that outputs an embedding vector per detection; detections are linked over time by the distance between these embeddings. The system handles detection, tracking, and segmentation as interconnected tasks within a single framework. Notably, TrackR-CNN achieves higher sMOTSA and MOTSA scores than methods that decouple these tasks, and the experiments demonstrate its advantage over purely bounding-box-based approaches while underscoring the benefits of joint training on the MOTS datasets.
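The linking step can be sketched as follows (a minimal greedy illustration, not the authors' implementation; the threshold `max_dist` and the rule for carrying track embeddings forward are assumptions, and the full system additionally scores and prunes tracks):

```python
import numpy as np

def link_detections(tracks, detections, max_dist=0.7):
    """Greedily extend tracks with new detections by embedding distance.

    tracks:     dict track_id -> last association embedding (1-D array)
    detections: list of (mask, embedding) pairs for the current frame
    Returns:    dict track_id -> mask for the current frame.
    Hypothetical sketch; the threshold value is an assumption.
    """
    assignments, used = {}, set()
    # Rank all (track, detection) pairs from closest to farthest.
    pairs = sorted(
        (float(np.linalg.norm(emb - t_emb)), tid, d)
        for tid, t_emb in tracks.items()
        for d, (_, emb) in enumerate(detections)
    )
    for dist, tid, d in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if tid in assignments or d in used:
            continue
        assignments[tid] = detections[d][0]
        tracks[tid] = detections[d][1]  # carry the embedding forward
        used.add(d)
    # Unmatched detections start new tracks with fresh IDs.
    next_id = max(tracks, default=0) + 1
    for d, (mask, emb) in enumerate(detections):
        if d not in used:
            tracks[next_id] = emb
            assignments[next_id] = mask
            next_id += 1
    return assignments
```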
Implications and Future Directions
By addressing both practical and theoretical gaps in current MOT methodologies, this research lays the groundwork for further advancements in pixel-level tracking. The datasets and benchmarks introduced are poised to become critical resources for advancing tracking technologies beyond 2D bounding boxes. Future work may explore optimizing computational efficiency, enhancing temporal integration, and extending these methods to broader applications beyond street scenarios.
In conclusion, the integration of segmentation with tracking in the MOTS task represents a significant step towards more nuanced and accurate computer vision models. The methodologies and resources introduced by the authors offer a platform for further exploration and innovation within the field.