
TAO: A Large-Scale Benchmark for Tracking Any Object (2005.10356v1)

Published 20 May 2020 in cs.CV

Abstract: For many years, multi-object tracking benchmarks have focused on a handful of categories. Motivated primarily by surveillance and self-driving applications, these datasets provide tracks for people, vehicles, and animals, ignoring the vast majority of objects in the world. By contrast, in the related field of object detection, the introduction of large-scale, diverse datasets (e.g., COCO) have fostered significant progress in developing highly robust solutions. To bridge this gap, we introduce a similarly diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. Importantly, we adopt a bottom-up approach for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks. To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum. Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets. To ensure scalability of annotation, we employ a federated approach that focuses manual effort on labeling tracks for those relevant objects in a video (e.g., those that move). We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries regarding large-vocabulary tracking in an open-world. In particular, we show that existing single- and multi-object trackers struggle when applied to this scenario in the wild, and that detection-based, multi-object trackers are in fact competitive with user-initialized ones. We hope that our dataset and analysis will boost further progress in the tracking community.

Authors (5)
  1. Achal Dave (31 papers)
  2. Tarasha Khurana (8 papers)
  3. Pavel Tokmakov (32 papers)
  4. Cordelia Schmid (206 papers)
  5. Deva Ramanan (152 papers)
Citations (160)

Summary

  • The paper introduces TAO, a dataset with 2,907 videos and 833 object categories, enabling evaluation of diverse, long-term multi-object tracking performance.
  • It employs a bottom-up, federated annotation strategy that prioritizes dynamic objects and reduces manual labeling efforts.
  • Empirical analysis reveals that state-of-the-art trackers struggle with TAO’s extensive vocabulary, underscoring the need for improved tracking techniques.

Insights into TAO: A Benchmark for Tracking Any Object

The paper introduces the TAO (Tracking Any Object) benchmark, a comprehensive dataset designed for evaluating multi-object tracking systems with a focus on diversity and scale. TAO distinguishes itself from previous benchmarks by covering a wide range of object categories, promoting advancements in long-term and large-vocabulary object tracking under realistic conditions.

Dataset Overview

TAO is composed of 2,907 high-resolution videos sourced from a range of environments, representing a significant increase in both complexity and diversity compared to existing tracking datasets. This collection results in a novel category distribution, capturing everyday objects that pose distinctive tracking challenges. The dataset encompasses 833 object categories, an order of magnitude larger than previous benchmarks.

Methodology and Contributions

The paper highlights a bottom-up approach for identifying a broad vocabulary of object categories. Annotators labeled objects based on motion, emphasizing dynamic occurrences over a static set of predefined categories. This approach is inspired by previous methodologies used for large-scale image datasets like LVIS and COCO.

TAO's annotation process introduces a federated strategy that concentrates tracking annotations on the most relevant objects in each video (e.g., those that move), making efficient use of manual labeling effort. The evaluation protocol employs federated mAP, among other metrics, providing a robust measure of system performance across varied scenarios and enabling fair assessment of trackers under a large-vocabulary regime.
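The core idea of federated evaluation is that a category is scored only on videos where its presence or absence has been verified; unannotated videos neither reward nor penalize predictions for that category. The sketch below is a simplified illustration of that principle, not the official TAO metric: the function names and the simplified AP definition (mean precision at each true positive, divided by the ground-truth count) are my own.

```python
def average_precision(scored_hits, num_gt):
    """Simplified AP: sort predictions by score, take the precision at
    each true positive, and average over the number of ground truths."""
    if num_gt == 0:
        return 0.0
    scored_hits = sorted(scored_hits, key=lambda x: -x[0])
    tp, precisions = 0, []
    for rank, (_, is_tp) in enumerate(scored_hits, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / num_gt

def federated_ap(category_preds, pos_videos, neg_videos, num_gt):
    """Score one category only on videos where its presence (pos_videos)
    or absence (neg_videos) was verified; all other videos are ignored,
    so unverified false alarms do not hurt the score."""
    eval_videos = pos_videos | neg_videos
    pooled = [p for vid, preds in category_preds.items()
              if vid in eval_videos
              for p in preds]
    return average_precision(pooled, num_gt)
```

For example, a high-scoring detection in a video where the category was never verified (neither positive nor negative) is simply excluded from the pool, so it cannot drag precision down, which is what makes annotation of only a subset of videos per category tractable.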

Empirical Analysis

The paper conducts evaluations using state-of-the-art trackers. A significant finding is the limited generalization of existing tracking algorithms when applied to TAO. Both single-object and multi-object trackers, traditionally robust under controlled conditions, exhibit decreased performance across TAO's diverse and challenging test suite. The empirical results underscore the pronounced difficulty posed by large-vocabulary tracking in dynamic environments.

Moreover, the paper shows that detection-based multi-object trackers are competitive with methods that require user initialization, a finding with broader implications for the development of open-world tracking solutions.
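The detection-based paradigm can be illustrated with a minimal tracking-by-detection sketch: per-frame detections are linked into tracks by greedy IoU association. This is an assumed, illustrative baseline of the general technique, not any specific tracker evaluated in the paper; the threshold and function names are my own.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def link_detections(frames, iou_thresh=0.5):
    """Greedy linking: each detection joins the track whose box in the
    previous frame overlaps it most (above iou_thresh); otherwise it
    starts a new track. frames is a list of per-frame box lists."""
    tracks = []  # each track is a list of (frame_index, box)
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_f, last_box = track[-1]
            if last_f != t - 1 or not unmatched:
                continue  # track already ended, or nothing left to match
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_thresh:
                track.append((t, best))
                unmatched.remove(best)
        for b in unmatched:
            tracks.append([(t, b)])
    return tracks
```

One appeal of this family of methods in a large-vocabulary setting is that the detector carries the category knowledge, so the tracker needs no per-object user initialization.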

Implications and Future Directions

TAO offers a substantial contribution to the tracking community by setting a new benchmark that rigorously tests the flexibility and generalization of object trackers. The extensive empirical analysis presented reveals crucial bottlenecks in current methodologies, emphasizing the need for innovative solutions capable of handling diverse and long-duration sequences effectively.

The paper suggests that advancements in combining instance segmentation, motion prediction, and long-term association could be pivotal for tackling the challenges outlined by TAO. By integrating such elements, future research may foster the development of more holistic tracking systems, capable of operating efficiently in real-world settings.

In conclusion, TAO elevates the standard for multi-object tracking benchmarks, providing both a rich dataset for evaluation and a robust framework for assessing progress in this field. Researchers are encouraged to adopt TAO in future tracking studies, with a view to closing the performance gaps identified and pushing the boundaries of what is achievable in real-world object tracking applications.