- The paper introduces dense, image-related queries that replace sparse queries in transformers, enhancing detection in complex, crowded scenes.
- It utilizes Query Learning Networks to convert multi-scale encoder features into pixel-level detection queries and efficient sparse tracking queries.
- Empirical results on MOT17 and MOT20 show improved tracking accuracy and reduced false detections compared to state-of-the-art methods.
Transformers with Dense Representations for Multiple-Object Tracking: An Examination of TransCenter
The paper presents a detailed exploration of TransCenter, a novel transformer-based method for multiple-object tracking (MOT) that uses dense representations to address the challenges of crowded scenes. Leveraging transformers' ability to model long-range dependencies, the authors replace the conventional sparse, noise-initialized queries with dense, image-related queries, hypothesizing that dense queries enable more effective and robust object tracking.
Methodological Insights
TransCenter introduces several key architectural components. First, it discards the small, fixed set of sparse queries common in related works such as DETR-based architectures. Instead, TransCenter leverages dense, image-related queries, that is, queries aligned with the image's resolution and structure, which provide exhaustive spatial coverage and allow object features to be localized accurately even in crowded scenes.
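To make the contrast concrete, here is a minimal PyTorch sketch, with assumed shapes and illustrative names rather than the authors' code, of dense image-aligned queries (one per spatial location) versus DETR-style sparse learned queries:

```python
import torch
import torch.nn as nn

class DenseQueries(nn.Module):
    """Produce one query per spatial location of an encoder feature map."""
    def __init__(self, channels: int, query_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, query_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W), one scale of the multi-scale encoder output
        q = self.proj(feats)                 # (B, D, H, W)
        return q.flatten(2).transpose(1, 2)  # (B, H*W, D): image-aligned queries

# DETR-style contrast: a small, fixed set of learned queries that are
# independent of image content and resolution.
sparse_queries = nn.Parameter(torch.randn(100, 256))
```

The dense variant scales with the image, so every location in a crowded frame has a query that can claim an object, whereas the sparse set caps how many spatially nearby objects can be represented.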
The proposed Query Learning Networks (QLN) play a crucial role in translating the multi-scale features from the transformer encoder into dense detection queries and sparse tracking queries. The detection queries are specifically designed to operate at pixel-level resolution, which significantly enhances the method’s ability to detect overlapping objects in densely populated frames. On the other hand, sparse tracking queries enable efficient temporal association by leveraging prior information about object locations in previous frames, thereby reducing computational demands without sacrificing tracking fidelity.
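The paper's description suggests roughly the following division of labor; the sketch below is a hedged interpretation (the class name, heads, and tensor shapes are assumptions, not the published QLN), in which dense detection queries come from a pixel-wise projection of the encoder features and sparse tracking queries are sampled at previous-frame object centers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QLNSketch(nn.Module):
    """Dense detection queries + sparse tracking queries from one feature scale."""
    def __init__(self, channels: int, dim: int):
        super().__init__()
        self.det_head = nn.Conv2d(channels, dim, kernel_size=1)  # dense, pixel-level
        self.track_head = nn.Linear(channels, dim)               # sparse, per object

    def forward(self, feats: torch.Tensor, prev_centers: torch.Tensor):
        # feats: (B, C, H, W) encoder features
        # prev_centers: (B, N, 2) object centers from the previous frame, in [-1, 1]
        det_queries = self.det_head(feats)  # (B, D, H, W): one query per pixel
        # Sample encoder features at the previous-frame centers so each tracked
        # object gets a single query carrying its prior location information.
        sampled = F.grid_sample(feats, prev_centers.unsqueeze(2),
                                align_corners=False)  # (B, C, N, 1)
        track_queries = self.track_head(
            sampled.squeeze(-1).transpose(1, 2))      # (B, N, D)
        return det_queries, track_queries
```

The asymmetry is the point: detection must cover every pixel, but tracking only needs one query per already-known object, which keeps the temporal association cheap.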
Central to TransCenter is its decoder, which employs a deformable attention mechanism to keep complexity linear in the input size. The decoder processes the queries for object detection and temporal association separately but efficiently, allowing dense and sparse representations to coexist within the network architecture. Extensive ablation studies further support the design choices made for the QLN and the transformer decoder structures.
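The linear-complexity claim rests on deformable attention in the style of Deformable DETR: each query attends to a small fixed set of K sampled points rather than to every spatial location. A single-scale, single-head sketch (simplified relative to the real multi-scale, multi-head operator; all names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    """Each query attends to K sampled points instead of all H*W locations."""
    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Linear(dim, n_points * 2)  # where to sample
        self.weight_head = nn.Linear(dim, n_points)      # how much each sample counts

    def forward(self, queries, ref_points, value):
        # queries: (B, N, D); ref_points: (B, N, 2) in [-1, 1]; value: (B, D, H, W)
        B, N, _ = queries.shape
        offsets = self.offset_head(queries).view(B, N, self.n_points, 2)
        locs = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)    # (B, N, K, 2)
        sampled = F.grid_sample(value, locs, align_corners=False)  # (B, D, N, K)
        weights = self.weight_head(queries).softmax(dim=-1)        # (B, N, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)         # (B, D, N)
        return out.transpose(1, 2)                                 # (B, N, D)
```

Because each of the N queries touches only K points, attention cost grows as O(N·K) rather than O(N·H·W), which is what makes dense pixel-level queries tractable in the first place.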
Empirical Performance
The empirical evaluation of TransCenter on the MOT17 and MOT20 benchmarks demonstrates its superiority over state-of-the-art methods, achieving higher Multiple-Object Tracking Accuracy (MOTA) scores across several settings. Notably, TransCenter substantially reduces both false negatives and false positives, striking a fine balance between detection coverage and noise suppression, a critical advance given the intricacies of crowded scenes. The paper also highlights TransCenter's efficiency, which stems from the careful division between dense detection and sparse tracking queries together with its query learning and attention mechanisms.
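For reference, MOTA is defined in the CLEAR MOT metrics and aggregates exactly the error types the paper targets, false negatives (FN), false positives (FP), and identity switches (IDSW), summed over all frames t and normalized by the number of ground-truth objects:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

Reducing false negatives and false positives therefore raises MOTA directly, which is why the balance between detection coverage and noise suppression is the decisive quantity in these comparisons.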
Theoretical and Practical Implications
The adoption of dense detection queries in TransCenter opens a new avenue for deploying transformers in computer vision tasks, particularly when objects densely populate a scene. The paper's insight into the utility of dense, image-sized queries challenges the conventional reliance on sparse representations within transformer architectures. From a theoretical standpoint, this approach not only mitigates missed detections in crowded areas but also suggests a path toward lowering the false detection rate through effective attention over global context.
Practically, these advancements make TransCenter a compelling choice for real-world applications where scene congestion is expected, such as urban surveillance and autonomous driving, domains that stand to benefit from TransCenter's robustness and efficiency in handling multiple-object scenarios.
Future Prospects and Speculation
The success illustrated by TransCenter in integrating transformer architectures with dense representations lays the groundwork for future research directions. Prospective studies might explore further optimization of the trade-off between dense and sparse query representations, potentially leading to even more efficient model architectures. Another exciting dimension relates to the adaptation of similar dense query mechanisms to other tasks involving complex spatiotemporal dynamics, such as video captioning or action recognition, where capturing nuanced inter-object interactions is paramount.
Overall, the research offers a substantial contribution to the MOT field, promoting further exploration of transformers combined with dense visual outputs, questioning existing paradigms, and setting new benchmarks in tracking accuracy and efficiency.