End-to-End Video Object Detection with Spatial-Temporal Transformers
The paper "End-to-End Video Object Detection with Spatial-Temporal Transformers" presents TransVOD, an approach to video object detection (VOD) that uses spatial-temporal Transformer architectures to eliminate the many hand-crafted components typically required in this domain. The work builds on DETR and Deformable DETR, extending them to video sequences and addressing challenges common in VOD such as motion blur, frame defocus, and occlusion.
Conceptual Overview
TransVOD recasts the conventional VOD pipeline, traditionally built from complex, separate stages for feature aggregation and post-processing, as a single end-to-end learning task. It draws on the recent success of Transformers in vision tasks, using their ability to capture long-range dependencies to model spatio-temporal context across video frames.
The key innovation of TransVOD is its temporal Transformer, which aggregates both the spatial object queries and the feature memories produced for each frame. It comprises three principal components (a minimal sketch of how they fit together follows the list):
- Temporal Deformable Transformer Encoder (TDTE): This encoder aggregates the feature memories of multiple frames. Because deformable attention attends only to a small set of sampling locations around each reference point, it handles spatial redundancy efficiently while avoiding interference from background clutter.
- Temporal Query Encoder (TQE): This encoder aggregates instance-aware object queries across frames through a stack of self-attention layers, exploiting the temporal continuity of object appearances to improve detection.
- Temporal Deformable Transformer Decoder (TDTD): This decoder produces the final detections for the current frame by attending the aggregated queries from the TQE to the aggregated feature memory from the TDTE.
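To make the three components concrete, here is a minimal PyTorch sketch of how they could fit together. This is an illustration under simplifying assumptions, not the authors' implementation: the class names, dimensions (D_MODEL, Q, etc.), and layer counts are made up, plain multi-head attention stands in for the deformable attention used in the real TDTE and TDTD, and the per-frame spatial DETR stage is assumed to have already produced the feature memories and object queries (random tensors here).

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS = 256, 8  # illustrative sizes, not the paper's exact config


class TemporalDeformableEncoder(nn.Module):
    """Stand-in for TDTE: fuses the flattened feature memories of several
    frames. The real model uses temporal *deformable* attention, which attends
    only to a few sampled points around each reference point; plain multi-head
    self-attention is substituted here for simplicity."""
    def __init__(self, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, memories):                   # (B, T*HW, D)
        return self.encoder(memories)


class TemporalQueryEncoder(nn.Module):
    """Stand-in for TQE: self-attention over the concatenated per-frame object
    queries, so queries for the same instance reinforce each other."""
    def __init__(self, n_layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, queries):                    # (B, T*Q, D)
        return self.encoder(queries)


class TemporalDecoder(nn.Module):
    """Stand-in for TDTD: cross-attends the aggregated queries to the
    aggregated memory and predicts boxes/classes for the current frame."""
    def __init__(self, n_layers=2, n_classes=31):  # 30 VID classes + background
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=1024, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.class_head = nn.Linear(D_MODEL, n_classes)
        self.box_head = nn.Linear(D_MODEL, 4)      # (cx, cy, w, h), normalized

    def forward(self, queries, memory):
        hs = self.decoder(queries, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()


# Usage: T frames, each already processed by a per-frame spatial DETR-style
# encoder/decoder that yields a feature memory and a set of output queries.
B, T, HW, Q = 2, 4, 100, 50                        # batch, frames, tokens, queries
memories = torch.randn(B, T * HW, D_MODEL)         # concatenated frame memories
queries = torch.randn(B, T * Q, D_MODEL)           # concatenated frame queries

fused_memory = TemporalDeformableEncoder()(memories)
fused_queries = TemporalQueryEncoder()(queries)
# Decode using the current (last) frame's slice of the aggregated queries.
logits, boxes = TemporalDecoder()(fused_queries[:, -Q:], fused_memory)
print(logits.shape, boxes.shape)                   # (2, 50, 31), (2, 50, 4)
```

Keeping the temporal aggregation separate from the per-frame spatial stage mirrors the paper's design: each frame is first detected independently, and only the resulting compact queries and memories are fused across time.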
Experimental Results
TransVOD was empirically validated on the ImageNet VID benchmark, where it outperforms several state-of-the-art video object detectors and improves on its Deformable DETR baseline by a large margin (3-4% mAP). Specifically, TransVOD achieves 79.9% mAP with a ResNet-50 backbone and 81.9% mAP with ResNet-101, demonstrating effective use of temporal information without any complex post-processing.
Implications and Future Work
This research has significant implications for future VOD systems: it suggests that end-to-end sequence prediction models can match or exceed state-of-the-art performance without intricate hand-crafted processing. The paper's use of Transformers to unify spatial and temporal modeling also paves the way for extending the methodology to broader video understanding tasks, such as video instance segmentation or action recognition.
Future research could optimize the Transformer architecture specifically for video modalities or scale the model to higher-resolution inputs. Another avenue is integration with semi-supervised or unsupervised learning paradigms to improve detection robustness in unseen real-world scenarios. Additionally, investigating the architecture's convergence behavior and resource efficiency could support deployment in resource-constrained environments or on edge devices.
In conclusion, TransVOD is a significant step toward simpler and stronger video object detection. By combining spatial and temporal Transformers in a single framework, it offers a potent design that could inspire further advances and applications across the spectrum of video analysis.