End-to-End Temporal Action Detection with Transformer
The paper introduces TadTR, a Transformer-based framework for temporal action detection (TAD) that targets the complexity of conventional TAD methods. Historically, temporal action detection has relied on multi-stage pipelines built around hand-designed operations such as non-maximum suppression (NMS) and anchor generation; these pipelines limit flexibility and prevent end-to-end learning.
Key Contributions
- End-to-End Design: TadTR is an end-to-end model that simplifies the TAD pipeline by eliminating hand-crafted components and intermediate stages. Following the set-prediction formulation of the Detection Transformer (DETR), it predicts action instances directly from learnable embeddings known as action queries, each matched one-to-one to a ground-truth action during training (see the matching sketch after this list).
- Temporal Deformable Attention (TDA): A novel attention mechanism tailored to temporal action detection. Rather than attending densely over the entire video, each query attends to a sparse set of video snippets around a reference point, enhancing locality awareness while keeping computation efficient (a minimal sketch follows this list).
- Adaptation of the Transformer for TAD: The Transformer architecture is adapted to the TAD task through a temporal context encoder, segment refinement, and an actionness regression head that refines the temporal boundaries and confidence scores of predicted action instances (see the prediction-head sketch below).
- State-of-the-Art Performance: TadTR achieves strong results on multiple benchmarks, outperforming prior state-of-the-art methods on THUMOS14 and HACS Segments with mean Average Precision (mAP) of 56.7% and 32.09%, respectively. The model also handles different amounts of temporal context efficiently, which contributes to its competitive TAD performance.
- Comparison to Existing Methods: Unlike traditional methods, which chain multiple networks and rely on post-processing steps such as NMS, TadTR predicts action instances directly. This removes redundant predictions and post-processing, reducing computation cost while improving flexibility and performance, as reflected in its strong runtime results.
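
The set-prediction training signal in DETR-style detectors comes from one-to-one bipartite matching between queries and ground-truth instances. Below is a minimal sketch of such matching for temporal segments; the function name, cost terms, and weights (`w_cls`, `w_l1`) are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_actions(class_prob, pred_segments, gt_labels, gt_segments,
                             w_cls=1.0, w_l1=1.0):
    """One-to-one matching of Q action queries to N ground-truth actions.

    class_prob:    (Q, K) per-class probabilities for each query
    pred_segments: (Q, 2) predicted (center, width), normalized to [0, 1]
    gt_labels:     (N,)   ground-truth class indices
    gt_segments:   (N, 2) ground-truth (center, width)
    """
    # Classification cost: negative probability of the ground-truth class.
    cost_cls = -class_prob[:, gt_labels]                                # (Q, N)
    # Localization cost: L1 distance between predicted and GT segments.
    cost_l1 = np.abs(pred_segments[:, None, :] - gt_segments[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_l1 * cost_l1                            # (Q, N)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))  # (query, gt) pairs
```

Losses are then computed only between matched pairs, with unmatched queries pushed toward a "no action" class; this one-to-one assignment is what lets the model emit a sparse set of detections without NMS.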
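
To make the sparse attention concrete, here is a minimal single-head, single-level sketch of temporal deformable attention in PyTorch. The class name, the offset scaling, and the use of `grid_sample` for linear interpolation are assumptions made for illustration; the paper's multi-head, multi-point implementation differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDeformableAttention(nn.Module):
    """Single-head temporal deformable attention (illustrative sketch).

    Each query predicts K sampling offsets around its reference point on the
    1-D temporal axis, gathers features at those locations by linear
    interpolation, and combines them with learned attention weights.
    """

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points)   # sampling offsets
        self.weight_proj = nn.Linear(dim, num_points)   # attention weights
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, features):
        # queries:    (B, Q, C)   query embeddings
        # ref_points: (B, Q)      normalized reference locations in [0, 1]
        # features:   (B, T, C)   temporal feature sequence (video snippets)
        B, T, C = features.shape
        values = self.value_proj(features)                       # (B, T, C)

        offsets = self.offset_proj(queries)                      # (B, Q, K)
        weights = self.weight_proj(queries).softmax(dim=-1)      # (B, Q, K)

        # Sampling locations on the temporal axis; offsets are scaled by the
        # sequence length T, then mapped to [-1, 1] for grid_sample.
        loc = ref_points.unsqueeze(-1) + offsets / T             # (B, Q, K)
        grid = loc * 2.0 - 1.0

        # Treat the sequence as a (B, C, 1, T) image and sample K points per
        # query with bilinear (here effectively linear) interpolation.
        v = values.transpose(1, 2).unsqueeze(2)                  # (B, C, 1, T)
        g = torch.stack([grid, torch.zeros_like(grid)], dim=-1)  # (B, Q, K, 2)
        sampled = F.grid_sample(v, g, align_corners=False)       # (B, C, Q, K)

        out = (sampled * weights.unsqueeze(1)).sum(-1)           # (B, C, Q)
        return self.out_proj(out.transpose(1, 2))                # (B, Q, C)
```

Each of the Q queries thus reads only `num_points` interpolated snippet features instead of all T, so the attention cost scales with the number of sampled points rather than with video length.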
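
Finally, a hedged sketch of how the prediction heads could be wired on top of the decoder: classification, (center, width) segment regression, and an actionness regressor that rescores each segment from features pooled inside its predicted boundaries. The mean pooling over in-segment snippets and all names here are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TadPredictionHeads(nn.Module):
    """Illustrative DETR-style prediction heads for TAD (a sketch)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no action"
        self.segment_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))
        self.actionness_head = nn.Linear(dim, 1)

    def forward(self, query_feats, snippet_feats):
        # query_feats:   (B, Q, C) decoder outputs, one per action query
        # snippet_feats: (B, T, C) encoder features over the video timeline
        logits = self.class_head(query_feats)                    # (B, Q, K+1)
        segments = self.segment_head(query_feats).sigmoid()      # (B, Q, 2)

        # Pool snippet features inside each predicted segment and regress an
        # actionness score that refines the segment's confidence.
        B, T, C = snippet_feats.shape
        centers, widths = segments.unbind(-1)                    # (B, Q) each
        t = torch.linspace(0, 1, T, device=snippet_feats.device) # (T,)
        inside = ((t.view(1, 1, T) >= (centers - widths / 2).unsqueeze(-1)) &
                  (t.view(1, 1, T) <= (centers + widths / 2).unsqueeze(-1))).float()
        denom = inside.sum(-1, keepdim=True).clamp(min=1.0)      # avoid /0
        pooled = torch.einsum('bqt,btc->bqc', inside, snippet_feats) / denom
        actionness = self.actionness_head(pooled).sigmoid().squeeze(-1)  # (B, Q)

        return logits, segments, actionness
```

A final confidence per detection can then combine the class probability with the actionness score (e.g., their geometric mean); since each query yields exactly one instance, no NMS step is needed.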
Implications and Future Directions
This research contributes theoretically by rethinking TAD around action query embeddings and learned context modeling over temporal sequences. Practically, the sparse, directly predicted detections make the framework well suited to video-based applications such as automatic editing, surveillance, and content recommendation systems.
Future work may explore joint optimization of the video encoder and the detection head, leveraging TadTR's end-to-end design. In addition, while TadTR sets a strong baseline for Transformer-based TAD, challenges such as videos with a high density of actions and very short actions leave room to improve detection accuracy on complex, varied datasets.
By refining how video context is used and weighing accuracy against computational cost, the paper highlights the Transformer's potential as an architecture for direct sequence-to-action prediction in video understanding, and positions TadTR as a step toward simpler, stronger temporal action detection for broader applications.