- The paper introduces TallFormer, a novel transformer that leverages a long-memory mechanism to enable end-to-end temporal action localization on full-resolution videos without freezing the backbone.
- It integrates a long-memory module with a temporal consistency module for efficient feature caching and accurate temporal boundary localization.
- TallFormer outperforms previous models, achieving 59.1% average mAP on THUMOS14 and 35.6% on ActivityNet-1.3 using RGB inputs alone.
TallFormer: Temporal Action Localization with a Long-memory Transformer
Overview
The paper "TallFormer: Temporal Action Localization with a Long-memory Transformer," authored by Feng Cheng and Gedas Bertasius from the University of North Carolina at Chapel Hill, addresses the challenge of temporal action localization (TAL) in untrimmed videos with a novel approach called TallFormer. Traditional methodologies decompose the TAL problem into two components: short-term feature extraction and long-range temporal boundary localization. However, due to computational resource constraints, particularly GPU memory, existing strategies compromise feature extraction by either freezing the backbone or lowering the spatial video resolution. This paper introduces a memory-efficient solution that enables end-to-end training on full spatial resolution videos without such compromises.
Technical Contributions
TallFormer's long-memory mechanism avoids re-encoding most of a video's clips in each training iteration, which sharply reduces GPU memory consumption. Key contributions include:
- Short-term Transformer Encoder: Samples only a fraction of a video's short clips for fresh feature extraction in each iteration, reducing the computational burden while keeping the backbone trainable at full spatial resolution.
- Long Memory Module (LMM): Incorporates a long-term memory caching system that holds previously computed features, enabling efficient feature extraction without repetitive computation or excessive backpropagation.
- Temporal Consistency Module: Reduces the distribution mismatch between newly computed and cached features, keeping the feature sequence coherent across the video timeline.
- Temporal Boundary Localization Module: Jointly predicts action boundaries and action classes, removing the dependence on the external classifiers that many prior TAL approaches require for competitive accuracy.
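The caching idea behind the first two contributions can be illustrated with a minimal sketch. This is not the authors' implementation: `dummy_encoder`, `training_step`, and the dict-based cache are hypothetical stand-ins that only show the control flow, namely that each iteration encodes a sampled subset of clips with the trainable backbone and pulls the remaining clip features from the long-memory cache.

```python
import random

def dummy_encoder(clip):
    # Stand-in for a trainable short-term video transformer backbone.
    return [x * 2.0 for x in clip]

def training_step(video_clips, memory, sample_ratio=0.3):
    """Encode a sampled subset of clips; fetch the rest from the cache."""
    n = len(video_clips)
    k = max(1, int(n * sample_ratio))
    sampled = set(random.sample(range(n), k))

    features = []
    for i, clip in enumerate(video_clips):
        if i in sampled:
            feat = dummy_encoder(clip)   # fresh forward pass; gradients flow here
            memory[i] = feat             # refresh the cached entry
        else:
            feat = memory[i]             # cheap lookup, no backbone compute
        features.append(feat)
    return features

# Usage: a toy "video" of 5 clips, each a short list of frame values.
video = [[float(i)] * 4 for i in range(5)]
cache = {i: dummy_encoder(c) for i, c in enumerate(video)}  # warm-started cache
feats = training_step(video, cache, sample_ratio=0.4)
```

Because only the sampled clips are backpropagated through, peak GPU memory scales with the sample size rather than the full video length, which is what lets the backbone stay unfrozen at full resolution.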
Experimental Results
The proposed model demonstrates significant performance improvements, achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3, outperforming prior state-of-the-art models by 7.1% and 1.2%, respectively. Notably, TallFormer delivers these results using RGB inputs alone, without optical flow, streamlining the processing pipeline while maintaining superior performance.
Implications and Future Prospects
The implications of TallFormer are both practical and theoretical. Practically, it introduces a new paradigm for TAL tasks that balances computational efficiency and accuracy, making it suitable for deployment on constrained hardware. Theoretically, the approach exemplifies how long-term memory caching can address computational inefficiencies in sequential data processing. Future work could extend this technique beyond video recognition to other tasks requiring extensive temporal feature extraction.
In conclusion, the research demonstrates how marrying transformer-based architectures with memory-efficient mechanisms can yield tangible advances in video understanding. As AI models continue to grow in size and complexity, the efficient memory utilization demonstrated in TallFormer will likely underpin future progress in large-scale temporal action analysis and broader sequential reasoning applications.