- The paper introduces RTD-Net, a Transformer-based framework that adapts DETR for direct temporal action proposal generation in videos.
- RTD-Net incorporates a boundary attentive module and a relaxed matching scheme to address challenges specific to video data, such as feature oversmoothing and sparse, ambiguous temporal annotations.
- Experiments show RTD-Net achieves state-of-the-art performance on the THUMOS14 and ActivityNet-1.3 benchmarks without requiring NMS post-processing.
Overview of Relaxed Transformer Decoders for Direct Action Proposal Generation
This paper introduces RTD-Net, a Transformer-like framework designed for direct action proposal generation in video understanding. It tackles the complexities of temporal action proposal generation while circumventing the limitations of conventional methods that rely on anchor windows or bottom-up boundary-matching strategies. The authors propose three significant modifications to the standard DETR framework that adapt it to video tasks, addressing the intrinsic challenges of modeling visual features and temporal boundaries in video data.
Key Improvements over Existing Frameworks
- Boundary Attentive Module: The first change replaces the Transformer encoder with a boundary attentive module. It is motivated by the need to capture long-range temporal information effectively while mitigating the oversmoothing that arises from the slowness prior of video data, in which neighboring frames are highly similar. By accentuating boundary features, the module ensures that crucial temporal segments are adequately represented (see the first sketch after this list).
- Relaxed Matching Scheme: The second improvement is a relaxed matching scheme, a response to the often ambiguous and sparse temporal annotations in video datasets. By applying a more lenient criterion for matching predictions to ground-truth instances, the framework relaxes the strict one-to-one assignment requirement, improving the convergence and generalization of the model (see the second sketch below).
- Three-Branch Detection Head: Lastly, the authors introduce a three-branch head to improve proposal confidence estimation through an explicit completeness prediction. This models the quality of temporal localization directly, capturing both the presence of an action and its extent of overlap with the ground truth (see the third sketch below).
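To make the boundary attentive idea concrete, here is a minimal PyTorch sketch of one plausible design: frame-level features are re-weighted by predicted boundary probabilities so that frames near likely starts and ends stand out instead of being smoothed away. The convolutional predictor and the residual weighting are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BoundaryAttentiveModule(nn.Module):
    """Sketch of a boundary attentive module: frame features are re-weighted
    by predicted start/end probabilities, accentuating likely boundaries."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Hypothetical 1-D conv predictor of per-frame start/end probabilities.
        self.boundary_predictor = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, 2, kernel_size=1),  # start and end channels
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, T) clip-level features from a video backbone.
        probs = self.boundary_predictor(feats)             # (batch, 2, T)
        saliency = probs.max(dim=1, keepdim=True).values   # (batch, 1, T)
        # Emphasize frames with high boundary probability; the residual path
        # keeps non-boundary content from being discarded entirely.
        return feats * (1.0 + saliency)
```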
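The relaxed matching scheme can be sketched as a standard Hungarian assignment plus a relaxation step: predictions that were not assigned a ground truth but still overlap one strongly are treated as positives rather than background. The tIoU threshold of 0.7 below is a placeholder, not the paper's value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def temporal_iou(pred, gt):
    """tIoU between a predicted (start, end) segment and a ground-truth one."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def relaxed_match(preds, gts, tiou_thresh=0.7):
    """One-to-one Hungarian matching, relaxed so that unmatched predictions
    with high tIoU against any ground truth are also kept as positives
    instead of being pushed toward the background class.

    preds, gts: lists of (start, end) segments.
    Returns (matched_pairs, relaxed_positive_pred_indices).
    """
    if len(gts) == 0:
        return [], []
    # Matching cost: 1 - tIoU (lower cost = better overlap).
    cost = np.array([[1.0 - temporal_iou(p, g) for g in gts] for p in preds])
    rows, cols = linear_sum_assignment(cost)
    matched = list(zip(rows.tolist(), cols.tolist()))
    assigned = set(rows.tolist())
    # Relaxation: extra well-overlapping predictions escape the background label.
    relaxed = [
        i for i, p in enumerate(preds)
        if i not in assigned and max(temporal_iou(p, g) for g in gts) >= tiou_thresh
    ]
    return matched, relaxed
```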
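Finally, the three-branch head can be sketched as three small feed-forward branches over the decoder's query embeddings: boundary regression, binary action classification, and a completeness score estimating the tIoU with the best-matching ground truth. The layer sizes and the normalized (center, width) parameterization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ThreeBranchHead(nn.Module):
    """Sketch of a three-branch detection head on top of decoder embeddings:
    boundary regression, action/background classification, and a completeness
    branch that approximates the tIoU with the best-matching ground truth."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.boundary_branch = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 2), nn.Sigmoid(),   # normalized (center, width)
        )
        self.cls_branch = nn.Linear(embed_dim, 2)    # action vs. background logits
        self.completeness_branch = nn.Sequential(
            nn.Linear(embed_dim, 1), nn.Sigmoid(),   # predicted tIoU in [0, 1]
        )

    def forward(self, queries: torch.Tensor):
        # queries: (batch, num_queries, embed_dim) Transformer decoder outputs.
        segments = self.boundary_branch(queries)          # (B, Q, 2)
        cls_logits = self.cls_branch(queries)             # (B, Q, 2)
        completeness = self.completeness_branch(queries)  # (B, Q, 1)
        return segments, cls_logits, completeness
```

Scoring proposals by both the classification confidence and the predicted completeness lets the model rank overlapping candidates directly, which is what makes NMS-free inference plausible.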
Experimental Validation and Performance
The effectiveness of RTD-Net is validated through extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks, where it performs strongly on both temporal action proposal generation and temporal action detection. Notably, RTD-Net achieves state-of-the-art results without relying on non-maximum suppression (NMS) post-processing, underscoring its efficiency in generating high-quality proposals.
Implications and Future Directions
The advancements presented in this paper have several implications for the development of AI systems capable of nuanced video understanding. Practically, the proposed framework offers a streamlined and efficient mechanism for action detection in untrimmed videos, suitable for large-scale video datasets prevalent in real-world applications. Theoretically, RTD-Net exemplifies the adaptability of Transformer architectures beyond conventional NLP tasks, paving the way for further exploration of Transformers in diverse domains requiring temporal sequence modeling.
Looking forward, future research could focus on refining the relaxation criteria to further enhance the robustness of the framework across varying data distributions and complexity levels. Additionally, integrating more sophisticated attention mechanisms could further optimize the extraction of meaningful temporal features from high-dimensional video data.
Overall, the paper provides a compelling contribution to video understanding in computer vision, advancing the capabilities of direct action proposal generation models, while opening new avenues for Transformer-based innovation in AI-driven video analytics.