Context-Aware RCNN: A Baseline for Action Detection in Videos
Introduction
This paper introduces Context-Aware RCNN, a framework that improves action detection in videos. It builds on conventional actor-centric action recognition pipelines, which typically adopt the RoI-Pooling strategy from Faster R-CNN. The authors present empirical evidence that the resolution of actors' bounding boxes critically affects recognition accuracy, especially when actors occupy small regions of the video frame. They therefore advocate a method that preserves spatial detail even for small actor boxes, challenging the dominant paradigm in video action detection.
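To make the resolution issue concrete, here is a minimal PyTorch sketch contrasting RoI-Align on downsampled backbone features with the crop-then-resize alternative the paper favors. The frame size, stride, box coordinates, and channel counts are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

# Hypothetical setup: a backbone with total stride 16 maps a 400x400 frame
# to a 25x25 feature map. A 64x64-pixel actor box then covers only ~4x4
# feature cells, so RoI-Align must interpolate a 7x7 output from that
# tiny region -- most spatial detail is already gone.
features = torch.randn(1, 256, 25, 25)                   # stride-16 feature map
box = torch.tensor([[0.0, 100.0, 100.0, 164.0, 164.0]])  # (batch_idx, x1, y1, x2, y2)
pooled = roi_align(features, box, output_size=7, spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]), carrying only ~4x4 cells of signal

# The alternative argued for here: crop the actor from the raw frame and
# resize it to full input resolution *before* feature extraction, so the
# network sees the actor at high resolution regardless of box size.
frame = torch.randn(1, 3, 400, 400)
actor_crop = frame[:, :, 100:164, 100:164]
actor_input = F.interpolate(actor_crop, size=(224, 224),
                            mode="bilinear", align_corners=False)
# actor_input is then fed through the backbone as its own clip.
```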
Key Contributions
The paper's primary contributions are twofold:
- Critique of RoI-Pooling: The authors present evidence that conventional RoI-Pooling-based action detection pipelines lose discriminative spatial detail when actor bounding boxes are small. This observation is pivotal for fine-grained discrimination between visually similar actions, such as "eat" versus "smoke."
- Context-Aware RCNN Framework: As a solution, the authors revive an RCNN-style pipeline that crops and resizes actor regions from the input frames before feature extraction, preserving spatial detail regardless of actor size. They further incorporate context through scene features and long-term features, yielding a state-of-the-art baseline for action detection (a fusion sketch follows this list).
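The following is a minimal sketch of the fusion idea: per-actor features from the cropped-and-resized actor clips are combined with globally pooled scene features before classification. The dimensions, class names, and plain concat-then-linear head are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ActorWithSceneContext(nn.Module):
    """Toy fusion head: concatenate each actor's features with shared
    scene-context features, then classify. Long-term features could be
    appended the same way; they are omitted here for brevity."""

    def __init__(self, feat_dim: int = 2048, num_classes: int = 80):
        super().__init__()
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, actor_feats: torch.Tensor, scene_feat: torch.Tensor) -> torch.Tensor:
        # actor_feats: (num_actors, feat_dim), one row per detected actor
        # scene_feat:  (feat_dim,), pooled over the whole clip, shared by all actors
        scene = scene_feat.unsqueeze(0).expand(actor_feats.size(0), -1)
        fused = torch.cat([actor_feats, scene], dim=1)
        return self.classifier(fused)  # (num_actors, num_classes) action logits

# Toy usage: 3 detected actors, 2048-d backbone features.
head = ActorWithSceneContext()
logits = head(torch.randn(3, 2048), torch.randn(2048))
print(logits.shape)  # torch.Size([3, 80])
```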
Experimental Results
Evaluation on the AVA and JHMDB benchmarks demonstrates the efficacy of Context-Aware RCNN. On AVA, a multi-label dataset demanding actor-centric recognition, the approach achieves an mAP of 28.0%, outperforming the Long-Term Feature Bank (LFB) by 2.2 points under a comparable setup (I3D ResNet-50 backbone pre-trained on Kinetics-400). On JHMDB, Context-Aware RCNN achieves an mAP of 79.2%, surpassing prior work that uses both RGB and optical flow.
Implications and Future Work
The findings call for a reappraisal of the role of spatial resolution in video-based action recognition. Preserving resolution sharpens the model's sensitivity to fine details, which is essential for distinguishing closely related actions, and proper integration of contextual information emerges as another crucial factor in detection performance.
Future research directions suggested by this work include more effective ways to capture and leverage context information. The authors also point to the need for models that can pick out the fine details characterizing human activities, suggesting that innovation in feature extraction and representation may yield further advances in video action detection.
Conclusion
The paper makes a compelling argument for a shift in how researchers approach actor-centric action recognition in videos. By presenting a straightforward yet powerful baseline, Context-Aware RCNN is poised to inspire further development within the field. The promising results open avenues for future exploration of high-resolution feature preservation and richer context incorporation, pointing to significant potential advances in action detection methodology.