Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition (2010.11757v4)

Published 22 Oct 2020 in cs.CV

Abstract: In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress they have made. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides common ground for a fair comparison. We then conduct a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in efficiency for action recognition, but not in accuracy; and b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation ability and transferability. Our code is available at https://github.com/IBM/action-recognition-pytorch.

Authors (7)

Chun-Fu Chen (28 papers)
Rameswar Panda (79 papers)
Kandan Ramakrishnan (8 papers)
Rogerio Feris (105 papers)
John Cohn (4 papers)
Aude Oliva (42 papers)
Quanfu Fan (22 papers)
