
An Image is Worth 16x16 Words, What is a Video Worth? (2103.13915v2)

Published 25 Mar 2021 in cs.CV

Abstract: Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach $80.5$ top-1 accuracy with $\times 30$ less frames per video, and $\times 40$ faster inference than the current leading method. Code is available at: https://github.com/Alibaba-MIIL/STAM
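
To make the described approach concrete, below is a minimal PyTorch sketch of the idea in the abstract: embed each sparsely sampled frame with an image encoder, then apply a temporal transformer with global self-attention across the frame embeddings. Module names, sizes, and the simplified frame encoder are illustrative assumptions, not the authors' implementation (see the linked STAM repository for that).

```python
# Minimal sketch of a frame-embedding + temporal-transformer classifier.
# Everything here (names, dimensions, the toy frame encoder) is illustrative;
# refer to the STAM repository for the actual architecture.
import torch
import torch.nn as nn

class TemporalAggregationModel(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 768,
                 temporal_layers: int = 6, num_heads: int = 8):
        super().__init__()
        # Stand-in for a ViT-style spatial encoder mapping one RGB frame to a
        # single embedding vector (e.g., its [CLS] token).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # 16x16 patches
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        # Global self-attention over all frames: a few frames sampled across
        # the whole video can attend to each other directly, which is what
        # lets the model use far fewer frames at inference.
        self.temporal_transformer = nn.TransformerEncoder(layer, temporal_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)
        # Note: a real model would also add temporal position embeddings,
        # omitted here for brevity.

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, 3, H, W), with frames sampled sparsely across
        # the full video rather than as dense short clips.
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)                        # (b*t, 3, H, W)
        tokens = self.frame_encoder(frames).view(b, t, -1)  # (b, t, embed_dim)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        tokens = self.temporal_transformer(tokens)
        return self.head(tokens[:, 0])                      # classify from CLS

# Example: 2 videos, 16 frames each, 224x224 RGB -> Kinetics-400 logits.
logits = TemporalAggregationModel(num_classes=400)(torch.randn(2, 16, 3, 224, 224))
```

Because attention here is global over the whole video, a handful of frames spread across its full length can interact directly, which is what allows the frame count at inference to drop so sharply.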

Evaluation of a Novel Action Recognition Model on UCF101 and Charades Datasets

The paper presents a detailed experimental evaluation of a new action recognition model on two widely used benchmarks, UCF101 and Charades, both standard testbeds for action recognition algorithms in computer vision. Its focus is the trade-off between computational efficiency and accuracy, with the aim of informing the design of balanced models for practical video-analytics applications.

UCF101 Dataset Evaluation

The UCF101 dataset, a popular action recognition benchmark spanning 101 action categories, is evaluated first. The paper reports comprehensive results comparing the model with existing approaches, emphasizing two metrics: Top-1 accuracy and inference throughput in videos per second (VPS).

  • Performance Results: The proposed model, STAM-32, achieves a Top-1 accuracy of 97.0%, surpassing existing models such as I3D (95.6%) and S3D (96.8%). The gain over prior methods is marginal, but it is delivered at a competitive runtime.
  • Computational Efficiency: The model's throughput of 10 VPS places it between the slower FASTER32 at 2.8 VPS and the faster ECO at 20.8 VPS, making it a viable option in contexts where computational cost is a consideration (see the quick calculation after this list).
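
As a quick sanity check, the snippet below converts the quoted VPS figures into per-video latency and relative speedups; this is plain arithmetic over the numbers reported above, not measurements reproduced from the paper.

```python
# Convert the quoted throughput figures (VPS = videos per second, higher is
# faster) into per-video latency and relative speedup. Numbers are the ones
# cited above; nothing here is newly measured.
rates = {"FASTER32": 2.8, "STAM-32": 10.0, "ECO": 20.8}  # videos per second

for name, vps in rates.items():
    print(f"{name:9s} {vps:5.1f} VPS -> {1.0 / vps:.3f} s per video")

# STAM-32 is ~3.6x faster than FASTER32 and ~2.1x slower than ECO.
print(f"STAM-32 vs FASTER32: {rates['STAM-32'] / rates['FASTER32']:.1f}x faster")
print(f"ECO vs STAM-32:      {rates['ECO'] / rates['STAM-32']:.1f}x faster")
```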

These results highlight the model’s effective balance of efficiency and performance, a noteworthy consideration for deployment in real-time systems where resource constraints are prevalent.

Charades Dataset Evaluation

The Charades dataset, known for its complexity due to multiple labels per video and longer-range temporal dependencies, presents a different set of challenges. Here the paper compares its model's performance against state-of-the-art methods.

  • Performance Results: While the STAM-64 model does not reach state-of-the-art accuracy on Charades (39.7% mAP, the dataset's standard multi-label metric), it requires far less computation. Cost, reported as per-view GFLOPs × number of views, is 1040 × 1 G, versus 234 × 30 G for SlowFast; because STAM evaluates a single view, its total compute is roughly 6.8× lower (see the calculation after this list).
  • Focus on Efficiency: Despite not reaching the highest accuracy, the paper emphasizes the model's efficiency. This trade-off is critical for applications that prioritize rapid processing over peak precision, such as preliminary filtering in video surveillance.
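
The efficiency claim follows directly from multiplying per-view cost by the number of views, as this short calculation shows (figures are the ones quoted above):

```python
# Total inference compute = per-view GFLOPs x number of views evaluated per
# video. STAM-64 uses a single view, so its total is far lower even though
# its per-view cost is higher. Figures are those quoted above.
methods = {"STAM-64": (1040, 1), "SlowFast": (234, 30)}  # (GFLOPs/view, views)

for name, (gflops, views) in methods.items():
    total = gflops * views
    print(f"{name:9s} {gflops:5d} GFLOPs x {views:2d} views = {total:5d} GFLOPs")

# STAM-64: 1040 GFLOPs vs SlowFast: 7020 GFLOPs, roughly a 6.8x reduction.
```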

Implications and Future Directions

The findings from this paper underscore a central tension in video action recognition: balancing accuracy against computational feasibility. The methodology suggests promising avenues for engineering algorithms suited to operational environments with constrained computational budgets.

From a theoretical standpoint, this research may motivate further exploration into optimizing neural architectures, advancing techniques such as model compression, and improving efficiency without significant sacrifices in recognition performance.

Practical implications include the potential applicability of these models in scenarios like automated video content analysis, where both efficiency and accuracy are paramount. Future research can delve into adaptive systems that dynamically balance these aspects based on contextual demands or hardware constraints.

In conclusion, the paper contributes an incremental but rigorous advance: a careful evaluation of the new model's efficacy that should motivate subsequent research on harmonizing computational efficiency with recognition accuracy in action recognition systems.

Authors (3)
  1. Gilad Sharir (7 papers)
  2. Asaf Noy (17 papers)
  3. Lihi Zelnik-Manor (30 papers)
Citations (110)