
Convolutional Two-Stream Network Fusion for Video Action Recognition (1604.06573v2)

Published 22 Apr 2016 in cs.CV

Abstract: Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

Authors (3)
  1. Christoph Feichtenhofer (52 papers)
  2. Axel Pinz (6 papers)
  3. Andrew Zisserman (248 papers)
Citations (2,555)

Summary

Convolutional Two-Stream Network Fusion for Video Action Recognition

The research presented in this paper explores multiple methodologies for integrating spatial and temporal information through convolutional neural networks (ConvNets) for the purpose of video action recognition. The paper proposes a novel architecture that addresses key challenges in action recognition by improving the fusion strategy between spatial (appearance) and temporal (motion) cues. The proposed ConvNet architecture achieves state-of-the-art performance on standard benchmarks such as UCF-101 and HMDB51.

Key Contributions

The paper provides several key contributions:

  1. Spatial Fusion at Convolutional Layers: The paper demonstrates that fusing the two ConvNets at a convolutional layer is more parameter-efficient and does not degrade performance. Specifically, spatial fusion at the last convolutional layer (ReLU5) through summation or convolution substantially reduces the number of parameters compared to fusion at the softmax layer, while maintaining high accuracy (a minimal sketch of both fusion operations follows this list).
  2. Temporal Fusion through 3D Convolutions and Pooling: The proposed architecture employs 3D convolution and pooling to integrate temporal information effectively. This approach outperforms traditional 2D pooling methods by capturing spatiotemporal dynamics within local neighborhoods of the video sequence.
  3. Hybrid Spatiotemporal Network: The method includes an efficient incorporation of both streams by maintaining spatial registration at the convolutional layer, followed by 3D convolutions and pooling to encapsulate motion information over time. This hybrid network allows for better utilization of both spatial and temporal features without substantially increasing computational complexity.
  4. Robust Evaluation and Comparisons: Comprehensive evaluations on the UCF-101 and HMDB51 datasets confirm that the proposed approach surpasses previous state-of-the-art methods, including the traditional two-stream approach and its variants. Notably, the selected fusion strategies show significant gains in action recognition performance.
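
To make the conv-layer fusion concrete, here is a minimal PyTorch sketch of sum fusion and conv fusion of the two streams' ReLU5 feature maps. The original work used VGG-16 towers in a different framework, so the tensor shapes, layer sizes, and class/parameter names below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Fuse spatial and temporal feature maps at the last conv layer (ReLU5).

    Sum fusion adds the two maps element-wise; conv fusion stacks them along
    the channel axis and learns a 1x1 convolution that mixes corresponding
    channels of the two streams.
    """
    def __init__(self, channels: int = 512, mode: str = "conv"):
        super().__init__()
        self.mode = mode
        if mode == "conv":
            # 1x1 conv over the stacked 2*C channels learns which appearance
            # channel should be combined with which motion channel.
            self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x_spatial: torch.Tensor, x_temporal: torch.Tensor) -> torch.Tensor:
        if self.mode == "sum":
            return x_spatial + x_temporal                   # sum fusion
        fused = torch.cat([x_spatial, x_temporal], dim=1)   # stack channels
        return self.mix(fused)                              # learned conv fusion


# Example (illustrative shapes): VGG-16 ReLU5 maps are 512 x 14 x 14
# for a 224 x 224 input.
spatial_feat = torch.randn(4, 512, 14, 14)   # appearance stream
temporal_feat = torch.randn(4, 512, 14, 14)  # motion (optical-flow) stream
fused = ConvFusion(512, mode="conv")(spatial_feat, temporal_feat)
print(fused.shape)  # torch.Size([4, 512, 14, 14])
```

Because the two towers are merged at this point, only a single set of subsequent layers has to be trained and stored, which is where the parameter saving over softmax-level fusion comes from.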

Numerical Results and Comparison

The paper provides empirical evidence through robust evaluations:

  • Fusion Efficiency: Spatial fusion at ReLU5 using summation or convolution achieves up to 85.96% accuracy on UCF-101 with a considerable reduction in model parameters compared to softmax-layer fusion.

  • Temporal Pooling: Using 3D convolutions in conjunction with 3D pooling boosts recognition accuracy. The combination of 3D convolution and pooling after spatial fusion yields 90.40% on UCF-101 and 58.63% on HMDB51, markedly better than 2D pooling methods (see the sketch after this list).

  • State-of-the-Art Performance: On all three splits of UCF-101, the proposed model (with VGG-16) achieves an overall accuracy of 92.5%. For HMDB51, the combined representation of the spatial (VGG-16) and temporal (VGG-16) models reaches 65.4%. These results signify a considerable improvement over various contemporary action recognition techniques.
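
The temporal-fusion step referenced above can be pictured with a short PyTorch sketch: a 3D convolution mixes the fused conv features across neighbouring frames, and 3D max pooling then summarises each spatiotemporal neighbourhood. Kernel sizes, the temporal window, and all names here are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalPooling(nn.Module):
    """Aggregate a short stack of fused per-frame feature maps over time.

    A 3D convolution mixes features across space and neighbouring frames;
    3D max pooling then collapses each spatiotemporal neighbourhood.
    """
    def __init__(self, channels: int = 512, temporal_window: int = 5):
        super().__init__()
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Pool over the whole temporal window and a 3x3 spatial neighbourhood.
        self.pool3d = nn.MaxPool3d(kernel_size=(temporal_window, 3, 3),
                                   stride=(temporal_window, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return self.pool3d(torch.relu(self.conv3d(x)))


# Example: a snippet of T=5 fused ReLU5 maps (512 x 14 x 14 each).
snippet = torch.randn(2, 512, 5, 14, 14)
pooled = SpatioTemporalPooling(512, temporal_window=5)(snippet)
print(pooled.shape)  # torch.Size([2, 512, 1, 4, 4])
```

The contrast with 2D pooling is that the pooling window here also spans the time axis, so features from neighbouring frames are summarised jointly rather than frame by frame.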

Implications and Future Directions

The proposed fusion strategies offer both practical and theoretical insights. Practically, they provide a scalable approach to refine action recognition systems, particularly useful for applications in surveillance, sports analytics, and human-computer interaction where recognizing complex actions accurately is critical. Theoretically, the paper points to the importance of integrating spatial and temporal features at appropriate abstraction levels to capture the nuanced dynamics inherent in video data.

Looking forward, this research opens up several pathways:

  • Dataset Expansion:

The paper hints at the limitations posed by current datasets, emphasizing the potential for even greater advancements with larger, more diverse datasets containing high-quality annotations.

  • End-to-End Learning:

Future work could explore end-to-end training methodologies that leverage more sophisticated data augmentation techniques and advanced optimization routines to train deeper architectures more effectively.

  • Real-Time Action Recognition:

Optimizing the proposed architecture for real-time performance would be an invaluable contribution, especially for applications in robotics and real-time video analysis where quick decision-making based on action recognition is pivotal.

In conclusion, the paper presents a rigorous exploration of convolutional fusion techniques for video action recognition, leading to a highly efficient and performant model. The findings not only push the boundaries of current methodologies but also lay the groundwork for ongoing innovations in the field of computer vision.