Towards Good Practices for Very Deep Two-Stream ConvNets (1507.02159v1)

Published 8 Jul 2015 in cs.CV

Abstract: Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with those very deep models in image domain (e.g. VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, probably more importantly, the training dataset of action recognition is extremely small compared with the ImageNet dataset, and thus it will be easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures into video domain. However, this extension is not easy as the size of action recognition is quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, (iv) high drop out ratio. Meanwhile, we extend the Caffe toolbox into Multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the dataset of UCF101 and it achieves the recognition accuracy of $91.4\%$.


An Examination of Strategies for Enhancing Deep Two-Stream ConvNets in Video Action Recognition

The paper "Towards Good Practices for Very Deep Two-Stream ConvNets" by Wang et al. examines how to make deep convolutional networks (ConvNets) work better for action recognition in videos. ConvNets have significantly advanced object recognition in still images, but their gains in video-based action recognition have been much less evident. The authors attribute this gap to two factors: the comparatively shallow architecture of existing two-stream ConvNets, and the small size of action recognition datasets, which makes overfitting likely.

To address these challenges, Wang et al. adapt very deep architectures that have been highly successful in the image domain, such as VGGNet and GoogLeNet, to video action recognition. The resulting approach, called very deep two-stream ConvNets, combines the added depth of these architectures with a set of good practices for training that reduce overfitting.

Key Contributions

Architectural Adaptation

Deep Architecture: Replacing the shallow networks of the original two-stream model with VGGNet and GoogLeNet architectures increases its modeling capacity for action recognition. The two-stream design keeps separate spatial and temporal networks: the spatial net processes single-frame images, and the temporal net processes stacked optical flow fields.
Network Depth Utilization: Two deep networks were explored: the 22-layer GoogLeNet, distinguished by its Inception modules, and VGGNet-16, a 16-layer network from the VGGNet family (whose deepest variant has 19 layers), known for its small convolutional kernels and strides.
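
To make the temporal net's input concrete, the following is a minimal sketch of how a stack of optical flow fields might be assembled into a single multi-channel array. The function name `stack_flow`, the interleaved x/y channel order, and the stack length of 10 used in the usage lines are assumptions for illustration, not details taken from the summary above.

```python
import numpy as np


def stack_flow(flow_x, flow_y):
    """Interleave L horizontal and L vertical flow fields into one (2L, H, W) array.

    flow_x, flow_y: arrays of shape (L, H, W) holding the x and y components
    of the optical flow for L consecutive frame pairs.
    The interleaved channel order (x0, y0, x1, y1, ...) is an assumption.
    """
    flow_x = np.asarray(flow_x)
    flow_y = np.asarray(flow_y)
    if flow_x.shape != flow_y.shape:
        raise ValueError("flow_x and flow_y must have the same shape")

    num_fields, height, width = flow_x.shape
    stacked = np.empty((2 * num_fields, height, width), dtype=flow_x.dtype)
    stacked[0::2] = flow_x  # even channels: horizontal components
    stacked[1::2] = flow_y  # odd channels: vertical components
    return stacked


# Usage with an assumed stack length of 10 and a small dummy resolution.
fx = np.random.rand(10, 4, 5).astype(np.float32)
fy = np.random.rand(10, 4, 5).astype(np.float32)
temporal_input = stack_flow(fx, fy)
print(temporal_input.shape)
```

With L flow fields, the temporal net therefore sees 2L input channels, whereas the spatial net sees the 3 channels of an RGB frame. This mismatch is why the first layer of an ImageNet model cannot be reused for the temporal net without modification.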

Training Enhancements

Pre-Training: Both the spatial and temporal nets are initialized from ImageNet pre-trained models, with modifications for the temporal network's inputs. This gives a better starting point and compensates for the small size of the training set.
Data Augmentation: Additional augmentation techniques, namely corner cropping (including the center) and multi-scale cropping, increase input variation and help combat overfitting.
Smaller Learning Rates: Because the networks start from pre-trained weights, the authors use smaller learning rates than earlier methods, which helps the very deep models converge.
Dropout Regularization: High dropout ratios guard against overfitting by keeping the network from relying too heavily on particular fully connected layer activations.
Multi-GPU Implementation: The authors extend the Caffe toolbox with a multi-GPU implementation that offers high computational efficiency and low memory consumption.
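
Two of the practices above can be sketched in a few lines of NumPy. The first function shows one plausible way to adapt ImageNet first-layer filters to a flow input: average them over the RGB channels and repeat the result for each flow channel. The second and third show corner cropping combined with multi-scale cropping. All function names, the 20-channel flow input, the 256×340 frame size, and the candidate crop sizes are assumptions chosen for illustration; the summary above does not specify them. Resizing the crop to the network's input size is omitted.

```python
import random

import numpy as np


def adapt_first_layer(rgb_weights, flow_channels):
    """Turn (out, 3, kH, kW) RGB filters into (out, flow_channels, kH, kW) filters.

    Averages over the RGB channel axis, then repeats the average for every
    flow channel. This is an assumed adaptation scheme, shown as a sketch.
    """
    mean = rgb_weights.mean(axis=1, keepdims=True)  # (out, 1, kH, kW)
    return np.repeat(mean, flow_channels, axis=1)


def corner_crop_offsets(img_h, img_w, crop_h, crop_w):
    """Return (top, left) offsets for the four corners and the center."""
    bottom = img_h - crop_h
    right = img_w - crop_w
    return [
        (0, 0),                      # top-left
        (0, right),                  # top-right
        (bottom, 0),                 # bottom-left
        (bottom, right),             # bottom-right
        (bottom // 2, right // 2),   # center
    ]


def multi_scale_corner_crop(frame, sizes=(256, 224, 192, 168), rng=random):
    """Pick crop height and width independently from `sizes`, then pick one
    of the five corner/center positions and return that crop."""
    img_h, img_w = frame.shape[:2]
    crop_h = rng.choice(sizes)
    crop_w = rng.choice(sizes)
    top, left = rng.choice(corner_crop_offsets(img_h, img_w, crop_h, crop_w))
    return frame[top:top + crop_h, left:left + crop_w]


# Usage: adapt dummy 7x7 RGB filters to an assumed 20-channel flow input,
# then take one random crop from a dummy 256x340 frame.
flow_filters = adapt_first_layer(np.random.rand(64, 3, 7, 7), flow_channels=20)
crop = multi_scale_corner_crop(np.zeros((256, 340, 3)), rng=random.Random(0))
print(flow_filters.shape, crop.shape)
```

Sampling height and width independently varies the aspect ratio as well as the scale, and restricting positions to the corners and center ensures that image borders are covered, which purely random cropping tends to under-sample.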

Results and Implications

Evaluated on the UCF101 dataset, the very deep two-stream ConvNets achieved a recognition accuracy of 91.4%, outperforming traditional feature representations such as Improved Trajectories as well as several other deep learning approaches. This result supports the effectiveness of the proposed good practices and architectural choices.

The paper offers practical insight into adopting deep network architectures in the video domain, showing how architectural designs developed for still images can transfer to motion-focused tasks. The work also provides a foundation for future research in video processing, and it may inform other temporal recognition tasks and real-time applications in which action recognition plays a central role.

Future Directions

Several directions remain open: experimenting with additional deep architectures, reducing computational cost (particularly for real-time applications), and combining the approach with newer paradigms such as transformer networks. The work suggests that, with appropriate training and computational strategies, deeper architectures can handle the complexity of video data, which broadens the range of machine learning applications in dynamic visual environments.


Authors (4)

Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
