You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization (1911.06644v5)

Published 15 Nov 2019 in cs.CV

Abstract: Spatiotemporal action localization requires the incorporation of two sources of information into the designed architecture: (1) temporal information from the previous frames and (2) spatial information from the key frame. Current state-of-the-art approaches usually extract this information with separate networks and use an extra mechanism for fusion to get detections. In this work, we present YOWO, a unified CNN architecture for real-time spatiotemporal action localization in video streams. YOWO is a single-stage architecture with two branches to extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Since the whole architecture is unified, it can be optimized end-to-end. The YOWO architecture is fast, providing 34 frames-per-second on 16-frames input clips and 62 frames-per-second on 8-frames input clips, which is currently the fastest state-of-the-art architecture on spatiotemporal action localization task. Remarkably, YOWO outperforms the previous state-of-the-art results on J-HMDB-21 and UCF101-24 with an impressive improvement of ~3% and ~12%, respectively. Moreover, YOWO is the first and only single-stage architecture that provides competitive results on AVA dataset. We make our code and pretrained models publicly available.

Citations (125)

Summary

  • The paper introduces YOWO, a one-stage CNN that integrates 2D spatial and 3D temporal feature extraction for efficient action localization.
  • It incorporates a Channel Fusion and Attention Mechanism (CFAM) to enhance multimodal feature integration, significantly boosting detection performance.
  • The model delivers real-time processing at 34 fps and achieves notable frame-mAP improvements on datasets like J-HMDB-21 and UCF101-24.

Overview of YOWO: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

The paper "You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization" presents a comprehensive paper on spatiotemporal action localization within video streams, aiming to efficiently integrate temporal and spatial information. To address the limitations of existing multi-stage systems, the authors propose YOWO, a single-stage CNN architecture optimized for real-time performance.

Key Contributions

YOWO innovatively combines temporal data from sequences of frames with spatial details extracted from key frames through a unified network. This approach addresses the inefficiencies in two-stage frameworks where proposal generation and classification occur separately. YOWO instead integrates these processes, enabling end-to-end optimization. Here are the main components and contributions:

  1. Architecture Design:
    • YOWO employs a dual-branch design that concurrently extracts spatiotemporal features using a 3D-CNN and spatial features using a 2D-CNN.
    • The architecture includes a Channel Fusion and Attention Mechanism (CFAM) that combines and enhances the multimodal features from the two branches (a minimal schematic sketch of the overall layout follows this list).
  2. Performance Metrics:
    • The model achieves a throughput of 34 fps for 16-frame inputs, greatly surpassing the speed of contemporary methods.
    • It records substantial improvements in action localization, with frame-mAP increases of approximately 3% on J-HMDB-21 and 12% on UCF101-24 datasets.
  3. Extensibility:
    • Beyond its primary RGB input capability, YOWO can incorporate other modalities such as optical flow and depth, demonstrating versatility.
    • The architecture allows for easy substitution of its 2D and 3D components to meet different performance requirements.
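
To make the dual-branch design above concrete, the following is a minimal PyTorch sketch of how a clip and its key frame could flow through a 3D branch, a 2D branch, channel-wise fusion, and a YOLO-style detection head. The stand-in conv stacks, channel sizes, and 7x7 output grid are illustrative assumptions, not the authors' configuration; the paper's model uses pretrained backbones (3D-ResNeXt-101 and Darknet-19 by default) and the CFAM block for fusion.

```python
import torch
import torch.nn as nn

class YOWOSketch(nn.Module):
    """Illustrative two-branch layout; not the paper's exact configuration."""

    def __init__(self, num_classes=24, num_anchors=5, c3d=128, c2d=64):
        super().__init__()
        # 3D branch: consumes the whole clip and collapses the temporal axis.
        # (The paper uses a pretrained 3D-ResNeXt-101 here.)
        self.branch_3d = nn.Sequential(
            nn.Conv3d(3, c3d, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 7, 7)),          # -> [B, c3d, 1, 7, 7]
        )
        # 2D branch: consumes only the key frame (the last frame of the clip).
        # (The paper uses Darknet-19 here.)
        self.branch_2d = nn.Sequential(
            nn.Conv2d(3, c2d, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((7, 7)),             # -> [B, c2d, 7, 7]
        )
        # Fusion + detection head: in the paper this is the CFAM block followed
        # by a 1x1 conv; here it is reduced to a single 1x1 conv per grid cell.
        out_ch = num_anchors * (4 + 1 + num_classes)  # box, objectness, classes
        self.head = nn.Conv2d(c3d + c2d, out_ch, kernel_size=1)

    def forward(self, clip):                          # clip: [B, 3, T, H, W]
        key_frame = clip[:, :, -1]                    # [B, 3, H, W]
        f3d = self.branch_3d(clip).squeeze(2)         # [B, c3d, 7, 7]
        f2d = self.branch_2d(key_frame)               # [B, c2d, 7, 7]
        fused = torch.cat([f3d, f2d], dim=1)          # channel-wise concatenation
        return self.head(fused)                       # [B, out_ch, 7, 7]

preds = YOWOSketch()(torch.randn(1, 3, 16, 224, 224))
print(preds.shape)                                    # torch.Size([1, 145, 7, 7])
```

The key point the sketch illustrates is that a single forward pass over the clip yields per-cell box and class predictions, which is what makes end-to-end optimization and real-time inference possible.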

Experimental Validation

YOWO was rigorously assessed across several datasets, including J-HMDB-21, UCF101-24, and AVA:

  • J-HMDB-21 and UCF101-24: YOWO outperformed existing models, with frame-mAP improvements of 3.3% and 12.2%, respectively (a toy illustration of the frame-level IoU test behind frame-mAP follows this list).
  • AVA: It remains the first single-stage network to provide competitive results, marking a novel contribution in this domain.
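
Frame-mAP scores detections independently on each key frame: a predicted box counts as a true positive when its intersection-over-union (IoU) with an unmatched ground-truth box of the same action class exceeds the commonly used 0.5 threshold, and average precision is then accumulated over score-ranked detections across all frames. The snippet below sketches only the IoU test underlying that matching; it follows the standard detection-metric convention rather than code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# A detection is a true positive at the usual 0.5 threshold if its IoU with an
# unmatched ground-truth box of the same action class is at least 0.5.
print(iou((10, 10, 60, 110), (20, 15, 70, 100)))   # ~0.58 -> counts as a match
```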

The model's efficacy was further validated through detailed ablation studies, which emphasized the complementary strengths of its 2D and 3D branches and the influence of the CFAM module on feature integration.
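
For readers curious how the CFAM's feature integration can work in practice, the following is a hedged sketch of Gram-matrix-style channel attention, the mechanism the CFAM builds on: the concatenated 2D/3D feature map is flattened, channel-to-channel affinities are computed and softmax-normalized, and the result reweights the channels through a residual connection with a learnable scale. The convolutional layers that wrap the attention step in the actual CFAM block are omitted here, so treat this as a simplified illustration rather than the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionSketch(nn.Module):
    """Gram-matrix channel attention in the spirit of CFAM (simplified)."""

    def __init__(self):
        super().__init__()
        # Learnable scale for the attention branch, initialized at zero so the
        # block starts as an identity mapping and learns how much to mix in.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):                                  # x: [B, C, H, W]
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                         # [B, C, N]
        gram = torch.bmm(flat, flat.transpose(1, 2))       # [B, C, C] affinities
        attn = F.softmax(gram, dim=-1)                     # normalize per channel
        mixed = torch.bmm(attn, flat).view(b, c, h, w)     # reweighted channels
        return self.alpha * mixed + x                      # residual connection

# Toy usage: fuse hypothetical 3D-branch and 2D-branch features by concatenation.
fused = torch.cat([torch.randn(1, 128, 7, 7), torch.randn(1, 64, 7, 7)], dim=1)
print(ChannelAttentionSketch()(fused).shape)               # torch.Size([1, 192, 7, 7])
```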

Implications and Future Directions

The development of YOWO has broad implications for fields requiring real-time human action detection, such as human-computer interaction (HCI) systems, UAV-based surveillance, and autonomous systems. Its unified design, capable of processing complex spatiotemporal interactions with high efficiency, sets a precedent for future research in video analysis.

Further work could explore integrating additional modalities and refining the CFAM to optimize feature fusion. Moreover, as computational resources evolve, leveraging higher-dimensional data or stronger backbone architectures could be avenues for improving both speed and accuracy.

Conclusion

This paper contributes significantly to the field of spatiotemporal action localization by presenting a unified, efficient architecture that processes video streams in real time while maintaining state-of-the-art performance. YOWO represents a notable advance in simplifying and improving the detection of human actions in complex video data. The paper's outcomes not only pave the way for future research into unified detection architectures but also point to potential applications across a growing range of technological domains.