- The paper introduces YOWO-Plus, which enhances real-time spatio-temporal action detection with stronger backbones, refined label assignment, and an improved loss formulation.
- It pairs a 3D-ResNext-101 backbone with a reimplemented, COCO-pretrained YOLOv2 for stronger spatio-temporal and spatial feature extraction, achieving notable gains on the UCF101-24 and AVA datasets.
- A lightweight variant, YOWO-Nano, demonstrates high efficiency at 91 FPS while maintaining competitive accuracy, making it ideal for real-time video applications.
YOWO-Plus: Incremental Improvements in Spatio-Temporal Action Detection
The paper "YOWO-Plus: An Incremental Improvement" introduces a set of enhancements to YOWO, a popular real-time method for spatio-temporal action detection (STAD). STAD is a significant area in video understanding, with applications ranging from video surveillance to interactive gaming. The YOWO model ("You Only Watch Once") combines a 3D backbone, which extracts spatio-temporal features from the input clip, with a 2D backbone, which extracts spatial features from the key frame, and fuses the two streams with a channel fusion and attention mechanism (CFAM).
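To make the two-branch design concrete, here is a minimal PyTorch-style sketch. It illustrates the general structure only and is not the authors' code: the class name, channel bookkeeping, and the 1x1-conv stand-in for CFAM are all assumptions, and it presumes the 3D backbone collapses the temporal dimension to 1 while both branches emit feature maps of the same spatial size.

```python
import torch
import torch.nn as nn

class YOWOSketch(nn.Module):
    """Two-branch STAD sketch: 3D clip branch + 2D key-frame branch + fusion."""

    def __init__(self, backbone3d: nn.Module, backbone2d: nn.Module,
                 fused_channels: int, num_outputs: int):
        super().__init__()
        self.backbone3d = backbone3d   # e.g. 3D-ResNext-101 (YOWO-Plus)
        self.backbone2d = backbone2d   # e.g. reimplemented YOLOv2
        # Stand-in for CFAM: a 1x1 conv over the concatenated channels.
        # fused_channels must equal (3D channels + 2D channels).
        self.fuse = nn.Conv2d(fused_channels, fused_channels, kernel_size=1)
        self.head = nn.Conv2d(fused_channels, num_outputs, kernel_size=1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W); the key frame is taken to be the last frame.
        feat3d = self.backbone3d(clip).squeeze(2)   # (B, C3, h, w), T' == 1
        feat2d = self.backbone2d(clip[:, :, -1])    # (B, C2, h, w)
        fused = torch.relu(self.fuse(torch.cat([feat3d, feat2d], dim=1)))
        # YOLO-style dense output: per-cell anchor boxes, confidences, classes.
        return self.head(fused)
```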
Enhancements to YOWO
The improvements that turn YOWO into YOWO-Plus fall into three areas: a better backbone, better label assignment, and a better loss function.
- Better Backbone:
- The 3D backbone in YOWO is retained as 3D-ResNext-101, pretrained on the Kinetics dataset, capitalizing on its effective spatio-temporal feature extraction.
- For the 2D backbone, a reimplemented YOLOv2 pretrained on the COCO dataset is used. This reimplementation achieves 27% mAP on COCO, outperforming the original implementation, and thus provides stronger spatial features.
- Better Label Assignment:
- Instead of following the YOLOv2 scheme, in which labels are assigned based on the IoU between ground-truth boxes and predicted boxes, YOWO-Plus computes the IoU between ground-truth boxes and the anchor boxes themselves.
- This adjustment allows multiple positive samples per ground truth whenever an anchor's IoU exceeds the 0.5 threshold, which can improve detection accuracy (see the assignment sketch after this list).
- Better Loss Function:
- The authors adopt the GIoU loss for bounding-box regression, replacing the smooth L1 loss used in YOWO. The total loss combines confidence, classification, and box-regression terms, each weighted for effective learning (see the GIoU sketch after this list).
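A minimal sketch of how the revised assignment could be implemented. This is an interpretation of the description above, not the reference code; ground truth and anchors are compared with their centers aligned, as in standard YOLO-style assignment, and the anchor sizes in the example are made up.

```python
import numpy as np

def assign_anchors(gt_wh, anchor_wh, iou_thresh=0.5):
    """gt_wh: (2,) width/height of one ground-truth box; anchor_wh: (A, 2).

    Returns the indices of anchors treated as positive samples.
    """
    # Centers are aligned, so IoU reduces to overlap of width/height pairs.
    inter = np.minimum(gt_wh, anchor_wh).prod(axis=1)
    union = gt_wh.prod() + anchor_wh.prod(axis=1) - inter
    iou = inter / union
    positives = np.flatnonzero(iou > iou_thresh)
    if positives.size == 0:
        # Safeguard (an assumption, not stated in the summary): fall back
        # to the single best anchor so no ground truth goes unmatched.
        positives = np.array([iou.argmax()])
    return positives

# Example with illustrative anchor widths/heights (normalized coordinates).
anchors = np.array([[0.10, 0.30], [0.25, 0.60], [0.40, 0.80],
                    [0.60, 0.45], [0.85, 0.90]])
print(assign_anchors(np.array([0.30, 0.65]), anchors))  # -> [1 2], two positives
```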
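For the regression term, below is a minimal PyTorch implementation of the standard GIoU loss (Rezatofighi et al., 2019). It is not presented as the authors' exact code, and the weights that combine it with the confidence and classification terms are not reproduced here.

```python
import torch

def giou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Boxes as (x1, y1, x2, y2), shape (N, 4). Returns mean of 1 - GIoU."""
    # Intersection area.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    # Union area.
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)
    # Smallest enclosing box C; GIoU penalizes the empty part of C.
    lt_c = torch.min(pred[:, :2], target[:, :2])
    rb_c = torch.max(pred[:, 2:], target[:, 2:])
    area_c = (rb_c - lt_c).clamp(min=0).prod(dim=1)
    giou = iou - (area_c - union) / area_c.clamp(min=1e-7)
    return (1.0 - giou).mean()
```

Unlike smooth L1 on box coordinates, this loss stays informative even when the predicted and ground-truth boxes do not overlap, which is a common motivation for the switch.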
Introduction of YOWO-Nano
In pursuit of a lightweight detector, the paper introduces YOWO-Nano, which replaces the 3D-ResNext-101 backbone with the more efficient 3D-ShuffleNet-v2 (the swap is illustrated below). Despite its lower computational cost, YOWO-Nano achieves commendable accuracy, making it a compelling choice for applications with strict real-time constraints.
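Using the YOWOSketch class from earlier, the YOWO-Nano change amounts to swapping one constructor argument. The Sequential stubs and channel counts below are placeholders standing in for the real backbones, not the actual networks:

```python
import torch.nn as nn

# Tiny stubs for the real backbones (channel counts are made up).
heavy_3d = nn.Sequential(nn.Conv3d(3, 512, kernel_size=1),
                         nn.AdaptiveAvgPool3d((1, 7, 7)))   # ~3D-ResNext-101
light_3d = nn.Sequential(nn.Conv3d(3, 128, kernel_size=1),
                         nn.AdaptiveAvgPool3d((1, 7, 7)))   # ~3D-ShuffleNet-v2
frame_2d = nn.Sequential(nn.Conv2d(3, 256, kernel_size=1),
                         nn.AdaptiveAvgPool2d((7, 7)))      # ~YOLOv2 features

# 5 anchors x (4 box + 1 confidence + 24 UCF101-24 classes) outputs per cell.
yowo_plus_like = YOWOSketch(heavy_3d, frame_2d, fused_channels=512 + 256,
                            num_outputs=5 * (5 + 24))
yowo_nano_like = YOWOSketch(light_3d, frame_2d, fused_channels=128 + 256,
                            num_outputs=5 * (5 + 24))
```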
Experimental Evaluation
YOWO-Plus and YOWO-Nano have been evaluated extensively on standard benchmarks, demonstrating significant improvements over the original YOWO model:
- UCF101-24 Dataset: YOWO-Plus attains 84.9% frame mAP and 50.5% video mAP, surpassing the original YOWO. On the efficiency side, YOWO-Nano runs at 91 FPS while reaching 81.0% frame mAP and 49.7% video mAP.
- AVA Dataset: YOWO-Plus achieves 21.6% frame mAP with 32-frame clips, outperforming the original YOWO at comparable settings. YOWO-Nano's lightweight architecture trades a small amount of accuracy for substantially higher efficiency.
Implications and Future Directions
These refinements to YOWO illustrate a methodical approach to improving real-time STAD, in which focused architectural and training adjustments yield measurable performance gains. They underscore the value of incremental improvements in model design and training practice.
Practically, the implications of this work are significant for deployments requiring real-time video analysis, such as automated surveillance systems and immersive gaming experiences. Theoretically, the insights into label assignment and loss function adjustments provide fertile ground for further research into optimizing detection methodologies.
Future developments in AI, particularly in STAD, may continue to leverage incremental improvements like those demonstrated by YOWO-Plus to balance efficiency and accuracy. Emerging research might explore more efficient network designs or novel training regimes to further strengthen action detection in increasingly complex and diverse video environments.