Multi-Person Event Detection in Videos: A Novel Approach with Attention Mechanisms
The paper "Detecting events and key actors in multi-person videos" presents a sophisticated method for recognizing events in videos featuring multiple individuals, using attention mechanisms integrated with Recurrent Neural Networks (RNNs). The primary objective is to identify the subset of individuals who are crucial to specific events within a scene, without explicit annotations of these key actors during training or testing phases.
Methodology Overview
The authors propose an RNN-based model that learns temporal patterns in the features of individuals tracked across video frames. An attention mechanism lets the model dynamically focus on the individuals most relevant to the event at each time step. Crucially, this is achieved without explicit localization or identification of event participants during training, addressing a notable challenge in processing videos with multiple active participants.
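The idea of soft attention over tracked individuals can be illustrated with a minimal sketch: at each frame, per-player features are scored against a running RNN hidden state, normalized into attention weights, and the weighted feature sum drives the state update. This is a simplified stand-in with randomly initialized parameters, not the paper's actual architecture (which uses BLSTMs and learned features); all names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

T, P, D = 5, 4, 8  # frames, tracked players, feature dimension

# Hypothetical per-player features from an upstream tracking + CNN pipeline.
player_feats = rng.normal(size=(T, P, D))

# Randomly initialized parameters for the sketch (in practice, learned).
W_x = rng.normal(scale=0.1, size=(D, D))   # input projection
W_h = rng.normal(scale=0.1, size=(D, D))   # recurrent projection
w_att = rng.normal(scale=0.1, size=(D,))   # attention scoring vector

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(D)          # RNN hidden state
frame_weights = []
for t in range(T):
    # Score each tracked player against the current hidden state.
    scores = np.tanh(player_feats[t] @ W_x + h @ W_h) @ w_att   # shape (P,)
    weights = softmax(scores)              # attention over players, sums to 1
    context = weights @ player_feats[t]    # attention-weighted feature, shape (D,)
    h = np.tanh(context @ W_x + h @ W_h)   # simple RNN state update
    frame_weights.append(weights)

frame_weights = np.stack(frame_weights)    # (T, P): per-frame focus on each player
```

Because the weights are produced by a softmax, they form a distribution over players at every frame, which is what allows the most relevant individual to be read off without any per-actor supervision.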
Datasets
To evaluate the model, the authors curated a new dataset of basketball games with detailed annotations: 257 games with over 14,000 unique event annotations spanning 11 event classes. This provides a rich foundation for evaluating the presented model's efficacy against established benchmarks.
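The kind of annotation described, an event class with a time span inside a game video, can be sketched as a small record type. The field names and example values below are illustrative only and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EventAnnotation:
    game_id: str       # which of the 257 games the clip comes from
    event_class: str   # one of the 11 event classes, e.g. "2-point success"
    start_sec: float   # event start time within the video
    end_sec: float     # event end time within the video

# Hypothetical example annotation.
ann = EventAnnotation(game_id="game_0001",
                      event_class="2-point success",
                      start_sec=312.0,
                      end_sec=316.0)
```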
Experiments and Results
The model surpasses state-of-the-art methods in both event classification and detection on the newly introduced dataset. It also identifies the key players involved in events such as successful 2-point shots, despite never being explicitly trained to detect these actors. Importantly, the attention mechanism consistently highlights relevant players, a behavior the authors quantify with precision-based metrics.
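A precision-style check on the attention can be sketched as follows: for each event, test whether the player receiving the highest attention weight is the annotated key actor. This is a simplified stand-in for the paper's evaluation; the function name and toy numbers are invented for illustration.

```python
import numpy as np

def attention_precision_at_1(att_weights, key_actor_idx):
    """Fraction of events where the highest-attention player is the
    annotated key actor. att_weights: (events, players); key_actor_idx:
    per-event index of the ground-truth key actor."""
    top = np.argmax(att_weights, axis=1)
    return float(np.mean(top == np.asarray(key_actor_idx)))

# Toy example: 3 events, 4 tracked players each.
att = np.array([[0.1, 0.6, 0.2, 0.1],
                [0.7, 0.1, 0.1, 0.1],
                [0.2, 0.2, 0.5, 0.1]])
keys = [1, 0, 3]           # annotated key actor per event
p1 = attention_precision_at_1(att, keys)   # top picks 1, 0, 2 -> 2 of 3 correct
```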
Implications and Future Directions
The research has practical implications for automated sports analysis, surveillance video interpretation, and other domains where multi-person interactions are prevalent. Detecting key actors without per-actor annotations substantially reduces the manual labor needed for data preparation, making the approach highly scalable.
Theoretically, the integration of attention with RNNs in complex, dynamic scenes shows how neural networks can attend to fine-grained aspects of video analysis even when direct supervision is impractical. As the field progresses, such models can likely be extended beyond the sports domain to more varied actions and scenarios across real-world video data.
Conclusion
This paper contributes a robust framework that advances the understanding and capability of video event detection, specifically within the multifaceted scenes of team sports. Building on the paradigm of weakly supervised learning via attention, the method sets a groundwork for future improvements and applications across diverse domains requiring nuanced interpretation of human activities in videos.