Efficient Action Recognition in Untrimmed Videos Using Audio Previews
In the pursuit of more efficient action recognition in videos, Gao et al. present a framework that uses audio as a preview mechanism. The paper, titled "Listen to Look: Action Recognition by Previewing Audio," details a dual approach that tackles both clip-level and video-level redundancy in long, untrimmed videos. It carefully evaluates the balance between recognition accuracy and computational cost, positioning the approach as state-of-the-art in the accuracy-efficiency trade-off for untrimmed-video recognition.
The cornerstone of their methodology is IMGAUD2VID, a distillation framework introduced to address clip-level redundancy. A lightweight student network is trained to approximate the descriptor of an expensive clip-based model, such as a 3D Convolutional Neural Network (3D CNN), using just a single frame and its accompanying audio. Because consecutive frames within a clip are highly redundant, one frame plus the audio track captures most of the clip's content, so the distilled image-audio features retain accuracy while avoiding the heavy cost of the original clip-level descriptors.
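To make the distillation idea concrete, the following is a minimal sketch of how such a student could be trained in PyTorch. The module names (ImgAudDistiller, distillation_step), the encoder placeholders, and the choice of an MSE feature-matching term plus a classification term are illustrative assumptions, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImgAudDistiller(nn.Module):
    """Student that approximates a clip-level descriptor from one frame plus audio (illustrative)."""
    def __init__(self, image_encoder, audio_encoder, feat_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g., a 2D CNN over a single frame
        self.audio_encoder = audio_encoder   # e.g., a 2D CNN over a log-mel spectrogram
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, frame, spectrogram):
        z_img = self.image_encoder(frame)        # (B, feat_dim)
        z_aud = self.audio_encoder(spectrogram)  # (B, feat_dim)
        return self.fusion(torch.cat([z_img, z_aud], dim=1))

def distillation_step(student, teacher, classifier, frame, spectrogram, clip, labels):
    """One training step: match the teacher's clip feature and keep a label loss (assumed objective)."""
    with torch.no_grad():
        target = teacher(clip)                 # expensive clip descriptor, e.g., from a 3D CNN
    pred = student(frame, spectrogram)         # cheap image-audio approximation
    feat_loss = F.mse_loss(pred, target)       # feature-matching (distillation) term
    cls_loss = F.cross_entropy(classifier(pred), labels)
    return feat_loss + cls_loss
```

The key design point is that the teacher runs only during training; at inference, the single-frame-plus-audio student stands in for the full clip descriptor.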
Building upon this foundation, the authors introduce IMGAUD-SKIMMING to handle video-level, long-term redundancy. An attention-based Long Short-Term Memory (LSTM) network iteratively selects the most useful moments of an untrimmed video, skipping over less informative segments; the cheap image-audio features serve as previews that point the model toward key events, supporting efficient and accurate video-level recognition. Combined, the two techniques optimize the trade-off between speed and accuracy, cutting computational cost substantially while preserving recognition performance.
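The skimming mechanism can be sketched as follows; the AudioSkimmer name, the dimensions, and the dot-product attention are assumptions made for illustration, and this simplified version attends over the cheap image-audio features only, whereas the paper's module also retrieves richer features for the moments it selects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioSkimmer(nn.Module):
    """Recurrently attends over cheap per-clip image-audio features to pick useful moments (illustrative)."""
    def __init__(self, feat_dim=512, hidden_dim=512, num_classes=200, num_steps=10):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, feat_dim)   # turns the hidden state into an attention query
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.num_steps = num_steps

    def forward(self, index_feats):
        # index_feats: (B, T, feat_dim) cheap image-audio features for the T clips of one video
        B, T, D = index_feats.shape
        h = index_feats.new_zeros(B, self.lstm.hidden_size)
        c = index_feats.new_zeros(B, self.lstm.hidden_size)
        selected = []
        for _ in range(self.num_steps):
            q = self.query(h)                                                            # (B, D)
            attn = F.softmax(torch.bmm(index_feats, q.unsqueeze(2)).squeeze(2), dim=1)   # (B, T)
            feat = torch.bmm(attn.unsqueeze(1), index_feats).squeeze(1)                  # soft-selected feature (B, D)
            h, c = self.lstm(feat, (h, c))
            selected.append(feat)
        video_feat = torch.stack(selected, dim=1).mean(dim=1)   # aggregate the selected moments
        return self.classifier(video_feat)
```

Soft attention keeps the selection differentiable, so the skimming network can be trained end-to-end with an ordinary classification loss, and a fixed number of steps bounds the computation regardless of video length.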
Extensive experiments on four action recognition datasets, Kinetics-Sounds, Mini-Sports1M, ActivityNet, and UCF-101, demonstrate the framework's effectiveness. It achieves accuracy competitive with, and in some settings better than, traditional clip-based methods while requiring far less computation, underscoring the value of audio as a preview signal for action recognition.
The implications of this research are multifaceted. Practically, the framework can substantially reduce computational costs in applications such as video recommendation and summarization, offering a scalable way to cope with the ever-growing volume of video data. Theoretically, it underscores the role audio can play in multi-modal action recognition and points toward further exploration of cross-modal distillation and its applications in spatiotemporal data processing.
Future work might pursue finer-grained action understanding by coupling temporal frame selection with spatial region selection. Extensions to multi-label settings and to other data modalities could further broaden the approach's applicability. Overall, the research sits at the intersection of efficiency and action understanding and marks a promising direction for next-generation video processing systems.