
FrameExit: Conditional Early Exiting for Efficient Video Recognition (2104.13400v1)

Published 27 Apr 2021 in cs.CV and cs.LG

Abstract: In this paper, we propose a conditional early exiting framework for efficient video recognition. While existing works focus on selecting a subset of salient frames to reduce the computation costs, we propose to use a simple sampling strategy combined with conditional early exiting to enable efficient recognition. Our model automatically learns to process fewer frames for simpler videos and more frames for complex ones. To achieve this, we employ a cascade of gating modules to automatically determine the earliest point in processing where an inference is sufficiently reliable. We generate on-the-fly supervision signals to the gates to provide a dynamic trade-off between accuracy and computational cost. Our proposed model outperforms competing methods on three large-scale video benchmarks. In particular, on ActivityNet1.3 and mini-kinetics, we outperform the state-of-the-art efficient video recognition methods with 1.3$\times$ and 2.1$\times$ less GFLOPs, respectively. Additionally, our method sets a new state of the art for efficient video understanding on the HVU benchmark.

Authors (3)
  1. Amir Ghodrati (13 papers)
  2. Babak Ehteshami Bejnordi (19 papers)
  3. Amirhossein Habibian (21 papers)
Citations (77)

Summary

FrameExit: Conditional Early Exiting for Efficient Video Recognition

The research paper titled "FrameExit: Conditional Early Exiting for Efficient Video Recognition" introduces a novel framework known as FrameExit, aimed at enhancing the efficiency of video recognition tasks. This framework is particularly relevant due to the increasing volume of video content and the corresponding demand for efficient and scalable methods for recognizing actions or events within videos. While current state-of-the-art models achieve high recognition performance, they incur high computational costs, which limit their application on devices with constrained resources. FrameExit addresses this issue by implementing a conditional early exiting mechanism that adapts the computational effort based on the complexity of the video content.

Methodological Innovations

The key innovation of FrameExit lies in its conditional early exiting strategy, which determines the minimum number of video frames required to make reliable inferences. Unlike existing approaches that require complex sampling mechanisms involving separate neural networks, FrameExit employs a deterministic sampling strategy combined with learned gating modules. These gates assess whether sufficient information has been obtained to make a confident prediction, allowing the model to process fewer frames for simpler videos and more frames for complex ones.
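To make the deterministic sampling strategy concrete, the sketch below uses a bisection-style ordering. This is an illustrative assumption rather than the paper's exact schedule: the first sampled index lies mid-video, and later indices fill in ever finer temporal positions, so even a very early exit has observed the whole clip.

```python
from collections import deque

def coarse_to_fine_order(num_frames):
    """Bisection-style coarse-to-fine sampling order (an illustrative
    assumption, not necessarily the paper's exact schedule): the first
    sampled index lies mid-video, and later indices fill in ever finer
    temporal positions, so even an early exit has seen the whole clip."""
    order, queue = [], deque([(0, num_frames - 1)])
    while queue:
        lo, hi = queue.popleft()       # take the next unexplored interval
        mid = (lo + hi) // 2
        order.append(mid)              # sample the interval's midpoint
        if lo < mid:
            queue.append((lo, mid - 1))
        if mid < hi:
            queue.append((mid + 1, hi))
    return order
```

For an 8-frame budget this yields `[3, 1, 5, 0, 2, 4, 6, 7]`: the model sees the middle of the video first and only later refines both halves, which is what allows a confident gate to exit after very few frames without having ignored part of the timeline.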

Components of FrameExit

  1. Frame Sampling Policy: Instead of relying on complex policy functions parametrized by separate neural networks, FrameExit uses a simple, deterministic, and parameter-free sampling function. This policy samples frames in a coarse-to-fine manner, so that even the earliest exits have observed the full temporal extent of the video.
  2. Accumulated Feature Pooling: The framework introduces an accumulated feature pooling module to incrementally build a video representation as new frames are processed. This approach allows for efficient aggregation of temporal information, optimizing the model’s capability to make informed decisions with minimal computational overhead.
  3. Gating Modules for Early Exiting: These lightweight modules are crucial to FrameExit’s operation, deciding when to halt further processing based on confidence levels derived from current and previous aggregated features. The gates are designed to learn the trade-off between accuracy and processing costs dynamically.
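The three components above can be combined into a minimal inference sketch. All names here (`extract`, `gates`, `classifiers`) are illustrative placeholders rather than the paper's API, and element-wise max stands in for the accumulated feature pooling:

```python
import math

def early_exit_inference(frames, extract, gates, classifiers, threshold=0.5):
    """Sketch of conditional early exiting under the assumptions above:
    process sampled frames one at a time, accumulate a pooled video
    representation, and stop as soon as a gating module deems the
    current prediction reliable."""
    pooled = None
    for t, frame in enumerate(frames):
        feat = extract(frame)                      # per-frame feature vector
        # accumulate features over processed frames (element-wise max here)
        pooled = feat if pooled is None else [max(a, b) for a, b in zip(pooled, feat)]
        # gate t scores the confidence of the aggregated representation
        confidence = 1.0 / (1.0 + math.exp(-gates[t](pooled)))  # sigmoid
        if confidence > threshold:
            return classifiers[t](pooled), t + 1   # early exit after t+1 frames
    return classifiers[-1](pooled), len(frames)    # fell through: full budget used
```

The returned frame count makes the accuracy/compute trade-off explicit: simple videos exit after one or two frames, while ambiguous ones consume the full budget.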

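The on-the-fly supervision signals mentioned in the abstract can be approximated by a simple labeling rule. The rule below is an assumption for illustration, not the paper's exact formulation: a gate is labeled "exit" when its early prediction is already no worse than the prediction obtained from the full frame budget.

```python
def gate_targets(exit_losses, final_loss):
    """Illustrative on-the-fly gate supervision (an assumed scheme, not
    the paper's exact formulation): label gate t with 1 ('exit now')
    when the loss of its early prediction is no worse than the loss
    after processing the full frame budget, else 0 ('keep going')."""
    return [1 if loss <= final_loss else 0 for loss in exit_losses]
```

Targets generated this way change as the backbone improves, so the gates can be trained jointly with the classifier (e.g., with a binary cross-entropy term) while tracking the current accuracy/cost trade-off.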
Empirical Evaluation

FrameExit was evaluated on three large-scale datasets: ActivityNet1.3, Mini-Kinetics, and HVU. The experiments demonstrated that FrameExit significantly outperforms state-of-the-art methods in terms of computational efficiency without compromising accuracy. Notably, on ActivityNet1.3 and Mini-Kinetics, FrameExit reduced computational costs by 1.3× and 2.1×, respectively, when compared to existing methods, while providing comparable or improved recognition performance.

Implications and Future Directions

The success of FrameExit showcases the potential of adaptive computation in video analysis. By tailoring the processing effort to the complexity of the video content, FrameExit ensures efficient resource utilization. This approach is particularly advantageous for applications in mobile devices or edge computing environments where computational resources are limited.

From a theoretical perspective, FrameExit suggests new pathways for efficient deep learning architectures, specifically in the domain of video processing. Future research could explore further enhancements, such as integrating multimodal data or extending the framework to other video-based tasks like video segmentation or anomaly detection.

Conclusion

The FrameExit framework advances the field of video recognition by introducing a model-agnostic, efficient, and effective methodology that leverages conditional early exiting. By dynamically adjusting the computational load based on the perceived complexity of the video, it provides a scalable solution suitable for real-world applications. Researchers and practitioners in AI can potentially build on this framework to develop more sophisticated adaptive video processing systems.
