
Glance and Focus Networks for Dynamic Visual Recognition (2201.03014v2)

Published 9 Jan 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand. Therefore, static models which process all the pixels with an equal amount of computation result in considerable redundancy in terms of time and space consumption. In this paper, we formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system. Specifically, the proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features. The sequential process naturally facilitates adaptive inference at test time, as it can be terminated once the model is sufficiently confident about its prediction, avoiding further redundant computation. It is worth noting that the problem of locating discriminant regions in our model is formulated as a reinforcement learning task, thus requiring no additional manual annotations other than classification labels. GFNet is general and flexible as it is compatible with any off-the-shelf backbone models (such as MobileNets, EfficientNets and TSM), which can be conveniently deployed as the feature extractor. Extensive experiments on a variety of image classification and video recognition tasks and with various backbone models demonstrate the remarkable efficiency of our method. For example, it reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy. Code and pre-trained models are available at https://github.com/blackfeather-wang/GFNet-Pytorch.

Authors (7)
  1. Gao Huang (178 papers)
  2. Yulin Wang (45 papers)
  3. Kangchen Lv (7 papers)
  4. Haojun Jiang (13 papers)
  5. Wenhui Huang (44 papers)
  6. Pengfei Qi (10 papers)
  7. Shiji Song (103 papers)
Citations (40)

Summary

Glance and Focus Networks for Dynamic Visual Recognition

The paper presents a novel approach called Glance and Focus Network (GFNet) designed to enhance the efficiency of visual recognition models by leveraging the spatial redundancy in high-resolution images and videos. The authors propose a dynamic computational strategy that mimics the human visual system's ability to rapidly recognize key features.

Core Concepts and Methodology

GFNet introduces a sequential process of coarse-to-fine feature learning. It begins with a "glance" phase, where a low-resolution global perspective of the input image is processed to make an initial prediction. If confidence is inadequate, the model enters the "focus" stage, where it selectively attends to smaller, higher-resolution regions of interest to refine predictions. This adaptive process can terminate early based on prediction confidence, significantly reducing computational overhead.
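The glance-then-focus loop with confidence-based early termination can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: `glance_net`, `focus_net`, and `locator` are hypothetical modules standing in for the low-resolution global encoder, the high-resolution local encoder, and the region-proposal policy, and the confidence criterion is assumed to be the maximum softmax probability.

```python
import torch
import torch.nn.functional as F

def adaptive_inference(image, glance_net, focus_net, locator,
                       glance_size=96, crop_size=96,
                       max_steps=5, threshold=0.9):
    """Sequential coarse-to-fine inference with early exit (sketch)."""
    # Glance: classify a cheap, downsampled view of the whole image.
    small = F.interpolate(image, size=(glance_size, glance_size),
                          mode="bilinear", align_corners=False)
    logits = glance_net(small)
    probs = F.softmax(logits, dim=1)
    if probs.max() >= threshold:           # confident enough: stop here
        return probs.argmax(dim=1)

    # Focus: iteratively attend to small high-resolution regions.
    for _ in range(max_steps):
        y, x = locator(probs)              # policy proposes the next crop
        crop = image[..., y:y + crop_size, x:x + crop_size]
        logits = logits + focus_net(crop)  # accumulate evidence per step
        probs = F.softmax(logits, dim=1)
        if probs.max() >= threshold:       # early termination
            break
    return probs.argmax(dim=1)
```

Because the loop exits as soon as the confidence threshold is crossed, easy inputs pay only for the glance, while harder inputs trigger additional focus steps.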

The framework employs reinforcement learning to dynamically identify and focus on the most discriminative image regions, without requiring additional annotations beyond standard classification labels. Furthermore, GFNet is designed to be compatible with various backbone models such as MobileNets, EfficientNets, and TSM, offering flexibility in deployment across different tasks and environments.
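As an illustration of how region selection can be trained with only classification labels, the sketch below uses a REINFORCE-style update: a stochastic policy samples a crop location, and the reward is derived from the resulting change in prediction confidence. The architecture and reward function here are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RegionPolicy(nn.Module):
    """Toy stochastic policy over normalised (x, y) crop coordinates.
    Maps the current class probabilities to the mean of a Gaussian."""

    def __init__(self, num_classes, std=0.1):
        super().__init__()
        self.fc = nn.Linear(num_classes, 2)
        self.std = std

    def forward(self, probs):
        mean = torch.sigmoid(self.fc(probs))        # keep mean in [0, 1]
        dist = torch.distributions.Normal(mean, self.std)
        loc = dist.sample()                          # sampled crop centre
        return loc, dist.log_prob(loc).sum(dim=1)    # per-sample log-prob

def reinforce_step(policy, optimizer, probs, reward_fn):
    """One policy-gradient update. `reward_fn` scores a sampled location,
    e.g. by the gain in max-softmax confidence after focusing there, so
    only the classification label is needed to define the signal."""
    loc, log_prob = policy(probs)
    reward = reward_fn(loc)
    loss = -(log_prob * reward).mean()   # maximise expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key property this illustrates is that the gradient flows through the log-probability of the sampled location rather than through the crop itself, which is why no region annotations are required.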

Empirical Evaluation

Extensive experiments demonstrate GFNet's efficiency and adaptability across image and video recognition tasks. On ImageNet, GFNet reduces the average latency of MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy. The approach also reduces computational cost (FLOPs) substantially, achieving up to a 3x reduction compared with static models that process all pixels uniformly.

Implications and Future Directions

GFNet's ability to dynamically allocate computational resources has significant implications for real-world applications, especially where computational power and battery life are limited, such as on mobile devices and IoT systems. The framework's adaptability allows it to utilize available computational budgets effectively while maintaining robust performance.

Looking forward, GFNet could be extended to tackle other visual tasks like object detection and segmentation. Its reinforcement learning-based region proposal mechanism could also inspire advancements in weakly supervised learning and autonomous systems requiring efficient scene understanding.

Conclusion

GFNet provides a compelling solution to the challenge of spatial redundancy in visual recognition, marrying efficiency with accuracy through its innovative dynamic approach. Its general applicability and notable improvements in computational efficiency make it a substantial contribution to the field of dynamic neural networks and efficient deep learning.
