- The paper introduces outlook attention, a novel mechanism that efficiently encodes fine-level features and enables VOLO to surpass conventional CNNs and ViTs.
- The two-stage architecture combines localized feature encoding from small image patches with global self-attention for enhanced contextual understanding.
- VOLO demonstrates impressive performance with 87.1% top-1 accuracy on ImageNet-1K and strong transferability to tasks like semantic segmentation.
VOLO: Vision Outlooker for Visual Recognition
The paper "VOLO: Vision Outlooker for Visual Recognition" presents a novel approach to improving visual recognition using an attention-based architecture, surpassing the performance of traditional Convolutional Neural Networks (CNNs) and existing Vision Transformers (ViTs). The authors introduce the Vision Outlooker (VOLO), a simple yet effective architecture that focuses on encoding fine-level features into token representations through a newly proposed outlook attention mechanism.
Core Contributions
The central innovation of this paper lies in its outlook attention mechanism, which addresses the limitations of ViTs in encoding fine-level features crucial for visual recognition. Unlike conventional self-attention, which models global dependencies at a coarse level, outlook attention embeds local, fine-grained details into token representations. It does so cheaply: the attention weights over each local window are generated directly from the center token with a simple linear layer, so no expensive query-key dot products are computed.
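To make the mechanism concrete, below is a minimal PyTorch sketch of outlook attention, assuming stride 1 and omitting dropout, bias handling, and other details of the released implementation; the class and parameter names are illustrative rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Sketch of outlook attention (stride 1): the weights over each K x K
    local window are produced directly from the centre token by a linear
    layer, so no query-key dot products are needed."""

    def __init__(self, dim, num_heads=6, kernel_size=3, padding=1):
        super().__init__()
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.padding = padding
        self.scale = (dim // num_heads) ** -0.5
        self.v = nn.Linear(dim, dim)                              # value projection
        self.attn = nn.Linear(dim, num_heads * kernel_size ** 4)  # per-window weights
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(kernel_size, padding=padding)

    def forward(self, x):                                  # x: (B, H, W, C)
        B, H, W, C = x.shape
        K, h = self.kernel_size, self.num_heads

        # Gather the K*K neighbourhood of values around every position.
        v = self.v(x).permute(0, 3, 1, 2)                  # (B, C, H, W)
        v = self.unfold(v)                                 # (B, C*K*K, H*W)
        v = v.reshape(B, h, C // h, K * K, H * W).permute(0, 1, 4, 3, 2)

        # Attention weights come straight from a linear map of each token.
        a = self.attn(x).reshape(B, H * W, h, K * K, K * K).permute(0, 2, 1, 3, 4)
        a = (a * self.scale).softmax(dim=-1)               # (B, h, H*W, K*K, K*K)

        # Weighted aggregation, then fold the windows back onto the grid.
        out = (a @ v).permute(0, 1, 4, 3, 2).reshape(B, C * K * K, H * W)
        out = F.fold(out, output_size=(H, W), kernel_size=K, padding=self.padding)
        return self.proj(out.permute(0, 2, 3, 1))          # (B, H, W, C)
```

Because the window is small (e.g., 3×3), this dense local attention adds fine-level detail at a fraction of the cost of global self-attention over all tokens.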
Architecture Design
VOLO's architecture is constructed with a two-stage approach:
- Stage 1: Applies a stack of Outlookers (outlook attention blocks) to encode fine-level token representations on the fine-grained token grid produced by small image patches (e.g., 8×8 pixels per patch).
- Stage 2: Utilizes standard self-attention to build global dependencies on a downsampled, coarser token grid (e.g., 14×14 tokens).
The VOLO model family comprises several variants, such as VOLO-D1 to VOLO-D5, each differing in scale, parameter count, and computational efficiency.
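The skeleton below illustrates how the two stages fit together. It reuses the OutlookAttention sketch above and assumes hypothetical widths (192 in stage 1, 384 in stage 2, loosely following the smaller variants); Outlooker MLPs, the class token, and other details of the released implementation are omitted, so this is a structural sketch rather than the authors' code.

```python
import torch
import torch.nn as nn

class VOLOSketch(nn.Module):
    """Two-stage skeleton (illustrative only): stage 1 runs Outlookers on a
    fine 8x8-patch grid, stage 2 runs transformer blocks on a 2x downsampled
    grid."""

    def __init__(self, dim=192, num_classes=1000, depth1=4, depth2=8):
        super().__init__()
        # 8x8 patch embedding: a 224x224 image becomes a 28x28 token grid.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        # Stage 1: outlook attention blocks (MLPs and norms omitted here).
        self.outlookers = nn.ModuleList(
            [OutlookAttention(dim) for _ in range(depth1)])
        # 2x2 downsampling: 28x28 -> 14x14 tokens, doubling the width.
        self.downsample = nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2)
        # Stage 2: ordinary global self-attention transformer blocks.
        layer = nn.TransformerEncoderLayer(
            d_model=dim * 2, nhead=8, dim_feedforward=dim * 6,
            batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth2)
        self.norm = nn.LayerNorm(dim * 2)
        self.head = nn.Linear(dim * 2, num_classes)

    def forward(self, img):                              # img: (B, 3, 224, 224)
        x = self.patch_embed(img).permute(0, 2, 3, 1)    # (B, 28, 28, dim)
        for blk in self.outlookers:
            x = x + blk(x)                               # fine-level local attention
        x = self.downsample(x.permute(0, 3, 1, 2))       # (B, 2*dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)                 # (B, 196, 2*dim)
        x = self.transformer(x)                          # global self-attention
        x = self.norm(x).mean(dim=1)                     # mean-pool tokens
        return self.head(x)                              # class logits
```

Scaling this template in depth and width is, in spirit, how the D1 to D5 variants trade parameters and compute for accuracy.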
Experimental Results
VOLO achieves significant performance gains on key benchmarks:
- ImageNet-1K: Attains 87.1% top-1 accuracy without additional training data, a new state of the art at the time of publication.
- Transferability: Demonstrates strong transfer performance to downstream tasks like semantic segmentation, achieving 84.3% mIoU on Cityscapes and 54.3% on ADE20K, outperforming prior models such as Swin Transformer.
Theoretical and Practical Implications
From a theoretical standpoint, the VOLO framework challenges existing paradigms by showing that attention-based models, when enhanced with fine-level context encoding, can outperform CNNs without relying on large-scale extra pre-training data. Practically, VOLO's parameter efficiency and scalability make it an attractive choice for deployment in real-world applications where computational resources are often a constraint.
Future Directions and Developments
The positive outcomes from VOLO suggest several avenues for future exploration:
- Generalization to Other Modalities: Investigating the extension of outlook attention to non-visual contexts, such as natural language processing or multi-modal tasks.
- Optimized Architectures: Further refining the architecture to enhance efficiency without sacrificing accuracy, especially in compute-constrained environments.
- Extended Applications: Applying VOLO to other vision-related tasks, including object detection and video analysis, to benchmark its versatility and robustness.
In summary, the paper marks a notable step forward in visual recognition, offering a compelling alternative to traditional CNNs and vanilla ViTs by pairing localized outlook attention with global self-attention. VOLO thus provides a solid foundation for future innovations in deep learning and computer vision.