Overview of "Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition"
The paper "Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition" by Yulin Wang et al. presents a novel approach to enhancing the efficiency of Vision Transformers (ViT). ViT, since its inception, has revolutionized image recognition by utilizing the transformer architecture that was originally designed for NLP tasks. The traditional method of representing images involves splitting them into fixed-sized patches, typically 16x16 or 14x14, which are then used as tokens for transformer input. While increasing the number of tokens can enhance accuracy, it also exponentially increases computational load, presenting a challenge for practical deployment.
Key Innovations
The paper introduces a Dynamic Vision Transformer (DVT) framework, which adjusts the number of tokens on a per-image basis to balance accuracy and computational efficiency. The core idea rests on the observation that not all images require fine-grained tokenization; many can be accurately classified with far fewer tokens. The adjustment is realized through a cascade of transformers, each operating on a progressively finer token grid. Inference halts as soon as an image is recognized with sufficient confidence, sparing simpler images the redundant computation of the later, more expensive stages.
Key features of the DVT include:
- Adaptive Token Configuration: Each image is represented using a variable number of tokens, minimizing unnecessary processing for simpler images.
- Feature and Relationship Reuse: These mechanisms further improve efficiency by letting each downstream transformer reuse features and attention relationships already computed by the upstream transformers in the cascade, minimizing redundant computation while maintaining accuracy (a minimal sketch of the cascade and this reuse follows this list).
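The sketch below shows how such a confidence-gated cascade could be wired up in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the stage models (`exits`) and `thresholds` are hypothetical placeholders, and the feature/relationship reuse between stages is only noted in a comment.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dvt_style_predict(image, exits, thresholds):
    """Run a cascade of vision transformers with increasingly fine token grids,
    stopping as soon as a stage's prediction confidence clears its threshold.

    `exits` is assumed to be a list of models, each taking a (1, 3, H, W) image
    and returning class logits; `thresholds` holds one confidence cutoff per
    early-exit stage (the final stage always returns its prediction).
    Processes a single image at a time for clarity.
    """
    for stage, transformer in enumerate(exits):
        # In the actual DVT, later stages would also reuse cached features and
        # attention maps from earlier stages; that detail is omitted here.
        logits = transformer(image)
        probs = F.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)
        is_last = stage == len(exits) - 1
        if is_last or confidence.item() >= thresholds[stage]:
            return prediction.item(), confidence.item()
```

In practice, the thresholds would be tuned on a validation set to meet a target accuracy-versus-computation budget, so that easy images exit at the coarse stages while hard ones proceed to the finest token grid.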
Empirical Results
The paper reports extensive empirical validation on standard benchmarks, including ImageNet, CIFAR-10, and CIFAR-100. DVT matches or exceeds the accuracy of baseline models at substantially lower computational cost: on ImageNet it reaches comparable accuracy with up to 3.6x less computation than competitive counterparts, and on CIFAR-10/100 it delivers competitive results with a 3-9x reduction in FLOPs.
Practical Implications
The DVT framework is particularly appealing for real-world settings where computational resources are constrained, such as mobile or internet-of-things (IoT) devices. Because it allocates computation according to image difficulty, the approach can translate into lower power consumption, faster average inference, and a smaller carbon footprint. It also opens avenues for further research into adaptive computation beyond computer vision, potentially benefiting a broad range of AI applications.
Theoretical Implications and Future Directions
Theoretically, the paper challenges the prevailing assumption that a uniform token granularity is optimal for model performance across all visual data. By introducing a dynamic approach, the authors contribute to a growing body of research advocating for more flexible, input-adaptive models. Future research could explore the integration of DVT with other AI tasks that utilize transformers, potentially leading to novel architectures in object detection, video processing, and multimodal learning.
In conclusion, this paper provides a significant step forward in enhancing the computational efficiency of vision transformers. The proposed dynamic framework not only improves practical applicability but also encourages further investigation into adaptive model architectures in the broader machine learning community.