All Tokens Matter: Token Labeling for Training Better Vision Transformers (2104.10858v3)

Published 22 Apr 2021 in cs.CV

Abstract: In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs). Different from the standard training objective of ViTs that computes the classification loss on an additional trainable class token, our proposed one takes advantage of all the image patch tokens to compute the training loss in a dense manner. Specifically, token labeling reformulates the image classification problem into multiple token-level recognition problems and assigns each patch token with an individual location-specific supervision generated by a machine annotator. Experiments show that token labeling can clearly and consistently improve the performance of various ViT models across a wide spectrum. For a vision transformer with 26M learnable parameters serving as an example, with token labeling, the model can achieve 84.4% Top-1 accuracy on ImageNet. The result can be further increased to 86.4% by slightly scaling the model size up to 150M, delivering the minimal-sized model among previous models (250M+) reaching 86%. We also show that token labeling can clearly improve the generalization capability of the pre-trained models on downstream tasks with dense prediction, such as semantic segmentation. Our code and all the training details will be made publicly available at https://github.com/zihangJiang/TokenLabeling.

Citations (197)

Summary

  • The paper introduces token labeling, a method that reformulates training loss using patch-level labels and boosts accuracy, with results up to 86.4% Top-1 on ImageNet.
  • It leverages dense, location-specific supervision to enrich model learning, improving performance on both image classification and downstream segmentation tasks.
  • Experiments validate the efficacy of token labeling across various model scales, establishing it as a robust technique for training more generalizable Vision Transformers.

Overview of "All Tokens Matter: Token Labeling for Training Better Vision Transformers"

The paper "All Tokens Matter: Token Labeling for Training Better Vision Transformers" introduces an innovative training approach named token labeling for vision transformers (ViTs). This new training objective diverges from conventional methods by utilizing all image patch tokens to compute loss, thus enhancing model performance in vision tasks.

Core Contributions

The primary contribution of this paper is the concept of token labeling, which transforms the image classification process into numerous token-level recognition tasks. This method assigns each image patch token a distinct, location-specific label generated by a machine annotator. By doing so, token labeling leverages dense supervision and enriches the learning experience for the model, leading to improved accuracy and generalization in ViTs.
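To make the idea concrete, the snippet below gives a minimal sketch of such a combined objective. It is an illustrative reading, not the authors' implementation: it assumes a ViT whose forward pass returns both class-token logits and per-patch-token logits, soft location-specific labels precomputed by a machine annotator for every patch, and a weighting factor `beta` for the auxiliary term; all of these names and the exact weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, patch_logits, image_label, token_labels, beta=0.5):
    """Illustrative combined loss for token labeling (not the official implementation).

    cls_logits:   (B, C)    logits from the class token
    patch_logits: (B, N, C) logits predicted for each of the N patch tokens
    image_label:  (B,)      ground-truth image-level class indices
    token_labels: (B, N, C) soft, location-specific score maps from a machine annotator
    beta:         weight of the auxiliary dense term (assumed hyperparameter)
    """
    # Standard image-level classification loss on the class token.
    cls_loss = F.cross_entropy(cls_logits, image_label)

    # Dense auxiliary loss: soft cross-entropy between each patch token's
    # prediction and its location-specific label, averaged over all tokens.
    log_probs = F.log_softmax(patch_logits, dim=-1)            # (B, N, C)
    token_loss = -(token_labels * log_probs).sum(dim=-1).mean()

    return cls_loss + beta * token_loss
```

The key point the sketch captures is that every patch token receives its own supervision signal in addition to the single image-level label, which is what turns classification into many token-level recognition problems.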

Experimentation and Results

Through extensive experimentation, the authors demonstrate the efficacy of token labeling across various ViT models. For instance, a ViT with 26M learnable parameters achieves 84.4% Top-1 accuracy on ImageNet, outperforming comparably sized models. Scaling the model to 150M parameters raises accuracy further to 86.4%, making it the smallest model reported to reach 86%, a level previously attained only by models with more than 250M parameters.

Furthermore, the paper shows that pretrained models using token labeling exhibit enhanced performance on downstream tasks involving dense predictions, such as semantic segmentation. This robust improvement is attributed to the location-specific information provided by token labels.

Technical Insights

  • Vision Transformer Architecture: The paper builds on the versatility of transformers, initially designed for NLP, and applies them to vision tasks, focusing on their capability to capture long-range dependencies through self-attention.
  • Token Labeling Technique: The conventional classification loss is reformulated so that individual patch tokens are also supervised. Token labeling is applied as an auxiliary objective alongside the class-token loss, giving the model location-specific supervision at every patch (as sketched under Core Contributions above).
  • MixToken Augmentation: A modified version of CutMix, MixToken operates on the token sequence after patch embedding, so each token keeps clean, single-image content, which in turn improves the token labeling process (see the sketch after this list).
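The following is a minimal sketch of the MixToken idea under the assumption that mixing happens on the token sequence after patch embedding, that the region mask is applied per token position, and that the dense token labels are mixed with the same mask. The pairing scheme, mask sampling, and function name are illustrative choices, not the paper's code.

```python
import torch

def mix_token(tokens, token_labels, mask):
    """Illustrative MixToken: CutMix-style mixing on the patch-token sequence.

    tokens:       (B, N, D) patch embeddings
    token_labels: (B, N, C) dense labels aligned with the tokens
    mask:         (N,)      binary mask over token positions (1 = take the token
                            from the paired sample), sampled like a CutMix box
                            on the patch grid and flattened
    """
    # Pair each sample with another one in the batch (here: simple reversal).
    perm = torch.flip(torch.arange(tokens.size(0)), dims=[0])
    m = mask.view(1, -1, 1).to(tokens.dtype)

    # Mix whole tokens (and their labels) rather than raw pixels, so every
    # resulting token keeps clean, single-image content.
    mixed_tokens = tokens * (1 - m) + tokens[perm] * m
    mixed_labels = token_labels * (1 - m) + token_labels[perm] * m
    return mixed_tokens, mixed_labels
```

Because the mask acts on whole tokens rather than raw pixels, no token straddles the cut boundary, which is the property that keeps the dense token labels consistent with the mixed input.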

Practical and Theoretical Implications

The proposed token labeling improves learning efficiency and model performance without significantly escalating computational costs. The implication is particularly significant for tasks requiring fine-grained feature extraction and segmentation. Practically, this could translate to better performance in real-world applications involving image classification and object detection.

Theoretically, this work expands the understanding of how dense, location-specific supervision can benefit the training of neural networks, particularly transformers, challenging the traditional perspective that focuses heavily on global class token utilization.

Future Work

As a future direction, exploring token labeling with other model architectures, such as CNNs and MLPs, could clarify its wider applicability. Moreover, investigating different machine annotators could improve the quality of the generated score maps, providing even more precise supervision.

In summary, the paper makes a compelling case for token labeling as a way to improve the accuracy and generalization of vision transformers across computer vision tasks, marking a noteworthy step for the field with implications for both theoretical understanding and practical deployment.