TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers (2207.08409v3)

Published 18 Jul 2022 in cs.CV

Abstract: CutMix is a popular augmentation technique commonly used for training modern convolutional and transformer vision networks. It was originally designed to encourage Convolution Neural Networks (CNNs) to focus more on an image's global context instead of local information, which greatly improves the performance of CNNs. However, we found it to have limited benefits for transformer-based architectures that naturally have a global receptive field. In this paper, we propose a novel data augmentation technique TokenMix to improve the performance of vision transformers. TokenMix mixes two images at token level via partitioning the mixing region into multiple separated parts. Besides, we show that the mixed learning target in CutMix, a linear combination of a pair of the ground truth labels, might be inaccurate and sometimes counter-intuitive. To obtain a more suitable target, we propose to assign the target score according to the content-based neural activation maps of the two images from a pre-trained teacher model, which does not need to have high performance. With plenty of experiments on various vision transformer architectures, we show that our proposed TokenMix helps vision transformers focus on the foreground area to infer the classes and enhances their robustness to occlusion, with consistent performance gains. Notably, we improve DeiT-T/S/B with +1% ImageNet top-1 accuracy. Besides, TokenMix enjoys longer training, which achieves 81.2% top-1 accuracy on ImageNet with DeiT-S trained for 400 epochs. Code is available at https://github.com/Sense-X/TokenMix.

Authors (5)
  1. Jihao Liu (60 papers)
  2. Boxiao Liu (16 papers)
  3. Hang Zhou (166 papers)
  4. Hongsheng Li (340 papers)
  5. Yu Liu (786 papers)
Citations (60)

Summary

An Expert Analysis of "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers"

The paper "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers" explores the limitations of conventional data augmentation techniques like CutMix when applied to transformer-based architectures. Given the inherently different processing paradigms of Convolutional Neural Networks (CNNs) versus Vision Transformers, this paper provides valuable insights into the need for tailored augmentation techniques that exploit the architectural strengths of transformers.

Key Contributions and Methodology

Vision transformers have risen to prominence due to their ability to model long-range dependencies through global attention from the first layer onward. Traditional augmentation strategies such as CutMix, originally devised to push CNNs toward global context, remain applicable to transformers, but their benefits are limited for architectures that already possess a global receptive field. This paper introduces TokenMix, an augmentation technique that mixes images at the granularity of input tokens rather than over a single contiguous image region, which aligns more naturally with how vision transformers process images.

TokenMix mixes two images at the token level: the mixed area is partitioned into multiple separated parts scattered over the image, rather than a single rectangular region as in CutMix. This design aims to better exploit transformers' global receptive fields by encouraging the network to aggregate evidence from tokens spread across the image instead of a localized patch. TokenMix also refines how the learning target is generated: instead of the CutMix rule, where the mixed label is a linear combination of the two ground-truth labels weighted by mixed area, TokenMix assigns target scores using content-based neural activation maps of the two images obtained from a pre-trained teacher model, which does not need to be highly accurate. The resulting labels better reflect the actual visual content of the mixed image, reducing label noise and stabilizing training; a minimal sketch of the procedure follows.
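
The snippet below is an illustrative PyTorch sketch of this idea, not the authors' released implementation (see the linked repository for that). It assumes 224x224 inputs with a 14x14 grid of 16x16-pixel tokens, selects tokens to mix at random instead of the paper's separated block regions, and takes the teacher activation maps `cam_a`/`cam_b` as externally supplied inputs; the function and argument names are illustrative.

```python
# Minimal sketch of token-level mixing with content-based targets.
# Assumptions: 224x224 images, 14x14 token grid, per-class teacher activation
# maps supplied by the caller. Names here are illustrative, not the paper's API.
import torch


def token_mix(img_a, img_b, cam_a, cam_b, label_a, label_b,
              num_classes=1000, mix_ratio=0.5, grid=14, patch=16):
    """Mix img_b into img_a at token granularity.

    img_*  : (3, H, W) images with H = W = grid * patch
    cam_*  : (grid, grid) teacher activation maps for each image's class
    label_*: integer class indices
    """
    num_tokens = grid * grid
    num_mixed = int(round(mix_ratio * num_tokens))

    # Randomly pick token positions to take from img_b. (The paper spreads the
    # mixed area over several separated block regions; random per-token
    # selection is a simplification.)
    perm = torch.randperm(num_tokens)
    token_mask = torch.zeros(num_tokens)
    token_mask[perm[:num_mixed]] = 1.0
    token_mask = token_mask.view(grid, grid)

    # Upsample the token mask to pixel resolution and blend the two images.
    pixel_mask = token_mask.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
    mixed = img_a * (1.0 - pixel_mask) + img_b * pixel_mask

    # Content-based target: weight each label by how much of its teacher
    # activation survives in the mixed image, rather than by mixed area alone.
    score_a = (cam_a * (1.0 - token_mask)).sum() / cam_a.sum().clamp(min=1e-6)
    score_b = (cam_b * token_mask).sum() / cam_b.sum().clamp(min=1e-6)

    target = torch.zeros(num_classes)
    target[label_a] += score_a
    target[label_b] += score_b
    target = target / target.sum().clamp(min=1e-6)  # normalize to a distribution
    return mixed, target


# Usage with random tensors standing in for images and teacher activation maps.
if __name__ == "__main__":
    img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
    cam_a, cam_b = torch.rand(14, 14), torch.rand(14, 14)
    mixed, target = token_mix(img_a, img_b, cam_a, cam_b, label_a=3, label_b=7)
    print(mixed.shape, target.sum())
```

The key difference from CutMix is visible in the target computation: the label weight comes from how much discriminative content (as judged by the teacher's activation map) each source image actually contributes, so a mixed region that covers only background contributes little to its label's score.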

Experimental Results and Implications

The empirical evaluation across multiple transformer architectures, including DeiT, Swin Transformer, and PVT, demonstrates the efficacy of TokenMix. Consistent gains were observed, including roughly +1% top-1 accuracy on ImageNet for the DeiT-T/S/B models, and the method benefits from longer training schedules, with DeiT-S reaching 81.2% top-1 accuracy when trained for 400 epochs. These improvements underscore TokenMix's suitability for vision transformers, helping the models focus on the foreground regions relevant to the target class.

Moreover, the results indicate improved robustness to occlusion, an essential property for real-world scenarios where images are frequently degraded or partially obstructed. By distributing the mixed content across the whole image, the token-level formulation leverages the attention mechanism's capacity to relate distant tokens, encouraging a more comprehensive use of the available evidence. This contributes to better generalization and to gains on tasks beyond image classification, such as semantic segmentation on ADE20K.
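
As a rough illustration of how such occlusion robustness can be probed, the hypothetical snippet below zeroes out a random fraction of 16x16 patches at evaluation time and measures the remaining top-1 accuracy. Here `model` and `loader` stand in for a trained classifier and an ImageNet validation DataLoader; this is an assumed setup, not the authors' evaluation protocol.

```python
# Hypothetical occlusion-robustness probe: mask a random fraction of patches
# at inference time and report the surviving top-1 accuracy.
import torch


@torch.no_grad()
def occluded_accuracy(model, loader, occlusion_ratio=0.5,
                      grid=14, patch=16, device="cuda"):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        b = images.size(0)
        # Keep each patch position with probability (1 - occlusion_ratio).
        keep = torch.rand(b, grid * grid, device=device) >= occlusion_ratio
        mask = keep.float().view(b, 1, grid, grid)
        mask = mask.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
        preds = model(images * mask).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += b
    return correct / total
```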

Theoretical and Practical Impact

Theoretically, TokenMix challenges prevailing augmentation practice by emphasizing the need for architecture-specific strategies, and it opens avenues for research into augmentation techniques tailored to the attention mechanisms at the core of transformer architectures. Practically, it provides a more effective recipe for training vision transformers, which are becoming ubiquitous across vision applications.

In conclusion, TokenMix represents a considerable step forward in the evolution of data augmentation techniques for contemporary vision models. As AI continues to advance, such targeted improvements in training methodologies will be crucial in driving the next generation of models to new heights in robustness and accuracy. Future exploration may focus on further optimizing the neural activation maps for token label generation and investigating other token-level transformations that can leverage the architectural nuances of vision models.