
Locality Guidance for Improving Vision Transformers on Tiny Datasets (2207.10026v1)

Published 20 Jul 2022 in cs.CV

Abstract: While the Vision Transformer (VT) architecture is becoming trendy in computer vision, pure VT models perform poorly on tiny datasets. To address this issue, this paper proposes the locality guidance for improving the performance of VTs on tiny datasets. We first analyze that the local information, which is of great importance for understanding images, is hard to be learned with limited data due to the high flexibility and intrinsic globality of the self-attention mechanism in VTs. To facilitate local information, we realize the locality guidance for VTs by imitating the features of an already trained convolutional neural network (CNN), inspired by the built-in local-to-global hierarchy of CNN. Under our dual-task learning paradigm, the locality guidance provided by a lightweight CNN trained on low-resolution images is adequate to accelerate the convergence and improve the performance of VTs to a large extent. Therefore, our locality guidance approach is very simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets. Extensive experiments demonstrate that our method can significantly improve VTs when training from scratch on tiny datasets and is compatible with different kinds of VTs and datasets. For example, our proposed method can boost the performance of various VTs on tiny datasets (e.g., 13.07% for DeiT, 8.98% for T2T and 7.85% for PVT), and enhance even stronger baseline PVTv2 by 1.86% to 79.30%, showing the potential of VTs on tiny datasets. The code is available at https://github.com/lkhl/tiny-transformers.

Locality Guidance for Enhancing Vision Transformers on Limited Data

Vision Transformers (VTs) have garnered considerable attention in computer vision thanks to their success in processing visual data via self-attention. However, pure VT architectures struggle on small datasets, and this paper presents a noteworthy methodology that addresses this limitation through locality guidance.

Problem Analysis

The self-attention mechanism intrinsic to VTs is inherently global, making it difficult for these models to learn local information when training data is scarce. The paper identifies this missing local information as the pivotal weakness of VTs trained on limited data. The proposed solution is inspired by the hierarchical local-to-global information processing of Convolutional Neural Networks (CNNs), which enables CNNs to understand images effectively even with smaller datasets.

Proposed Method

The authors propose a straightforward yet effective locality guidance technique that distills knowledge from CNNs into VTs. A lightweight CNN, pre-trained on low-resolution images, guides the VT toward learning local information: the VT imitates the intermediate features of the already trained CNN, transferring the hierarchical locality characteristics inherent in CNNs.

Key features of the method:

  • Dual-task learning paradigm: VTs learn both from locality-guided CNN features and through direct supervision on the classification task.
  • Implementation simplicity: The locality guidance approach requires no structural changes to VTs, functioning as an auxiliary feature-alignment objective used only during training.
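The dual-task paradigm above can be sketched as a combined objective: the VT's usual classification loss plus a feature-imitation term that pulls its token features toward the CNN's features. The sketch below is illustrative only; the function and variable names are ours, not the authors' identifiers, and it assumes the two feature maps have already been projected to a common shape.

```python
import numpy as np

def locality_guidance_loss(vt_feats, cnn_feats, logits, label, weight=1.0):
    """Dual-task objective (sketch): cross-entropy on the VT's logits plus
    an MSE feature-imitation term toward the pre-trained CNN's features.
    Assumes vt_feats and cnn_feats are already spatially aligned arrays."""
    # Numerically stable softmax cross-entropy for the classification task.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    ce = -np.log(probs[label])
    # Feature-imitation term: mean-squared error between feature maps.
    align = np.mean((vt_feats - cnn_feats) ** 2)
    return ce + weight * align
```

Because the guidance is a pure loss term, it can be dropped at inference time, which is what makes the method structure-free for the VT.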

Results

Extensive evaluations are performed across multiple VT architectures (such as DeiT, T2T, PVT, PiT, PVTv2, and ConViT) on datasets like CIFAR-100, Oxford Flowers, and Chaoyang, showcasing the adaptability of the method. The locality guidance strategy provides substantial performance improvements:

  • Performance gains are notable on tiny datasets: e.g., an improvement of 13.07% for DeiT on CIFAR-100.
  • The approach enables VTs to reach, and often exceed, performance levels of baseline CNN models, reaffirming its potential as a VT enhancement strategy for limited-scale datasets.
  • The method demonstrates efficiency by significantly accelerating the convergence of VTs: results remain comparable even when the training schedule is reduced by two-thirds.

Comparisons and Insights

The paper compares its locality guidance method against several other strategies. Relative to Liu et al.'s self-supervised auxiliary task and the distillation approach of DeiT, this method enhances locality learning more effectively thanks to its hierarchical feature-alignment mechanism. Furthermore, attention statistics for models trained with and without locality guidance show improved local information processing within VTs, akin to that seen in CNNs.
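A common way to quantify such locality in attention statistics is the attention-weighted mean distance between query and key patches: lower values mean more local attention. The sketch below is our illustrative reconstruction of this kind of statistic, not the authors' exact metric; the function name and grid layout are assumptions.

```python
import numpy as np

def mean_attention_distance(attn, grid):
    """Locality statistic (sketch): attn is an (N, N) attention matrix over
    N = grid * grid patch tokens, each row summing to 1. Returns the
    attention-weighted mean Euclidean distance between patch positions."""
    # 2-D coordinates of each patch token on the grid.
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)], float)
    # Pairwise distances between every query patch and every key patch.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Weight each query's distances by its attention row, then average.
    return float((attn * dists).sum(axis=1).mean())
```

Under this statistic, purely diagonal attention (each token attending only to itself) scores 0, while uniform attention over all patches scores the average pairwise distance, so a drop in the statistic after training with locality guidance indicates more CNN-like local processing.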

Implications and Future Work

The findings underscore the importance of incorporating locality features within globally focused models like VTs, especially when large datasets are unavailable for pre-training. This locality-guided technique has implications beyond immediate performance improvements; it fosters the application of VTs in domains constrained by dataset size, like medical imaging, where collecting large datasets is infeasible.

Future avenues for research include examining variations in CNN architectures for locality guidance and exploring integration pathways that could lead to hybrid models combining the merits of CNNs and VTs seamlessly. There is also scope to extend this approach to other domains such as temporal data learning in video analysis using VTs.

Authors (6)
  1. Kehan Li
  2. Runyi Yu
  3. Zhennan Wang
  4. Li Yuan
  5. Guoli Song
  6. Jie Chen
Citations (37)