Locality Guidance for Enhancing Vision Transformers on Limited Data
Vision Transformers (VTs) have attracted considerable attention in computer vision thanks to their success in processing visual data via self-attention. However, pure VT architectures struggle on small datasets, and this paper presents a methodology that addresses this limitation by integrating locality guidance.
Problem Analysis
The self-attention mechanism at the core of VTs is inherently global, which makes it difficult for these models to learn local information when training data is scarce. The paper identifies efficient extraction of local information as a key capability that VTs lack when trained on limited data. The proposed solution is inspired by the hierarchical local-to-global information processing of Convolutional Neural Networks (CNNs), which allows CNNs to understand images effectively even on smaller datasets.
Proposed Method
The authors propose a straightforward yet effective locality guidance technique for VTs based on knowledge distillation from CNNs. A lightweight, already trained CNN guides the VT in learning local information: during training, the VT mimics the CNN's intermediate features, transferring the hierarchical locality characteristics inherent in CNNs to the VT.
Key features of the method:
- Dual-task learning paradigm: VTs learn both from locality-guided CNN features and through direct supervision for classification tasks.
- Implementation simplicity: The locality guidance requires no structural changes to VTs; it functions as an auxiliary feature alignment mechanism used only during training and is discarded at inference.
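The dual-task objective above can be sketched as a weighted sum of a classification loss and a feature-mimicking loss. The NumPy sketch below is illustrative only: the L2-normalized mean-squared feature alignment and the weight `alpha` are plausible assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, label):
    # Standard softmax cross-entropy for the supervised classification task.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def locality_guidance_loss(vt_feat, cnn_feat):
    # Align the VT's intermediate features with those of a pre-trained
    # lightweight CNN. L2 normalization before the squared error is one
    # plausible choice, assumed here for illustration.
    vt = vt_feat / np.linalg.norm(vt_feat)
    cnn = cnn_feat / np.linalg.norm(cnn_feat)
    return float(((vt - cnn) ** 2).sum())

def total_loss(logits, label, vt_feat, cnn_feat, alpha=1.0):
    # Dual-task objective: direct supervision plus feature-mimicking
    # guidance from the CNN teacher, weighted by a hypothetical alpha.
    return cross_entropy(logits, label) + alpha * locality_guidance_loss(vt_feat, cnn_feat)
```

Because the guidance term is purely an auxiliary loss, the CNN branch can be dropped entirely once training ends, which is what makes the method architecture-agnostic.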
Results
Extensive evaluations are performed across multiple VT architectures (such as DeiT, T2T, PVT, PiT, PVTv2, and ConViT) on datasets like CIFAR-100, Oxford Flowers, and Chaoyang, showcasing the adaptability of the method. The locality guidance strategy provides substantial performance improvements:
- Performance gains are notable on tiny datasets—e.g., improvements of 13.07% for DeiT on CIFAR-100.
- The approach enables VTs to reach, and often exceed, performance levels of baseline CNN models, reaffirming its potential as a VT enhancement strategy for limited-scale datasets.
- The method demonstrates efficiency by significantly accelerating the convergence of VTs—results are comparable even when the training schedule is reduced by two-thirds.
Comparisons and Insights
The paper compares its locality guidance method against several other strategies. Compared to Liu et al.'s self-supervised auxiliary task and DeiT's distillation approach, this method enhances locality learning more effectively thanks to its hierarchical feature alignment mechanism. Furthermore, attention statistics of models trained with and without locality guidance show improved local information processing in the guided VTs, akin to that seen in CNNs.
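One plausible attention statistic of the kind discussed above is the mean attention distance: the average spatial distance between a query token and the tokens it attends to, weighted by attention probability. The sketch below is an illustrative reconstruction, not necessarily the paper's exact metric; a VT that has learned locality would be expected to show smaller distances in early layers.

```python
import numpy as np

def mean_attention_distance(attn, grid):
    # attn: (N, N) row-stochastic attention matrix over N image tokens.
    # grid: side length of the square token grid, with N == grid * grid.
    coords = np.array([(i // grid, i % grid) for i in range(grid * grid)], float)
    # Pairwise Euclidean distances between token positions on the grid.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Attention-weighted average distance per query, then averaged over queries.
    return float((attn * dists).sum(axis=1).mean())
```

For example, an identity attention matrix (every token attends only to itself) yields a distance of zero, while uniform attention over all tokens yields the maximum possible value for that grid.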
Implications and Future Work
The findings underscore the importance of incorporating locality features within globally focused models like VTs, especially when large datasets are unavailable for pre-training. This locality-guided technique has implications beyond immediate performance improvements; it fosters the application of VTs in domains constrained by dataset size, like medical imaging, where collecting large datasets is infeasible.
Future avenues for research include examining variations in CNN architectures for locality guidance and exploring integration pathways that could lead to hybrid models combining the merits of CNNs and VTs seamlessly. There is also scope to extend this approach to other domains such as temporal data learning in video analysis using VTs.