The paper "GLiT: Neural Architecture Search for Global and Local Image Transformer" presents a method for enhancing the architecture of transformers specifically for image recognition tasks through Neural Architecture Search (NAS). While transformers have showcased significant performance improvements in computer vision, their original design is tailored for NLP tasks, which may limit effectiveness in image-related applications.
To address this, the authors introduce a search space and search algorithm designed to improve the visual representation capabilities of transformers. The core innovation of the paper is the locality module, which captures local image correlations at reduced computational cost. This module allows the search to balance global and local information while also optimizing design choices at a low, module-by-module level.
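To make the locality module concrete, below is a minimal sketch in PyTorch, assuming a depthwise-convolution-style formulation over the token sequence; the class name, projections, and default kernel size are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LocalityModule(nn.Module):
    """Illustrative locality module (assumption, not the paper's exact design):
    models local token correlations with a depthwise 1D convolution instead of
    global self-attention. Cost is O(N * k) in sequence length N and kernel
    size k, versus O(N^2) for full self-attention."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        # Depthwise conv mixes each channel only with its local neighbors.
        self.conv = nn.Conv1d(
            dim, dim, kernel_size,
            padding=kernel_size // 2, groups=dim,
        )
        self.act = nn.GELU()
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        h = self.proj_in(x)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)
        return self.proj_out(self.act(h))
```

A searched transformer block can then assign some heads to standard global self-attention and others to such locality modules, which is exactly the global/local trade-off the search space exposes.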
One challenge is that this free-form definition yields a vast search space, making a direct search computationally expensive and complex. The authors therefore propose a hierarchical neural architecture search method that decomposes the problem into two levels: a high-level search over how global and local modules are distributed across the network, followed by a low-level search over the detailed parameters of each module, with each level handled by an evolutionary algorithm. This hierarchical approach navigates the search space efficiently to identify transformer configurations well suited to vision tasks.
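As a rough illustration of the two-level decomposition, the sketch below runs a generic evolutionary loop twice; the encodings, mutation operators, population sizes, and the proxy_accuracy fitness function are all placeholder assumptions rather than the paper's actual procedure.

```python
import random

def evolve(population, fitness, mutate, generations=10):
    """Generic evolutionary loop: score candidates, keep the fittest half,
    and refill the population with mutated copies of survivors."""
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(len(population) - len(survivors))
        ]
    return max(population, key=fitness)

def proxy_accuracy(candidate):
    # Placeholder: in practice each candidate would be scored with shared
    # supernet weights on a validation set, not trained from scratch.
    return random.random()

# --- Level 1: how many heads per block go to the locality module (illustrative). ---
NUM_BLOCKS, NUM_HEADS = 12, 6

def mutate_split(split):
    s = list(split)
    i = random.randrange(NUM_BLOCKS)
    s[i] = random.randint(0, NUM_HEADS)  # local heads in block i
    return tuple(s)

splits = [tuple(random.randint(0, NUM_HEADS) for _ in range(NUM_BLOCKS))
          for _ in range(20)]
best_split = evolve(splits, proxy_accuracy, mutate_split)

# --- Level 2: detailed parameters for the chosen split (illustrative). ---
def mutate_details(details):
    d = dict(details)
    d["kernel_size"] = random.choice([3, 5, 7])  # locality module kernel
    d["mlp_ratio"] = random.choice([2, 3, 4])    # feed-forward expansion
    return d

details = [{"kernel_size": 3, "mlp_ratio": 4} for _ in range(20)]
best_details = evolve(details, proxy_accuracy, mutate_details)
```

Splitting the search this way keeps each stage's candidate pool manageable, which is what makes the evolutionary search tractable over an otherwise free-form space.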
The empirical validation of their approach is extensive. Using the ImageNet dataset, the authors demonstrate that their optimized transformers, named Global and Local Image Transformers (GLiTs), outperform established architectures such as the ResNet family (e.g., ResNet101) and the baseline Vision Transformer (ViT). The results indicate that the GLiTs are more discriminative and efficient, which underscores the efficacy of the proposed NAS method in finding superior transformer architectures for image recognition.