Convolution-enhanced Image Transformer (CeiT)
- Convolution-enhanced Image Transformer (CeiT) is a vision model that merges convolutional mechanisms with Transformers to better capture local spatial details.
- It employs a convolutional stem, locally-enhanced feed-forward layers, and layer-wise class token attention to improve feature extraction and overall performance.
- CeiT achieves faster convergence and higher accuracy on benchmarks like ImageNet while maintaining negligible computational overhead compared to traditional ViT/DeiT.
The Convolution-enhanced Image Transformer (CeiT) is a vision model architecture that integrates convolutional mechanisms into Transformer-based image classification pipelines. CeiT was developed to address the limitations of pure Transformers, such as ViT and DeiT, which exhibit deficiencies in learning local spatial structures and often require extensive training data or auxiliary supervision to reach CNN-level accuracy. By incorporating convolutional modules—specifically a convolutional stem for rich low-level feature representation, local mixing feed-forward layers, and a hierarchical class token attention mechanism—CeiT achieves improved data efficiency, faster convergence, and superior accuracy on standard benchmarks, all with negligible computational overhead relative to vanilla Vision Transformers (Yuan et al., 2021).
1. Architectural Foundation and Motivation
CeiT preserves the high-level pipeline of image Transformers such as ViT, comprising an image tokenization stem, a sequence of Transformer encoder blocks (each with Multi-Head Self-Attention and a feed-forward sub-layer), and a class readout token for prediction. The distinguishing feature is the explicit introduction of convolutional inductive biases, aiming to more faithfully extract and process local image patterns, mirroring the translation invariance and spatial coherence naturally encoded by CNNs.
Pure Transformer architectures for vision tasks tokenize the image into large patches (for instance, $16 \times 16$) and treat all patches as exchangeable, relying on global self-attention and pointwise feed-forward nets. This results in the loss of fine-grained local structures. CeiT circumvents these deficiencies through three main modules: a convolutional Image-to-Tokens stem, Locally-enhanced Feed-Forward layers, and Layer-wise Class token Attention.
2. Image-to-Tokens (I2T) Module
CeiT's input stem diverges from ViT/DeiT's raw patch embedding by employing a convolutional strategy that effectively downsamples and enriches low-level representations prior to patch extraction. The I2T module processes an image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ as follows:
$$\mathbf{x}' = \mathrm{I2T}(\mathbf{x}) = \mathrm{MaxPool}(\mathrm{BN}(\mathrm{Conv}(\mathbf{x})))$$
where $\mathbf{x}' \in \mathbb{R}^{\frac{H}{S} \times \frac{W}{S} \times D}$ with $S = 4$ in practical settings. After the stem, non-overlapping spatial patches of size $\frac{P}{S} \times \frac{P}{S}$ (with $P = 16$, $S = 4$) are flattened and projected to $C$ dimensions. The process yields $N = \frac{HW}{P^2}$ patch tokens:
$$\mathbf{x}_t = \left[\mathbf{x}_{\mathrm{class}};\, \mathbf{x}^1_p E;\, \mathbf{x}^2_p E;\, \dots;\, \mathbf{x}^N_p E\right]$$
where $\mathbf{x}_{\mathrm{class}}$ is a learnable class token and $E$ is the patch projection matrix.
Empirical ablation demonstrates that the full I2T configuration (Conv $7\times7$, stride 2 + MaxPool $3\times3$, stride 2 + BN) yields a clear top-1 accuracy boost over standard raw patch embedding within the DeiT-T backbone (Yuan et al., 2021).
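As a sanity check on the stem arithmetic above, the following Python sketch (helper names are illustrative, sizes for a standard 224×224 input) confirms that the I2T route with $S=4$ produces exactly as many tokens as ViT's raw $16\times16$ patchify:

```python
# Sketch: compare ViT's raw 16x16 patchify with CeiT's I2T stem
# (Conv7x7 stride 2 -> MaxPool3x3 stride 2, total downsample S = 4).

def vit_tokens(h, w, patch=16):
    """Raw ViT patch embedding: a non-overlapping (h/patch) x (w/patch) grid."""
    return (h // patch) * (w // patch)

def ceit_tokens(h, w, patch=16, stem_stride=4):
    """CeiT I2T: downsample by S, then take (patch/S) x (patch/S) patches."""
    fh, fw = h // stem_stride, w // stem_stride   # 56 x 56 feature map
    p = patch // stem_stride                      # 4 x 4 patches on the map
    return (fh // p) * (fw // p)

print(vit_tokens(224, 224))   # 196 patch tokens
print(ceit_tokens(224, 224))  # 196 -- same sequence length, richer features
```

The matching token count is what lets the rest of the Transformer pipeline stay unchanged.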
3. Locally-Enhanced Feed-Forward (LeFF) Layers
CeiT replaces the pointwise feed-forward sub-layer typical of Transformers with a Locally-enhanced Feed-Forward (LeFF) module that injects local spatial awareness through depth-wise convolutions. For the patch tokens at layer $l$, LeFF (i) splits off the class token, (ii) expands the patch tokens with a linear projection, (iii) restores them to their 2-D spatial layout, (iv) applies a $k \times k$ depth-wise convolution (each step followed by BN and an activation), then (v) flattens the result, projects it back to $C$ dimensions, and re-concatenates the class token, which bypasses the convolution.
Typically, $k = 3$ is selected for the kernel size to balance accuracy and efficiency. Empirical studies reveal that LeFF with $k = 3$ improves top-1 accuracy compared to the purely linear (vanilla FFN) variant. The additional computational cost from the depth-wise convolution (proportional to $N \cdot k^2 \cdot eC$ FLOPs per block, with expansion ratio $e$) is minor relative to the total cost of the Transformer block.
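The LeFF token flow can be sketched in NumPy; this is a minimal illustration with assumed sizes ($C = 64$, expansion $e = 4$, a $14 \times 14$ patch grid), random weights in place of learned ones, ReLU standing in for GELU, and BN omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv3x3(x):
    """Per-channel 3x3 convolution with zero padding; x has shape (H, W, C)."""
    H, W, C = x.shape
    w = rng.standard_normal((3, 3, C)) * 0.01  # one 3x3 filter per channel
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i+H, j:j+W, :] * w[i, j, :]
    return out

def leff(tokens, C=64, e=4, grid=14):
    """tokens: (1 + N, C) with the class token first; N = grid * grid."""
    cls, patches = tokens[:1], tokens[1:]
    W1 = rng.standard_normal((C, e * C)) * 0.01   # expansion projection
    W2 = rng.standard_normal((e * C, C)) * 0.01   # reduction projection
    h = np.maximum(patches @ W1, 0)               # expand (ReLU for GELU here)
    h = h.reshape(grid, grid, e * C)              # restore 2-D spatial layout
    h = np.maximum(depthwise_conv3x3(h), 0)       # local mixing
    h = h.reshape(grid * grid, e * C) @ W2        # flatten + reduce
    return np.concatenate([cls, h], axis=0)       # class token bypasses conv

tokens = rng.standard_normal((1 + 14 * 14, 64))
out = leff(tokens)
print(out.shape)  # (197, 64)
```

Note that the class token is carried through untouched, matching the design in which only patch tokens receive spatial mixing.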
4. Layer-wise Class Token Attention (LCA)
While standard ViT and DeiT utilize only the final-layer class token for prediction, CeiT fuses class tokens from all Transformer layers using Layer-wise Class token Attention (LCA). Given the collection of intermediate class tokens
$$X_{\mathrm{class}} = \left[\mathbf{x}^{(1)}_{\mathrm{class}},\, \mathbf{x}^{(2)}_{\mathrm{class}},\, \dots,\, \mathbf{x}^{(L)}_{\mathrm{class}}\right],$$
LCA treats the final class token $\mathbf{x}^{(L)}_{\mathrm{class}}$ as the sole query and computes attention over the class tokens of all $L$ layers:
$$\mathbf{z} = \mathrm{softmax}\!\left(\frac{q K^{\top}}{\sqrt{C}}\right) V, \qquad q = \mathbf{x}^{(L)}_{\mathrm{class}} W_Q,\; K = X_{\mathrm{class}} W_K,\; V = X_{\mathrm{class}} W_V.$$
This approach enriches the top-level class representation with hierarchical context at an additional cost that grows only linearly with the depth $L$, which is negligible relative to the $O(L \cdot N^2 \cdot C)$ budget for self-attention across all layers. Experimental results indicate a consistent gain in top-1 accuracy when using LCA versus relying solely on the final class token.
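The single-query attention at the heart of LCA can be sketched in a few lines of NumPy (projections folded away for brevity; sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def lca(class_tokens):
    """Layer-wise class token attention sketch: the final layer's class token
    is the sole query; keys/values are the class tokens of all L layers.
    class_tokens: (L, C). Learned projections omitted (assumption)."""
    L, C = class_tokens.shape
    q = class_tokens[-1:]                       # (1, C) -- final class token
    scores = q @ class_tokens.T / np.sqrt(C)    # (1, L) similarity scores
    attn = np.exp(scores - scores.max())        # stable softmax over L tokens
    attn /= attn.sum()
    return attn @ class_tokens                  # (1, C) fused representation

tokens = rng.standard_normal((12, 192))         # e.g. 12 layers, C = 192
print(lca(tokens).shape)  # (1, 192)
```

Because there is only one query, the score matrix is $1 \times L$ rather than $N \times N$, which is why the cost is linear in depth.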
5. Training Protocol and Optimization Dynamics
CeiT is trained on ImageNet-1K without recourse to external datasets or CNN teacher signals. The optimization pipeline includes random resized cropping, horizontal flipping, RandAugment, Mixup, CutMix, RepeatAugment, and label smoothing ($0.1$). The AdamW optimizer with weight decay $0.05$ is used. The learning rate is scheduled with a 5-epoch linear warmup followed by cosine decay over 300 epochs. For fine-tuning at higher resolution ($384 \times 384$), a reduced learning rate is used for 30 epochs.
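The warmup-plus-cosine schedule can be sketched as follows; the base learning rate of 5e-4 is an assumed illustrative value, not taken from the text:

```python
import math

def lr_at(epoch, base_lr=5e-4, warmup=5, total=300):
    """Linear warmup for `warmup` epochs, then cosine decay toward zero
    by `total`. base_lr = 5e-4 is an assumed illustrative value."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))

print(round(lr_at(0), 6))    # 0.0001 -- first warmup step
print(round(lr_at(4), 6))    # 0.0005 -- warmup complete at epoch 5
print(round(lr_at(299), 8))  # ~0 at the end of training
```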
CeiT models converge substantially faster than DeiT: CeiT-T matches DeiT-T's 300-epoch accuracy in only 100 epochs (72.2% top-1, versus 65.3% for DeiT-T at the same 100-epoch budget). CeiT-B at 100 epochs matches DeiT-B at 300 epochs (81.8% top-1) (Yuan et al., 2021).
6. Empirical Performance and Comparative Evaluation
On ImageNet-1K (single-crop validation), the results are as follows:
| Model | Params | FLOPs | Top-1 |
|---|---|---|---|
| ResNet-50 | 25.6M | 4.1G | 76.7% |
| DeiT-T | 5.7M | 1.2G | 72.2% |
| DeiT-S | 22.1M | 4.5G | 79.9% |
| CeiT-T | 6.4M | 1.2G | 76.4% |
| CeiT-S | 24.2M | 4.5G | 82.0% |
CeiT-T nearly matches the accuracy of ResNet-50 at roughly one-quarter of its computational footprint (1.2G vs. 4.1G FLOPs). CeiT-S (comparable in size to ResNet-50) outperforms both ResNet-50 and DeiT-S by $5.3\%$ and $2.1\%$ top-1, respectively. On seven downstream classification tasks, including iNaturalist’18/19, Cars, Flowers, Pets, and CIFAR-10/100, CeiT-S (224) surpasses DeiT-B, while CeiT-S@384 sets state-of-the-art scores on most splits, including Cars and CIFAR-100.
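The headline comparisons follow directly from the table; a few lines of arithmetic (values copied from the table above, no model evaluation involved) make them explicit:

```python
# Top-1 accuracy (%) and FLOPs (G), copied from the ImageNet-1K table.
top1 = {"ResNet-50": 76.7, "DeiT-T": 72.2, "DeiT-S": 79.9,
        "CeiT-T": 76.4, "CeiT-S": 82.0}
flops = {"ResNet-50": 4.1, "CeiT-T": 1.2}

print(round(top1["CeiT-S"] - top1["ResNet-50"], 1))   # 5.3 points over ResNet-50
print(round(top1["CeiT-S"] - top1["DeiT-S"], 1))      # 2.1 points over DeiT-S
print(round(flops["CeiT-T"] / flops["ResNet-50"], 2)) # 0.29 -- about 1/4 the FLOPs
```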
7. Computational Analysis and Ablations
The convolutional components introduce only marginal computational overhead:
- I2T stem: feature-extraction FLOPs remain on par with those of ViT's patch embedding, as the feature map is downsampled by a factor of $4$ (stride-2 convolution followed by stride-2 max-pooling) before patch extraction.
- LeFF: the additional depth-wise convolution cost (proportional to $N \cdot k^2 \cdot eC$ per block) is minimal compared to the FLOPs of the pointwise FFN in practice.
- LCA: attends over only $L$ class tokens with a single query, a cost linear in depth and negligible relative to self-attention's $O(L \cdot N^2 \cdot C)$ budget.
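A back-of-envelope calculation illustrates the LeFF overhead point; the sizes ($N = 196$ tokens, $C = 192$ channels, expansion $e = 4$, $k = 3$) are assumed DeiT-T-like values, not figures from the paper:

```python
# Back-of-envelope FLOPs for one Transformer block's feed-forward path:
# the depth-wise convolution added by LeFF is tiny next to the pointwise FFN.
N, C, e, k = 196, 192, 4, 3  # assumed DeiT-T-like sizes

ffn = 2 * N * C * (e * C)    # two pointwise projections: C -> eC and eC -> C
dwconv = N * (e * C) * k * k # depth-wise k x k over the expanded channels

print(f"FFN: {ffn/1e6:.1f} MFLOPs, DW-conv: {dwconv/1e6:.2f} MFLOPs")
print(f"overhead: {100 * dwconv / ffn:.1f}%")
```

Under these assumptions the depth-wise convolution adds only a few percent on top of the FFN cost, consistent with the "minimal overhead" claim above.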
Ablations confirm each module’s effectiveness: using the full I2T stem raises top-1 accuracy over the raw-patch baseline; increasing the LeFF kernel size to $3$ or $5$ yields a further gain over the purely linear variant; incorporating LCA provides an additional improvement.
By combining lightweight convolutional stems, local mixing feed-forward networks, and hierarchical class token aggregation, CeiT recovers key inductive biases of CNNs, enabling better convergence, higher data efficiency, and improved accuracy relative to both ViT/DeiT and contemporary CNNs, while only marginally increasing computational requirements (Yuan et al., 2021).