Convolution-enhanced Image Transformer (CeiT)

Updated 29 January 2026
  • Convolution-enhanced Image Transformer (CeiT) is a vision model that merges convolutional mechanisms with Transformers to better capture local spatial details.
  • It employs a convolutional stem, locally-enhanced feed-forward layers, and layer-wise class token attention to improve feature extraction and overall performance.
  • CeiT achieves faster convergence and higher accuracy on benchmarks like ImageNet while maintaining negligible computational overhead compared to traditional ViT/DeiT.

The Convolution-enhanced Image Transformer (CeiT) is a vision model architecture that integrates convolutional mechanisms into Transformer-based image classification pipelines. CeiT was developed to address the limitations of pure Transformers, such as ViT and DeiT, which exhibit deficiencies in learning local spatial structures and often require extensive training data or auxiliary supervision to reach CNN-level accuracy. By incorporating convolutional modules—specifically a convolutional stem for rich low-level feature representation, local mixing feed-forward layers, and a hierarchical class token attention mechanism—CeiT achieves improved data efficiency, faster convergence, and superior accuracy on standard benchmarks, all with negligible computational overhead relative to vanilla Vision Transformers (Yuan et al., 2021).

1. Architectural Foundation and Motivation

CeiT preserves the high-level pipeline of image Transformers such as ViT, comprising an image tokenization stem, a sequence of $L$ Transformer encoder blocks (each with Multi-Head Self-Attention and a feed-forward sub-layer), and a class readout token for prediction. The distinguishing feature is the explicit introduction of convolutional inductive biases, aiming to more faithfully extract and process local image patterns, mirroring the translation invariance and spatial coherence naturally encoded by CNNs.

Pure Transformer architectures for vision tasks tokenize the image into large patches (for instance, $16 \times 16$) and treat all patches as exchangeable, relying on global self-attention and pointwise feed-forward nets. This results in the loss of fine-grained local structures. CeiT circumvents these deficiencies through three main modules: a convolutional Image-to-Tokens stem, Locally-enhanced Feed-Forward layers, and Layer-wise Class token Attention.

2. Image-to-Tokens (I2T) Module

CeiT's input stem diverges from ViT/DeiT's raw patch embedding by employing a convolutional strategy that downsamples and enriches low-level representations prior to patch extraction. The I2T module processes an image $x \in \mathbb{R}^{H \times W \times 3}$ as follows:

$$x' = \mathrm{MaxPool}_{3\times3,\,s=2}\big(\mathrm{BN}\big(\mathrm{Conv}_{7\times7,\,s=2}(x)\big)\big)$$

where $x' \in \mathbb{R}^{(H/4) \times (W/4) \times D}$ with $D=32$ in practical settings. After the stem, non-overlapping spatial patches of size $(P/S) \times (P/S)$ (with $P=16$, $S=4$) are flattened and projected to $C$ dimensions. The process yields patch tokens:

$$X_p = \mathrm{FlattenPatch}(x') \in \mathbb{R}^{N \times ((P/S)^2 \cdot D)}$$

$$Z_p^{(0)} = X_p W_e + b_e, \quad W_e \in \mathbb{R}^{((P/S)^2 \cdot D) \times C}$$

$$Z^{(0)} = [z_c^{(0)};\, Z_p^{(0)}] \in \mathbb{R}^{(N+1) \times C}$$

where $z_c^{(0)}$ is a learnable class token.

Empirical ablation demonstrates that the full I2T configuration (Conv $7\times7$, stride 2 + MaxPool $3\times3$, stride 2 + BN, $D=32$) yields a performance boost of $+1.2\%$ top-1 accuracy over standard raw patch embedding within the DeiT-T backbone (Yuan et al., 2021).
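The shape bookkeeping of the I2T stem can be traced with a short sketch. The values $D=32$, $P=16$, $S=4$ follow the text; the input resolution $H=W=224$ and embedding width $C=192$ are illustrative assumptions (DeiT-T-like):

```python
# Trace the tensor shapes produced by the I2T stem described above.
H = W = 224      # input resolution (illustrative assumption)
D = 32           # stem output channels, per the text
P, S = 16, 4     # original ViT patch size and stem downsampling factor
C = 192          # Transformer embedding width (assumption, DeiT-T uses 192)

# Conv 7x7 stride 2 followed by MaxPool 3x3 stride 2: total downsampling of 4.
h = w = H // 4                       # 56 x 56 feature map
patch = P // S                       # 4: patches of size 4x4 on the stem output
N = (h // patch) * (w // patch)      # number of patch tokens
token_dim = patch * patch * D        # flattened patch dimension before W_e

print(N)          # 196 tokens, same count as ViT with 16x16 patches on raw pixels
print(token_dim)  # 512, projected to C by W_e
```

Note that the token count matches vanilla ViT exactly, so the rest of the Transformer pipeline is unchanged by the stem.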

3. Locally-Enhanced Feed-Forward (LeFF) Layers

CeiT replaces the pointwise feed-forward sub-layer typical of Transformers with a Locally-enhanced Feed-Forward (LeFF) module that injects local spatial awareness through depth-wise convolutions. For the patch tokens $X_p^h \in \mathbb{R}^{N \times C}$ at layer $h$:

$$\begin{aligned}
X_p^{l1} &= \operatorname{GELU}(\operatorname{BN}(X_p^h W_1 + b_1)) \\
X_p^s &= \operatorname{Reshape}\big(X_p^{l1}, (\sqrt{N}, \sqrt{N}, eC)\big), \quad e=4 \\
X_p^d &= \operatorname{GELU}(\operatorname{BN}(\operatorname{DWConv}_{k\times k}(X_p^s))) \\
X_p^f &= \operatorname{Flatten}(X_p^d) \\
X_p^{l2} &= \operatorname{GELU}(\operatorname{BN}(X_p^f W_2 + b_2)) \\
Z^{(h+1)} &= [x_c^h;\, X_p^{l2}]
\end{aligned}$$

Typically, $k=3$ is selected for the kernel size to balance accuracy and efficiency. Empirical studies reveal that LeFF with $k=3$ offers a $+2.1\%$ top-1 accuracy improvement compared to a linear variant ($k=1$). The additional computational cost from the depth-wise convolution (per block: $4k^2NC$ FLOPs) is minor relative to the total cost of the Transformer block.
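The LeFF computation can be sketched in NumPy as follows. This is a minimal illustration that omits BatchNorm and biases, uses a naive loop for the depth-wise convolution, and picks small illustrative dimensions (16 tokens, $C=8$); it is not the paper's implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def leff(x_patch, W1, W2, dw_kernel, k=3):
    """Locally-enhanced Feed-Forward over patch tokens (class token skips this path).
    x_patch: (N, C); W1: (C, eC); W2: (eC, C); dw_kernel: (eC, k, k)."""
    N, C = x_patch.shape
    n = int(np.sqrt(N))
    h = gelu(x_patch @ W1)                    # linear expansion -> (N, eC)
    eC = h.shape[1]
    grid = h.reshape(n, n, eC)                # restore the spatial layout
    pad = k // 2
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(grid)
    for c in range(eC):                       # depth-wise conv: one kernel per channel
        for i in range(n):
            for j in range(n):
                out[i, j, c] = np.sum(padded[i:i+k, j:j+k, c] * dw_kernel[c])
    h = gelu(out.reshape(N, eC))              # flatten back to a token sequence
    return gelu(h @ W2)                       # linear projection back to C

# Illustrative dimensions: 16 tokens (a 4x4 grid), C=8, expansion e=4
rng = np.random.default_rng(0)
N, C, e, k = 16, 8, 4, 3
y = leff(rng.normal(size=(N, C)),
         rng.normal(size=(C, e * C)) * 0.1,
         rng.normal(size=(e * C, C)) * 0.1,
         rng.normal(size=(e * C, k, k)) * 0.1, k)
print(y.shape)  # (16, 8): token count and width are preserved
```

The key design point the sketch makes concrete: the only change relative to a standard FFN is the reshape-convolve-flatten detour, so neighboring tokens mix locally while the input/output interface of the sub-layer stays identical.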

4. Layer-wise Class Token Attention (LCA)

While standard ViT and DeiT utilize only the final-layer class token for prediction, CeiT fuses class tokens from all Transformer layers using Layer-wise Class token Attention (LCA). Given the collection of class tokens from every layer

$$X_C = [z_c^{(1)}, z_c^{(2)}, \dots, z_c^{(L)}] \in \mathbb{R}^{L \times C}$$

LCA treats the final class token $z_c^{(L)}$ as the query and computes attention over the class tokens of all layers:

$$\begin{aligned}
Q &= z_c^{(L)} W_Q \\
K &= X_C W_K \\
V &= X_C W_V \\
A &= \operatorname{softmax}\!\left(\frac{QK^T}{\sqrt{C}}\right)V \\
\hat{z}_c^{(L)} &= \operatorname{LN}(z_c^{(L)} + A) \\
z_c^{\text{out}} &= \operatorname{LN}\big(\hat{z}_c^{(L)} + \mathrm{FFN}(\hat{z}_c^{(L)})\big)
\end{aligned}$$

This approach enriches the top-level class representation with hierarchical context at an additional cost of $O(LC^2)$, which is negligible relative to the $O(N^2C)$ budget for self-attention across $L$ layers. Experimental results indicate a $+0.6\%$ gain in top-1 accuracy when using LCA versus relying solely on the final class token.
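The core attention step of LCA can be sketched in NumPy. This is a single-head illustration without the LayerNorm/FFN residual wrapper, with small illustrative dimensions ($L=12$ layers, $C=16$):

```python
import numpy as np

def lca_attention(class_tokens, Wq, Wk, Wv):
    """Layer-wise Class token Attention: the last-layer class token queries
    the class tokens collected from all L layers.
    class_tokens: (L, C); Wq, Wk, Wv: (C, C)."""
    L, C = class_tokens.shape
    q = class_tokens[-1] @ Wq                 # (C,) query from the final class token
    K = class_tokens @ Wk                     # (L, C) keys, one per layer
    V = class_tokens @ Wv                     # (L, C) values, one per layer
    scores = K @ q / np.sqrt(C)               # (L,) scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the L layers
    return weights @ V                        # (C,) aggregated class representation

rng = np.random.default_rng(0)
L, C = 12, 16                                 # illustrative: 12 layers, width 16
out = lca_attention(rng.normal(size=(L, C)),
                    rng.normal(size=(C, C)),
                    rng.normal(size=(C, C)),
                    rng.normal(size=(C, C)))
print(out.shape)  # (16,)
```

Because the sequence length here is the number of layers $L$ rather than the number of tokens $N$, this attention is far cheaper than a regular self-attention block.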

5. Training Protocol and Optimization Dynamics

CeiT is trained on ImageNet-1K without recourse to external datasets or CNN teacher signals. The optimization pipeline includes random resized cropping, horizontal flipping, RandAugment, Mixup ($\alpha=0.8$), CutMix ($\alpha=1.0$), RepeatAugment, and label smoothing ($0.1$). The AdamW optimizer with weight decay $0.05$ is used. The learning rate is scheduled with a 5-epoch linear warmup and cosine decay, starting at $1 \times 10^{-3}$, for 300 epochs. For fine-tuning at higher resolution ($384 \times 384$), a learning rate of $5 \times 10^{-6}$ for 30 epochs is used.
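The warmup-plus-cosine schedule described above can be expressed as a short function. The base rate of $10^{-3}$, 5 warmup epochs, and 300 total epochs follow the text; the exact ramp shape and floor value are generic assumptions, not the paper's code:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup=5, total=300, min_lr=0.0):
    """Linear warmup for the first `warmup` epochs, then cosine decay to min_lr."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup     # linear ramp up to base_lr
    progress = (epoch - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_epoch(0))    # 2e-4: first warmup epoch
print(lr_at_epoch(4))    # 1e-3: warmup complete, cosine decay begins
print(lr_at_epoch(299))  # ~0: end of the 300-epoch cosine decay
```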

CeiT models achieve convergence at $3\times$ faster rates compared to DeiT: CeiT-T matches DeiT-T's 300-epoch accuracy in only 100 epochs (72.2% vs 65.3% top-1 accuracy). CeiT-B at 100 epochs matches DeiT-B at 300 epochs (81.8% top-1) (Yuan et al., 2021).

6. Empirical Performance and Comparative Evaluation

On ImageNet-1K (single-crop validation), the results are as follows:

| Model | Params | FLOPs | Top-1 |
|-----------|-------|------|-------|
| ResNet-50 | 25.6M | 4.1G | 76.7% |
| DeiT-T | 5.7M | 1.2G | 72.2% |
| DeiT-S | 22.1M | 4.5G | 79.9% |
| CeiT-T | 6.4M | 1.2G | 76.4% |
| CeiT-S | 24.2M | 4.5G | 82.0% |

CeiT-T nearly matches the accuracy of ResNet-50 at one-quarter of its computational footprint. CeiT-S (comparable in size to ResNet-50) outperforms both ResNet-50 and DeiT-S by $+5.3\%$ and $+2.1\%$ top-1, respectively. On seven downstream classification tasks, including iNaturalist'18/19, Cars, Flowers, Pets, and CIFAR-10/100, CeiT-S (224) surpasses DeiT-B, while CeiT-S@384 sets state-of-the-art scores on most splits (e.g., Cars: $94.1\%$, CIFAR-100: $90.8\%$).

7. Computational Analysis and Ablations

The convolutional components introduce only marginal computational overhead:

  • I2T stem: Feature extraction FLOPs are $\sim 1.1\times$ those of ViT's patch embedding, as the feature map is downsampled by $4\times$.
  • LeFF: Additional depth-wise convolution costs (proportional to $4k^2NC$ per block) are minimal compared to the $8(N+1)C^2$ FLOPs of the pointwise FFN in practice.
  • LCA: The $O(LC^2)$ scaling is negligible relative to attention's $O(N^2C)$ budget.
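The LeFF overhead figure can be checked with a quick calculation, using the two FLOP formulas above. The dimensions $N=196$ and $C=384$ are illustrative DeiT-S-like values, not stated in the text:

```python
# Extra depth-wise conv cost per block vs. the pointwise FFN cost,
# using illustrative DeiT-S-like dimensions (N=196 tokens, C=384, k=3).
N, C, k = 196, 384, 3
leff_extra = 4 * k**2 * N * C        # 4 k^2 N C, from the LeFF analysis
ffn = 8 * (N + 1) * C**2             # 8 (N+1) C^2, the pointwise FFN cost

print(leff_extra)                     # 2709504 FLOPs
print(ffn)                            # 232390656 FLOPs
print(f"{leff_extra / ffn:.1%}")      # 1.2% overhead
```

Under these assumptions the depth-wise convolution adds on the order of one percent to the feed-forward cost of each block, consistent with the "negligible overhead" claim.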

Ablations confirm each module's effectiveness: using the full I2T increases top-1 by $+1.2\%$ over the baseline; increasing the LeFF kernel size to $3$ or $5$ yields an extra $+2.1\%$–$+2.2\%$; incorporating LCA provides a further $+0.6\%$ improvement.

By combining lightweight convolutional stems, local mixing feed-forward networks, and hierarchical class token aggregation, CeiT recovers key inductive biases of CNNs, enabling better convergence, higher data efficiency, and improved accuracy relative to both ViT/DeiT and contemporary CNNs, while only marginally increasing computational requirements (Yuan et al., 2021).
