Convolution-enhanced Image Transformer (CeiT)

Updated 29 January 2026
  • Convolution-enhanced Image Transformer (CeiT) is a vision model that merges convolutional mechanisms with Transformers to better capture local spatial details.
  • It employs a convolutional stem, locally-enhanced feed-forward layers, and layer-wise class token attention to improve feature extraction and overall performance.
  • CeiT achieves faster convergence and higher accuracy on benchmarks like ImageNet while maintaining negligible computational overhead compared to traditional ViT/DeiT.

The Convolution-enhanced Image Transformer (CeiT) is a vision model architecture that integrates convolutional mechanisms into Transformer-based image classification pipelines. CeiT was developed to address the limitations of pure Transformers, such as ViT and DeiT, which exhibit deficiencies in learning local spatial structures and often require extensive training data or auxiliary supervision to reach CNN-level accuracy. By incorporating convolutional modules—specifically a convolutional stem for rich low-level feature representation, local mixing feed-forward layers, and a hierarchical class token attention mechanism—CeiT achieves improved data efficiency, faster convergence, and superior accuracy on standard benchmarks, all with negligible computational overhead relative to vanilla Vision Transformers (Yuan et al., 2021).

1. Architectural Foundation and Motivation

CeiT preserves the high-level pipeline of image Transformers such as ViT, comprising an image tokenization stem, a sequence of $L$ Transformer encoder blocks (each with Multi-Head Self-Attention and a feed-forward sub-layer), and a class readout token for prediction. The distinguishing feature is the explicit introduction of convolutional inductive biases, aiming to more faithfully extract and process local image patterns, mirroring the translation invariance and spatial coherence naturally encoded by CNNs.

Pure Transformer architectures for vision tasks tokenize the image into large patches (for instance, $16 \times 16$) and treat all patches as exchangeable, relying on global self-attention and pointwise feed-forward nets. This results in the loss of fine-grained local structures. CeiT circumvents these deficiencies through three main modules: a convolutional Image-to-Tokens stem, Locally-enhanced Feed-Forward layers, and Layer-wise Class token Attention.

2. Image-to-Tokens (I2T) Module

CeiT's input stem diverges from ViT/DeiT's raw patch embedding by employing a convolutional strategy that downsamples and enriches low-level representations prior to patch extraction. The I2T module processes an image $x \in \mathbb{R}^{H \times W \times 3}$ as follows:

$$x' = \mathrm{MaxPool}_{3\times3,\,s=2}\big(\mathrm{BN}\big(\mathrm{Conv}_{7\times7,\,s=2}(x)\big)\big)$$

where $x' \in \mathbb{R}^{(H/4) \times (W/4) \times D}$ with $D=32$ in practical settings. After the stem, non-overlapping spatial patches of size $(P/S) \times (P/S)$ (with $P=16$, $S=4$) are flattened and projected to $C$ dimensions. The process yields patch tokens:

$$X_p = \mathrm{FlattenPatch}(x') \in \mathbb{R}^{N \times ((P/S)^2 \cdot D)}$$

$$Z_p^{(0)} = X_p W_e + b_e, \quad W_e \in \mathbb{R}^{((P/S)^2 \cdot D) \times C}$$

$$Z^{(0)} = [z_c^{(0)};\, Z_p^{(0)}] \in \mathbb{R}^{(N+1) \times C}$$

where $z_c^{(0)}$ is a learnable class token.

Empirical ablation demonstrates that the full I2T configuration (Conv $7\times7$, stride 2 + MaxPool $3\times3$, stride 2 + BN, $D=32$) yields a performance boost of $+1.2\%$ top-1 accuracy over standard raw patch embedding within the DeiT-T backbone (Yuan et al., 2021).
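The shape bookkeeping of the I2T stem can be traced with a short sketch. The values $D=32$, $P=16$, $S=4$ follow the text; the input resolution $H=W=224$ and embedding width $C=192$ are illustrative assumptions (DeiT-T-like):

```python
# Trace the tensor shapes produced by the I2T stem described above.
H = W = 224      # input resolution (illustrative assumption)
D = 32           # stem output channels, per the text
P, S = 16, 4     # original ViT patch size and stem downsampling factor
C = 192          # Transformer embedding width (assumption, DeiT-T uses 192)

# Conv 7x7 stride 2 followed by MaxPool 3x3 stride 2: total downsampling of 4.
h = w = H // 4                       # 56 x 56 feature map
patch = P // S                       # 4: patches of size 4x4 on the stem output
N = (h // patch) * (w // patch)      # number of patch tokens
token_dim = patch * patch * D        # flattened patch dimension before W_e

print(N)          # 196 tokens, same count as ViT with 16x16 patches on raw pixels
print(token_dim)  # 512, projected to C by W_e
```

Note that the token count matches vanilla ViT exactly, so the rest of the Transformer pipeline is unchanged by the stem.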

3. Locally-Enhanced Feed-Forward (LeFF) Layers

CeiT replaces the pointwise feed-forward sub-layer typical of Transformers with a Locally-enhanced Feed-Forward (LeFF) module that injects local spatial awareness through depth-wise convolutions. For the patch tokens $X_p^h \in \mathbb{R}^{N \times C}$ at layer $h$:

$$\begin{aligned}
X_p^{l1} &= \operatorname{GELU}(\operatorname{BN}(X_p^h W_1 + b_1)) \\
X_p^s &= \operatorname{Reshape}\big(X_p^{l1}, (\sqrt{N}, \sqrt{N}, eC)\big), \quad e=4 \\
X_p^d &= \operatorname{GELU}(\operatorname{BN}(\operatorname{DWConv}_{k\times k}(X_p^s))) \\
X_p^f &= \operatorname{Flatten}(X_p^d) \\
X_p^{l2} &= \operatorname{GELU}(\operatorname{BN}(X_p^f W_2 + b_2)) \\
Z^{(h+1)} &= [x_c^h;\, X_p^{l2}]
\end{aligned}$$

Typically, $k=3$ is selected for the kernel size to balance accuracy and efficiency. Empirical studies reveal that LeFF with $k=3$ offers a $+2.1\%$ top-1 accuracy improvement compared to a linear variant ($k=1$). The additional computational cost from the depth-wise convolution (per block: $4k^2NC$ FLOPs) is minor relative to the total cost of the Transformer block.
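The LeFF computation can be sketched in NumPy as follows. This is a minimal illustration that omits BatchNorm and biases, uses a naive loop for the depth-wise convolution, and picks small illustrative dimensions (16 tokens, $C=8$); it is not the paper's implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def leff(x_patch, W1, W2, dw_kernel, k=3):
    """Locally-enhanced Feed-Forward over patch tokens (class token skips this path).
    x_patch: (N, C); W1: (C, eC); W2: (eC, C); dw_kernel: (eC, k, k)."""
    N, C = x_patch.shape
    n = int(np.sqrt(N))
    h = gelu(x_patch @ W1)                    # linear expansion -> (N, eC)
    eC = h.shape[1]
    grid = h.reshape(n, n, eC)                # restore the spatial layout
    pad = k // 2
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(grid)
    for c in range(eC):                       # depth-wise conv: one kernel per channel
        for i in range(n):
            for j in range(n):
                out[i, j, c] = np.sum(padded[i:i+k, j:j+k, c] * dw_kernel[c])
    h = gelu(out.reshape(N, eC))              # flatten back to a token sequence
    return gelu(h @ W2)                       # linear projection back to C

# Illustrative dimensions: 16 tokens (a 4x4 grid), C=8, expansion e=4
rng = np.random.default_rng(0)
N, C, e, k = 16, 8, 4, 3
y = leff(rng.normal(size=(N, C)),
         rng.normal(size=(C, e * C)) * 0.1,
         rng.normal(size=(e * C, C)) * 0.1,
         rng.normal(size=(e * C, k, k)) * 0.1, k)
print(y.shape)  # (16, 8): token count and width are preserved
```

The key design point the sketch makes concrete: the only change relative to a standard FFN is the reshape-convolve-flatten detour, so neighboring tokens mix locally while the input/output interface of the sub-layer stays identical.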

4. Layer-wise Class Token Attention (LCA)

While standard ViT and DeiT utilize only the final-layer class token for prediction, CeiT fuses class tokens from all Transformer layers using Layer-wise Class token Attention (LCA). Given the collection of class tokens from every layer

$$X_C = [z_c^{(1)}, z_c^{(2)}, \dots, z_c^{(L)}] \in \mathbb{R}^{L \times C}$$

LCA treats the final class token $z_c^{(L)}$ as the query and computes attention over the class tokens of all layers:

$$\begin{aligned}
Q &= z_c^{(L)} W_Q \\
K &= X_C W_K \\
V &= X_C W_V \\
A &= \operatorname{softmax}\!\left(\frac{QK^T}{\sqrt{C}}\right)V \\
\hat{z}_c^{(L)} &= \operatorname{LN}(z_c^{(L)} + A) \\
z_c^{\text{out}} &= \operatorname{LN}\big(\hat{z}_c^{(L)} + \mathrm{FFN}(\hat{z}_c^{(L)})\big)
\end{aligned}$$

This approach enriches the top-level class representation with hierarchical context at an additional cost of $O(LC^2)$, which is negligible relative to the $O(N^2C)$ budget for self-attention across $L$ layers. Experimental results indicate a $+0.6\%$ gain in top-1 accuracy when using LCA versus relying solely on the final class token.
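The core attention step of LCA can be sketched in NumPy. This is a single-head illustration without the LayerNorm/FFN residual wrapper, with small illustrative dimensions ($L=12$ layers, $C=16$):

```python
import numpy as np

def lca_attention(class_tokens, Wq, Wk, Wv):
    """Layer-wise Class token Attention: the last-layer class token queries
    the class tokens collected from all L layers.
    class_tokens: (L, C); Wq, Wk, Wv: (C, C)."""
    L, C = class_tokens.shape
    q = class_tokens[-1] @ Wq                 # (C,) query from the final class token
    K = class_tokens @ Wk                     # (L, C) keys, one per layer
    V = class_tokens @ Wv                     # (L, C) values, one per layer
    scores = K @ q / np.sqrt(C)               # (L,) scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the L layers
    return weights @ V                        # (C,) aggregated class representation

rng = np.random.default_rng(0)
L, C = 12, 16                                 # illustrative: 12 layers, width 16
out = lca_attention(rng.normal(size=(L, C)),
                    rng.normal(size=(C, C)),
                    rng.normal(size=(C, C)),
                    rng.normal(size=(C, C)))
print(out.shape)  # (16,)
```

Because the sequence length here is the number of layers $L$ rather than the number of tokens $N$, this attention is far cheaper than a regular self-attention block.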

5. Training Protocol and Optimization Dynamics

CeiT is trained on ImageNet-1K without recourse to external datasets or CNN teacher signals. The optimization pipeline includes random resized cropping, horizontal flipping, RandAugment, Mixup ($\alpha=0.8$), CutMix ($\alpha=1.0$), RepeatAugment, and label smoothing ($0.1$). The AdamW optimizer with weight decay $0.05$ is used. The learning rate is scheduled with a 5-epoch linear warmup and cosine decay, starting at $1 \times 10^{-3}$, for 300 epochs. For fine-tuning at higher resolution ($384 \times 384$), a learning rate of $5 \times 10^{-6}$ for 30 epochs is used.
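The warmup-plus-cosine schedule described above can be expressed as a short function. The base rate of $10^{-3}$, 5 warmup epochs, and 300 total epochs follow the text; the exact ramp shape and floor value are generic assumptions, not the paper's code:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup=5, total=300, min_lr=0.0):
    """Linear warmup for the first `warmup` epochs, then cosine decay to min_lr."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup     # linear ramp up to base_lr
    progress = (epoch - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_epoch(0))    # 2e-4: first warmup epoch
print(lr_at_epoch(4))    # 1e-3: warmup complete, cosine decay begins
print(lr_at_epoch(299))  # ~0: end of the 300-epoch cosine decay
```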

CeiT models achieve convergence at $3\times$ faster rates compared to DeiT: CeiT-T matches DeiT-T's 300-epoch accuracy in only 100 epochs (72.2% vs 65.3% top-1 accuracy). CeiT-B at 100 epochs matches DeiT-B at 300 epochs (81.8% top-1) (Yuan et al., 2021).

6. Empirical Performance and Comparative Evaluation

On ImageNet-1K (single-crop validation), the results are as follows:

| Model | Params | FLOPs | Top-1 |
|-----------|-------|------|-------|
| ResNet-50 | 25.6M | 4.1G | 76.7% |
| DeiT-T | 5.7M | 1.2G | 72.2% |
| DeiT-S | 22.1M | 4.5G | 79.9% |
| CeiT-T | 6.4M | 1.2G | 76.4% |
| CeiT-S | 24.2M | 4.5G | 82.0% |

CeiT-T nearly matches the accuracy of ResNet-50 at one-quarter of its computational footprint. CeiT-S (comparable in size to ResNet-50) outperforms both ResNet-50 and DeiT-S by $+5.3\%$ and $+2.1\%$ top-1, respectively. On seven downstream classification tasks, including iNaturalist'18/19, Cars, Flowers, Pets, and CIFAR-10/100, CeiT-S (224) surpasses DeiT-B, while CeiT-S@384 sets state-of-the-art scores on most splits (e.g., Cars: $94.1\%$, CIFAR-100: $90.8\%$).

7. Computational Analysis and Ablations

The convolutional components introduce only marginal computational overhead:

  • I2T stem: Feature extraction FLOPs are $\sim 1.1\times$ those of ViT's patch embedding, as the feature map is downsampled by $4\times$.
  • LeFF: Additional depth-wise convolution costs (proportional to $4k^2NC$ per block) are minimal compared to the $8(N+1)C^2$ FLOPs of the pointwise FFN in practice.
  • LCA: The $O(LC^2)$ scaling is negligible relative to attention's $O(N^2C)$ budget.
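The LeFF overhead figure can be checked with a quick calculation, using the two FLOP formulas above. The dimensions $N=196$ and $C=384$ are illustrative DeiT-S-like values, not stated in the text:

```python
# Extra depth-wise conv cost per block vs. the pointwise FFN cost,
# using illustrative DeiT-S-like dimensions (N=196 tokens, C=384, k=3).
N, C, k = 196, 384, 3
leff_extra = 4 * k**2 * N * C        # 4 k^2 N C, from the LeFF analysis
ffn = 8 * (N + 1) * C**2             # 8 (N+1) C^2, the pointwise FFN cost

print(leff_extra)                     # 2709504 FLOPs
print(ffn)                            # 232390656 FLOPs
print(f"{leff_extra / ffn:.1%}")      # 1.2% overhead
```

Under these assumptions the depth-wise convolution adds on the order of one percent to the feed-forward cost of each block, consistent with the "negligible overhead" claim.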

Ablations confirm each module's effectiveness: using the full I2T increases top-1 by $+1.2\%$ over the baseline; increasing the LeFF kernel size to $3$ or $5$ yields an extra $+2.1\%$–$+2.2\%$; incorporating LCA provides a further $+0.6\%$ improvement.

By combining lightweight convolutional stems, local mixing feed-forward networks, and hierarchical class token aggregation, CeiT recovers key inductive biases of CNNs, enabling better convergence, higher data efficiency, and improved accuracy relative to both ViT/DeiT and contemporary CNNs, while only marginally increasing computational requirements (Yuan et al., 2021).
