
Compact Convolutional Transformers (CCT)

Updated 4 January 2026
  • Compact Convolutional Transformers (CCT) are hybrid models that combine convolutional tokenization with transformer self-attention to efficiently capture local and global features.
  • They leverage convolutional layers to introduce spatial inductive bias, reducing parameter count and overfitting compared to traditional Vision Transformers.
  • Empirical studies demonstrate that CCT achieves state-of-the-art accuracy across modalities such as medical imaging, time-series, and spectrogram analysis with rapid convergence.

A Compact Convolutional Transformer (CCT) is a hybrid neural architecture designed to address the data inefficiency and parameter scaling limitations inherent in conventional Vision Transformers (ViTs) and pure-attention models. CCTs integrate a lightweight convolutional tokenizer for local feature extraction, followed by a stack of transformer encoder blocks with multi-head self-attention. Unlike canonical ViT designs, CCT forgoes patch-wise linear embedding in favor of convolutions, which inject spatial inductive bias, weight sharing, and translation equivariance. This design yields markedly improved performance, parameter efficiency, and rapid convergence when training on small or modestly sized datasets across diverse domains, especially when labeled samples are limited. Empirical results demonstrate that CCT architectures can robustly match or surpass state-of-the-art CNNs and transformers on image, time-series, and spectrogram modalities, even without large-scale pretraining (Gao, 2023, Hassani et al., 2021, Bartusiak et al., 2022, Morelle et al., 10 Jan 2025, Yuhua, 2024).

1. Architectural Principles and Design

The canonical CCT architecture comprises two main stages: convolutional tokenization and transformer encoding. The convolutional tokenizer uses one or more Conv2D layers (typical kernel sizes: 3×3 or 7×7; strides: 1 or 2), each followed by ReLU activation and max-pooling. Let $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ denote the input image. The convolutional tokenizer produces a sequence of $N$ tokens of embedding dimension $d$:

$$\mathbf{X}_p = \mathrm{Conv2D}(\mathbf{X};\, k, s) + \mathbf{P}$$

where $k$ is the convolutional kernel size, $s$ is the stride, and $\mathbf{P} \in \mathbb{R}^{N \times d}$ is a learnable positional embedding matrix (Gao, 2023). The resulting $N \times d$ token matrix is fed to a stack of $L$ transformer encoder blocks, each containing multi-head self-attention, residual connections, layer normalization, and a feed-forward MLP. In contrast with ViT, CCT dispenses with the [CLS] class token and instead employs sequence pooling (SeqPool), aggregating attended token features into a single vector via a softmax-weighted sum or global average pooling (Hassani et al., 2021).

Typical configurations (CCT-7/3×1, CCT-14/7×2) vary depth ($L = 2$–$14$), number of heads ($h = 2$–$6$), embedding dimension ($d = 128$–$384$), and intermediate MLP width ($d_{ff} = 128$–$1536$), with total parameter counts spanning 0.28M (very small) up to 22M (ImageNet scale) (Hassani et al., 2021, Gao, 2023, Yuhua, 2024).
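As a concrete illustration, the following PyTorch sketch wires the two stages together: a single Conv2D tokenizer block, learnable positional embeddings, a stack of standard transformer encoder layers, and softmax-weighted sequence pooling in place of a class token. The layer widths, kernel size, and depth are illustrative assumptions in the spirit of CCT-7/3×1, not the reference implementation.

```python
import torch
import torch.nn as nn

class CompactConvTransformer(nn.Module):
    """Minimal CCT-style model: conv tokenizer -> transformer encoder -> SeqPool head.

    Hyperparameters are illustrative (roughly CCT-7/3x1), not the reference values.
    """
    def __init__(self, in_ch=3, embed_dim=256, depth=7, heads=4,
                 mlp_dim=512, num_classes=10, img_size=32):
        super().__init__()
        # Convolutional tokenizer: Conv2D + ReLU + max-pooling (single block, 3x3 kernel)
        self.tokenizer = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        n_tokens = (img_size // 2) ** 2                       # spatial positions after pooling
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tokens, embed_dim))  # learnable P
        enc_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=mlp_dim,
            dropout=0.1, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.attn_pool = nn.Linear(embed_dim, 1)              # SeqPool scores (no [CLS] token)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                     # x: (B, C, H, W)
        t = self.tokenizer(x)                                 # (B, d, H', W')
        t = t.flatten(2).transpose(1, 2)                      # (B, N, d) token sequence
        z = self.encoder(t + self.pos_emb)                    # L transformer encoder blocks
        w = torch.softmax(self.attn_pool(z), dim=1)           # (B, N, 1) weights over tokens
        return self.head((w * z).sum(dim=1))                  # softmax-weighted pool -> logits

logits = CompactConvTransformer()(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
```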

2. Mathematical Foundations

CCT employs standard transformer mechanisms, adapted for compactness and local bias:

  • Tokenization: for each convolutional block, the output token $x_n$ is

$$x_n = \sum_{c=1}^{C}\sum_{u=1}^{K_h}\sum_{v=1}^{K_w} W_{d,c,u,v}\, X_{i+u-1,\, j+v-1,\, c} + b_d$$

for kernel $W$ and bias $b$ (Morelle et al., 10 Jan 2025).
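As a sanity check on this summation, the short script below (a sketch; the tensor sizes and the chosen output location are arbitrary) evaluates the triple sum directly and compares it with what a standard Conv2d produces at that location, assuming unit stride and no padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, Kh, Kw, D = 3, 3, 3, 8                         # input channels, kernel size, output channels
conv = nn.Conv2d(C, D, kernel_size=(Kh, Kw), stride=1, padding=0)
X = torch.randn(1, C, 10, 10)

i, j, d = 2, 4, 5                                 # arbitrary output location and channel
# Direct evaluation of sum_c sum_u sum_v W[d,c,u,v] * X[c, i+u, j+v] + b[d] (0-indexed)
manual = sum(conv.weight[d, c, u, v] * X[0, c, i + u, j + v]
             for c in range(C) for u in range(Kh) for v in range(Kw)) + conv.bias[d]
auto = conv(X)[0, d, i, j]
print(torch.allclose(manual, auto, atol=1e-5))    # True: Conv2d realizes the token formula
```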

  • Multi-Head Self-Attention (MHSA): for block $l$,

$$Q_i = Z W^Q_i, \quad K_i = Z W^K_i, \quad V_i = Z W^V_i$$
$$\mathrm{head}_i = \mathrm{softmax}\!\Bigl(\frac{Q_i K_i^T}{\sqrt{d_k}}\Bigr) V_i$$
$$\mathrm{MHSA}(Z) = [\mathrm{head}_1, \ldots, \mathrm{head}_h]\, W^O$$

with $d_k = d/h$, followed by a residual connection, layer normalization, and an MLP (Gao, 2023, Hassani et al., 2021).
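These equations translate directly into a few lines of PyTorch; the sketch below is a minimal rendering with an illustrative head count and width, omitting dropout and the surrounding residual/normalization wrapper.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention exactly as in the equations above (no dropout)."""
    def __init__(self, d=256, h=4):
        super().__init__()
        assert d % h == 0
        self.h, self.dk = h, d // h                       # d_k = d / h
        self.Wq, self.Wk, self.Wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.Wo = nn.Linear(d, d, bias=False)             # output projection W^O

    def forward(self, Z):                                 # Z: (B, N, d)
        B, N, d = Z.shape
        # Project and split into h heads: (B, h, N, d_k)
        def split(W, x): return W(x).view(B, N, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(self.Wq, Z), split(self.Wk, Z), split(self.Wv, Z)
        A = torch.softmax(Q @ K.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # attention weights
        heads = (A @ V).transpose(1, 2).reshape(B, N, d)  # concatenate the h heads
        return self.Wo(heads)                             # MHSA(Z)

out = MHSA()(torch.randn(2, 64, 256))                     # -> (2, 64, 256)
```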

  • Pooling and Classification: The post-transformer token sequence is pooled:

$$z = \frac{1}{N}\sum_{n=1}^{N} X^{(L)}_n$$

and mapped to output logits by a final dense layer.
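Both pooling choices amount to a few lines of code. The sketch below contrasts global average pooling (the formula above) with the softmax-weighted SeqPool variant; the scoring layer `g` is a hypothetical name for the single-logit projection.

```python
import torch
import torch.nn as nn

def average_pool(z):
    """Global average pooling: z = (1/N) * sum_n X^(L)_n."""
    return z.mean(dim=1)                      # (B, N, d) -> (B, d)

class SeqPool(nn.Module):
    """Softmax-weighted sequence pooling used by CCT in place of a [CLS] token."""
    def __init__(self, d):
        super().__init__()
        self.g = nn.Linear(d, 1)              # hypothetical name: one score per token

    def forward(self, z):                     # z: (B, N, d) final-layer tokens
        w = torch.softmax(self.g(z), dim=1)   # (B, N, 1) attention weights over tokens
        return (w * z).sum(dim=1)             # (B, d) weighted sum

z = torch.randn(4, 64, 256)                   # e.g. 64 tokens of width 256
head = nn.Linear(256, 8)                      # final dense layer producing logits
logits = head(SeqPool(256)(z))                # or: head(average_pool(z))
```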

3. Data Efficiency and Regularization

A central principle of CCT is the introduction of convolutional inductive bias, which enables efficient training on small datasets and strong generalization. Compared to ViT’s per-patch linear projection, convolutional tokenizers drastically reduce parameter count and overfitting risk. For example, the BloodMNIST CCT achieves 92.49% accuracy and micro-AUC 0.9935 on only ~17,000 $28 \times 28$ images, with validation accuracy exceeding 80% after five epochs and stabilizing by epoch 10 (Gao, 2023).

Parameter efficiency is further enhanced via:

  • Stochastic Depth: Layer-dropping probability up to 0.1 to regularize deep architectures.
  • Weight Decay & Label Smoothing: weight decay $\lambda \approx 1.2 \times 10^{-4}$ and label smoothing $\epsilon_{\mathrm{ls}} = 0.1$ to mitigate overfitting (a configuration sketch follows this list).
  • Robust Pooling: Sequence pooling and omission of class token prevent token collapse and facilitate efficient information aggregation.
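A minimal sketch of how these regularizers might be combined in a training setup is shown below; the learning rate, the stand-in model, and the simplified whole-batch stochastic-depth module are illustrative assumptions, while the weight-decay and label-smoothing values follow the text.

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Simplified whole-batch stochastic depth: drop a residual branch with probability p."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, branch_out):                        # output of a residual branch
        if self.training:
            if torch.rand(()) < self.p:
                return torch.zeros_like(branch_out)       # skip this block for the step
            return branch_out / (1.0 - self.p)            # rescale to keep the expectation
        return branch_out

model = nn.Linear(256, 8)                                 # stand-in for a CCT model (illustrative)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=5e-4,                    # illustrative learning rate
                              weight_decay=1.2e-4)        # lambda from the text
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # eps_ls = 0.1
```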

Ablation studies show that CCT with convolutional tokenizer outperforms direct patch embedding ViTs, particularly in small to moderate sample regimes; replacing convolution with linear projection doubles parameter count and requires substantially more data for equivalent accuracy (Hassani et al., 2021, Gao, 2023).

4. Application Domains

CCT delivers strong results across varied data modalities and tasks:

  • Medical Imaging Classification: On BloodMNIST, CCT achieves high multiclass accuracy (92.49%, average per-class F1-score 0.92, AUC 0.9935) without pretraining (Gao, 2023).
  • Weakly Supervised Segmentation: CCT augments multiple instance learning (MIL) with attention-based global information exchange. When paired with SAM2 and Layer-wise Relevance Propagation (LRP), CCT yields 2–3× improvement in Dice scores (~0.33 vs MIL’s ~0.13) in OCT B-scan hyper-reflective foci localization, while ensuring spatial precision (Morelle et al., 10 Jan 2025).
  • Spectrogram Analysis: For synthesized speech detection (ASVspoof2019), CCT attains 92.13% accuracy and ROC AUC 0.9646, outperforming CNN and classical methods (Bartusiak et al., 2022).
  • Clinical Time-Series Prediction: Via transfer learning from ImageNet-pretrained CCT, mortality prediction from single-modality medical records attains AUROC 0.9415 with domain-specific augmentations and CamCenterLoss (Yuhua, 2024).
Representative per-class results for the BloodMNIST CCT classifier (Gao, 2023):

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.90 | 0.86 | 0.88 | 656 |
| 1 | 0.99 | 0.98 | 0.99 | 689 |
| 2 | 0.95 | 0.87 | 0.91 | 246 |
| ... | ... | ... | ... | ... |
| Macro Avg | 0.93 | 0.89 | 0.92 | 3,421 |

5. Comparative Analysis and Ablations

Extensive comparative and ablation studies validate CCT’s efficacy:

  • On CIFAR-10/100: CCT-7/3×1 (3.76M parameters) reaches 98.00% accuracy, outperforming CNNs (ResNet56 at 94.63%) and substantially trimming parameter burden compared to ViT-12/16 (85M params, 83.04%) (Hassani et al., 2021).
  • Flowers-102: 99.76% accuracy with ImageNet-pretrained CCT-14/7×2, matching best reported results (Hassani et al., 2021).
  • ImageNet-1k: CCT-14/7×2 (22.4M params) attains 80.67% top-1, competitive with ResNet50 and ViT-S (Hassani et al., 2021).

Sequence pooling replaces the [CLS] token and provides a 2–4% accuracy boost over vanilla ViT in small-data regimes. Removing the positional embedding reduces test accuracy by only 0.2% for CCT (vs. 7% for ViT), confirming that convolutional features supply a strong positional bias (Hassani et al., 2021). Freezing the pretrained tokenizer during clinical time-series transfer increases AUROC by ~0.04 over fine-tuning (Yuhua, 2024).

6. Extensions, Limitations, and Future Directions

CCT is extensible to multiple domains (image, spectrogram, time-series) and robust to low sample sizes and data imbalance. Current limitations include generalization to diverse codecs and languages (in synthesized speech detection) and the specificity of reported results to the benchmarked datasets.

A plausible implication is that the convolutional tokenizer’s spatial weight sharing and local bias provisioning inherently mitigate the sample complexity that renders pure transformers impractical for small datasets. This suggests broad applicability of CCT in biomedical image analysis, rare-event detection, and resource-limited settings.

7. References and Key Studies

  • “Escaping the Big Data Paradigm with Compact Transformers” (Hassani et al., 2021): foundational architecture, ablations, and comparative analysis.
  • “More for Less: Compact Convolutional Transformers Enable Robust Medical Image Classification with Limited Data” (Gao, 2023): parameter efficiency and medical imaging application.
  • “Weakly Supervised Segmentation of Hyper-Reflective Foci with Compact Convolutional Transformers and SAM2” (Morelle et al., 10 Jan 2025): weakly supervised segmentation, relevance propagation, and global self-attention.
  • “Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis” (Bartusiak et al., 2022): convolutional inductive bias in audio.
  • “Multi-modal Deep Learning” (Yuhua, 2024): transfer learning from a pretrained CCT to clinical time series, augmentation, and CamCenterLoss development.
