
Compact Convolutional Transformers (CCT)

Updated 4 January 2026
  • Compact Convolutional Transformers (CCT) are hybrid models that combine convolutional tokenization with transformer self-attention to efficiently capture local and global features.
  • They leverage convolutional layers to introduce spatial inductive bias, reducing parameter count and overfitting compared to traditional Vision Transformers.
  • Empirical studies demonstrate that CCT achieves state-of-the-art accuracy across modalities such as medical imaging, time-series, and spectrogram analysis with rapid convergence.

A Compact Convolutional Transformer (CCT) is a hybrid neural architecture designed to address the data inefficiency and parameter scaling limitations inherent in conventional Vision Transformers (ViTs) and pure-attention models. CCTs integrate a lightweight convolutional tokenizer for local feature extraction, followed by a stack of transformer encoder blocks with multi-head self-attention. Unlike canonical ViT designs, CCT forgoes patch-wise linear embedding in favor of convolutions, which inject spatial inductive bias, weight sharing, and translation equivariance. This design yields markedly improved performance, parameter efficiency, and rapid convergence when training on small or modestly sized datasets across diverse domains, especially when labeled samples are limited. Empirical results demonstrate that CCT architectures can robustly match or surpass state-of-the-art CNNs and transformers on image, time-series, and spectrogram modalities, even without large-scale pretraining (Gao, 2023, Hassani et al., 2021, Bartusiak et al., 2022, Morelle et al., 10 Jan 2025, Yuhua, 2024).

1. Architectural Principles and Design

The canonical CCT architecture comprises two main stages: convolutional tokenization and transformer encoding. The convolutional tokenizer uses one or more Conv2D layers (typical kernel sizes: 3×3 or 7×7; strides: 1 or 2), each followed by ReLU activation and max-pooling. Let $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ denote the input image. The convolutional tokenizer produces a sequence of $N$ tokens of embedding dimension $d$:

$$\mathbf{X}_p = \mathrm{Conv2D}(\mathbf{X};\, k, s) + \mathbf{P}$$

where $k$ is the convolutional kernel size, $s$ is the stride, and $\mathbf{P} \in \mathbb{R}^{N \times d}$ is a learnable positional embedding matrix (Gao, 2023). The resulting $N \times d$ token matrix is fed to a stack of $L$ transformer encoder blocks, each containing multi-head self-attention, residual connections, layer normalization, and a feed-forward MLP. In contrast with ViT, CCT dispenses with the [CLS] class token and instead employs sequence pooling (SeqPool), aggregating attended token features into a single vector via a softmax-weighted sum or global average pooling (Hassani et al., 2021).

Typical configurations (CCT-7/3×1, CCT-14/7×2) vary depth ($L = 2$–$14$), number of heads ($h = 2$–$6$), embedding dimension ($d = 128$–$384$), and intermediate MLP width ($d_{ff} = 128$–$1536$), with total parameter counts spanning 0.28M (very small) up to 22M (ImageNet scale) (Hassani et al., 2021, Gao, 2023, Yuhua, 2024).
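As a concrete illustration, the following PyTorch sketch wires the two stages together: a single Conv2D tokenizer block, learnable positional embeddings, a stack of standard transformer encoder layers, and softmax-weighted sequence pooling in place of a class token. The layer widths, kernel size, and depth are illustrative assumptions in the spirit of CCT-7/3×1, not the reference implementation.

```python
import torch
import torch.nn as nn

class CompactConvTransformer(nn.Module):
    """Minimal CCT-style model: conv tokenizer -> transformer encoder -> SeqPool head.

    Hyperparameters are illustrative (roughly CCT-7/3x1), not the reference values.
    """
    def __init__(self, in_ch=3, embed_dim=256, depth=7, heads=4,
                 mlp_dim=512, num_classes=10, img_size=32):
        super().__init__()
        # Convolutional tokenizer: Conv2D + ReLU + max-pooling (single block, 3x3 kernel)
        self.tokenizer = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        n_tokens = (img_size // 2) ** 2                       # spatial positions after pooling
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tokens, embed_dim))  # learnable P
        enc_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=mlp_dim,
            dropout=0.1, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.attn_pool = nn.Linear(embed_dim, 1)              # SeqPool scores (no [CLS] token)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                     # x: (B, C, H, W)
        t = self.tokenizer(x)                                 # (B, d, H', W')
        t = t.flatten(2).transpose(1, 2)                      # (B, N, d) token sequence
        z = self.encoder(t + self.pos_emb)                    # L transformer encoder blocks
        w = torch.softmax(self.attn_pool(z), dim=1)           # (B, N, 1) weights over tokens
        return self.head((w * z).sum(dim=1))                  # softmax-weighted pool -> logits

logits = CompactConvTransformer()(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
```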

2. Mathematical Foundations

CCT employs standard transformer mechanisms, adapted for compactness and local bias:

  • Tokenization: for each convolutional block, the output token $x_n$ is

$$x_n = \sum_{c=1}^{C}\sum_{u=1}^{K_h}\sum_{v=1}^{K_w} W_{d,c,u,v}\, X_{i+u-1,\, j+v-1,\, c} + b_d$$

for kernel $W$ and bias $b$ (Morelle et al., 10 Jan 2025).
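As a sanity check on this summation, the short script below (a sketch; the tensor sizes and the chosen output location are arbitrary) evaluates the triple sum directly and compares it with what a standard Conv2d produces at that location, assuming unit stride and no padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, Kh, Kw, D = 3, 3, 3, 8                         # input channels, kernel size, output channels
conv = nn.Conv2d(C, D, kernel_size=(Kh, Kw), stride=1, padding=0)
X = torch.randn(1, C, 10, 10)

i, j, d = 2, 4, 5                                 # arbitrary output location and channel
# Direct evaluation of sum_c sum_u sum_v W[d,c,u,v] * X[c, i+u, j+v] + b[d] (0-indexed)
manual = sum(conv.weight[d, c, u, v] * X[0, c, i + u, j + v]
             for c in range(C) for u in range(Kh) for v in range(Kw)) + conv.bias[d]
auto = conv(X)[0, d, i, j]
print(torch.allclose(manual, auto, atol=1e-5))    # True: Conv2d realizes the token formula
```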

  • Multi-Head Self-Attention (MHSA): for block $l$,

$$Q_i = Z W^Q_i, \quad K_i = Z W^K_i, \quad V_i = Z W^V_i$$
$$\mathrm{head}_i = \mathrm{softmax}\!\Bigl(\frac{Q_i K_i^T}{\sqrt{d_k}}\Bigr) V_i$$
$$\mathrm{MHSA}(Z) = [\mathrm{head}_1, \ldots, \mathrm{head}_h]\, W^O$$

with $d_k = d/h$, followed by a residual connection, layer normalization, and an MLP (Gao, 2023, Hassani et al., 2021).
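These equations translate directly into a few lines of PyTorch; the sketch below is a minimal rendering with an illustrative head count and width, omitting dropout and the surrounding residual/normalization wrapper.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention exactly as in the equations above (no dropout)."""
    def __init__(self, d=256, h=4):
        super().__init__()
        assert d % h == 0
        self.h, self.dk = h, d // h                       # d_k = d / h
        self.Wq, self.Wk, self.Wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.Wo = nn.Linear(d, d, bias=False)             # output projection W^O

    def forward(self, Z):                                 # Z: (B, N, d)
        B, N, d = Z.shape
        # Project and split into h heads: (B, h, N, d_k)
        def split(W, x): return W(x).view(B, N, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(self.Wq, Z), split(self.Wk, Z), split(self.Wv, Z)
        A = torch.softmax(Q @ K.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # attention weights
        heads = (A @ V).transpose(1, 2).reshape(B, N, d)  # concatenate the h heads
        return self.Wo(heads)                             # MHSA(Z)

out = MHSA()(torch.randn(2, 64, 256))                     # -> (2, 64, 256)
```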

  • Pooling and Classification: The post-transformer token sequence is pooled:

$$z = \frac{1}{N}\sum_{n=1}^{N} X^{(L)}_n$$

and mapped to output logits by a final dense layer.
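Both pooling choices amount to a few lines of code. The sketch below contrasts global average pooling (the formula above) with the softmax-weighted SeqPool variant; the scoring layer `g` is a hypothetical name for the single-logit projection.

```python
import torch
import torch.nn as nn

def average_pool(z):
    """Global average pooling: z = (1/N) * sum_n X^(L)_n."""
    return z.mean(dim=1)                      # (B, N, d) -> (B, d)

class SeqPool(nn.Module):
    """Softmax-weighted sequence pooling used by CCT in place of a [CLS] token."""
    def __init__(self, d):
        super().__init__()
        self.g = nn.Linear(d, 1)              # hypothetical name: one score per token

    def forward(self, z):                     # z: (B, N, d) final-layer tokens
        w = torch.softmax(self.g(z), dim=1)   # (B, N, 1) attention weights over tokens
        return (w * z).sum(dim=1)             # (B, d) weighted sum

z = torch.randn(4, 64, 256)                   # e.g. 64 tokens of width 256
head = nn.Linear(256, 8)                      # final dense layer producing logits
logits = head(SeqPool(256)(z))                # or: head(average_pool(z))
```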

3. Data Efficiency and Regularization

A central principle of CCT is the introduction of convolutional inductive bias, which enables efficient training on small datasets and strong generalization. Compared to ViT’s per-patch linear projection, convolutional tokenizers drastically reduce parameter count and overfitting risk. For example, the BloodMNIST CCT achieves 92.49% accuracy and micro-AUC 0.9935 on only ~17,000 $28 \times 28$ images, with validation accuracy exceeding 80% after five epochs and stabilizing by epoch 10 (Gao, 2023).

Parameter efficiency is further enhanced via:

  • Stochastic Depth: Layer-dropping probability up to 0.1 to regularize deep architectures.
  • Weight Decay & Label Smoothing: weight decay $\lambda \approx 1.2 \times 10^{-4}$ and label smoothing $\epsilon_{\mathrm{ls}} = 0.1$ to mitigate overfitting (a configuration sketch follows this list).
  • Robust Pooling: Sequence pooling and omission of class token prevent token collapse and facilitate efficient information aggregation.
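A minimal sketch of how these regularizers might be combined in a training setup is shown below; the learning rate, the stand-in model, and the simplified whole-batch stochastic-depth module are illustrative assumptions, while the weight-decay and label-smoothing values follow the text.

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Simplified whole-batch stochastic depth: drop a residual branch with probability p."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, branch_out):                        # output of a residual branch
        if self.training:
            if torch.rand(()) < self.p:
                return torch.zeros_like(branch_out)       # skip this block for the step
            return branch_out / (1.0 - self.p)            # rescale to keep the expectation
        return branch_out

model = nn.Linear(256, 8)                                 # stand-in for a CCT model (illustrative)
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=5e-4,                    # illustrative learning rate
                              weight_decay=1.2e-4)        # lambda from the text
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)      # eps_ls = 0.1
```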

Ablation studies show that CCT with convolutional tokenizer outperforms direct patch embedding ViTs, particularly in small to moderate sample regimes; replacing convolution with linear projection doubles parameter count and requires substantially more data for equivalent accuracy (Hassani et al., 2021, Gao, 2023).

4. Application Domains

CCT delivers strong results across varied data modalities and tasks:

  • Medical Imaging Classification: On BloodMNIST, CCT achieves high multiclass accuracy (92.49%, average per-class F1-score 0.92, AUC 0.9935) without pretraining (Gao, 2023).
  • Weakly Supervised Segmentation: CCT augments multiple instance learning (MIL) with attention-based global information exchange. When paired with SAM2 and Layer-wise Relevance Propagation (LRP), CCT yields 2–3× improvement in Dice scores (~0.33 vs MIL’s ~0.13) in OCT B-scan hyper-reflective foci localization, while ensuring spatial precision (Morelle et al., 10 Jan 2025).
  • Spectrogram Analysis: For synthesized speech detection (ASVspoof2019), CCT attains 92.13% accuracy and ROC AUC 0.9646, outperforming CNN and classical methods (Bartusiak et al., 2022).
  • Clinical Time-Series Prediction: Via transfer learning from ImageNet-pretrained CCT, mortality prediction from single-modality medical records attains AUROC 0.9415 with domain-specific augmentations and CamCenterLoss (Yuhua, 2024).
Representative per-class results for the BloodMNIST CCT classifier (Gao, 2023):

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.90 | 0.86 | 0.88 | 656 |
| 1 | 0.99 | 0.98 | 0.99 | 689 |
| 2 | 0.95 | 0.87 | 0.91 | 246 |
| ... | ... | ... | ... | ... |
| Macro Avg | 0.93 | 0.89 | 0.92 | 3,421 |

5. Comparative Analysis and Ablations

Extensive comparative and ablation studies validate CCT’s efficacy:

  • On CIFAR-10/100: CCT-7/3×1 (3.76M parameters) reaches 98.00% accuracy, outperforming CNNs (ResNet56 at 94.63%) and substantially trimming parameter burden compared to ViT-12/16 (85M params, 83.04%) (Hassani et al., 2021).
  • Flowers-102: 99.76% accuracy with ImageNet-pretrained CCT-14/7×2, matching best reported results (Hassani et al., 2021).
  • ImageNet-1k: CCT-14/7×2 (22.4M params) attains 80.67% top-1, competitive with ResNet50 and ViT-S (Hassani et al., 2021).

Sequence pooling replaces the [CLS] token and provides a 2–4% accuracy boost over vanilla ViT in small-data regimes. Removing the positional embedding reduces test accuracy by only 0.2% for CCT (vs. 7% for ViT), confirming that convolutional features supply a strong positional bias (Hassani et al., 2021). Freezing the pretrained tokenizer during clinical time-series transfer increases AUROC by ~0.04 over fine-tuning (Yuhua, 2024).

6. Extensions, Limitations, and Future Directions

CCT is extensible to multiple domains (image, spectrogram, time-series) and robust to low sample sizes and data imbalance. Current limitations include generalization to diverse codecs and languages (in synthesized speech detection) and the specificity of reported results to the benchmarked datasets.

A plausible implication is that the convolutional tokenizer’s spatial weight sharing and local bias provisioning inherently mitigate the sample complexity that renders pure transformers impractical for small datasets. This suggests broad applicability of CCT in biomedical image analysis, rare-event detection, and resource-limited settings.

7. References and Key Studies

  • “Escaping the Big Data Paradigm with Compact Transformers” (Hassani et al., 2021): foundational architecture, ablations, and comparative analysis.
  • “More for Less: Compact Convolutional Transformers Enable Robust Medical Image Classification with Limited Data” (Gao, 2023): parameter efficiency and medical imaging application.
  • “Weakly Supervised Segmentation of Hyper-Reflective Foci with Compact Convolutional Transformers and SAM2” (Morelle et al., 10 Jan 2025): weakly supervised segmentation, relevance propagation, and global self-attention.
  • “Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis” (Bartusiak et al., 2022): convolutional inductive bias in audio.
  • “Multi-modal Deep Learning” (Yuhua, 2024): transfer learning from a pretrained CCT to clinical time series, augmentation, and CamCenterLoss development.
