Compact Convolutional Transformer (CCT)

Updated 4 January 2026
  • Compact Convolutional Transformer (CCT) is a hybrid vision transformer that replaces fixed patch extraction with a compact convolutional tokenizer, combining inductive biases with transformer self-attention.
  • It employs a convolutional front-end followed by transformer encoder blocks, leading to faster convergence, enhanced generalization, and parameter efficiency in small- and medium-scale data regimes.
  • CCT has shown superior performance in image classification, spectrogram analysis, and clinical time-series tasks, outperforming baseline CNN and ViT models with fewer parameters and improved accuracy.

The Compact Convolutional Transformer (CCT) is a hybrid vision transformer architecture designed for robust performance and efficiency, particularly in small- to medium-scale data regimes. It replaces the standard patch extraction and linear projection of vanilla Vision Transformers (ViTs) with a compact convolutional tokenizer, thereby combining convolutional inductive biases with transformer self-attention. This design enables strong generalization and parameter efficiency, positioning CCT as a data- and resource-efficient alternative for image, time-series, and spectrogram analysis across modalities, including resource-limited clinical and biomedical scenarios (Hassani et al., 2021, Gao, 2023, Morelle et al., 10 Jan 2025, Bartusiak et al., 2022, Yuhua, 2024).

1. Core Architectural Principles

CCT begins by replacing the ViT’s fixed patch extraction with a small stack of 2D convolutional layers, termed the convolutional tokenizer. This operation maps an input $X \in \mathbb{R}^{H \times W \times C}$ to a sequence of tokens $T \in \mathbb{R}^{N \times d}$, where $N$ is the number of tokens (determined by the spatial extent after convolution and pooling) and $d$ is the embedding dimension. Each convolutional tokenizer layer applies:

  • $\mathrm{Conv2D}(k \times k, s)$, where $k$ is the kernel size and $s$ is the stride,
  • Followed by a non-linearity (ReLU),
  • MaxPooling, to further reduce spatial dimensions and increase invariance.

After convolutional tokenization, a learnable positional embedding $\mathbf{P} \in \mathbb{R}^{N \times d}$ is added to the sequence, preserving positional information even when input resolutions vary.
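A minimal PyTorch-style sketch of this tokenization step is given below; the layer widths, pooling configuration, and the `ConvTokenizer` name are illustrative assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Maps an image batch (B, C, H, W) to a token sequence (B, N, d)."""
    def __init__(self, in_channels=3, embed_dim=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=1, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # downsamples and adds local invariance
        )

    def forward(self, x):
        x = self.block(x)                     # (B, d, H', W')
        return x.flatten(2).transpose(1, 2)   # (B, N, d), with N = H' * W'

tokenizer = ConvTokenizer()
n_tokens = tokenizer(torch.zeros(1, 3, 32, 32)).shape[1]   # infer N from the input resolution
pos_embed = nn.Parameter(torch.zeros(1, n_tokens, 256))    # learnable positional embedding P
tokens = tokenizer(torch.rand(8, 3, 32, 32)) + pos_embed   # (8, N, 256)
```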

The token sequence then passes through a stack of transformer encoder blocks. Each block consists of (1) LayerNorm, (2) Multi-Head Self-Attention (MHSA), (3) a residual connection, (4) another LayerNorm, (5) an MLP (typically using GELU activations and two linear layers), and (6) a second residual connection. The number of blocks ($L$), embedding size ($d$), head count ($h$), and MLP dimension ($d_{\mathrm{ff}}$) are set based on the data scale and variant (e.g., CCT-7/3×1, CCT-14/7×2) (Hassani et al., 2021, Gao, 2023).
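A single pre-norm encoder block of this form can be sketched with standard PyTorch modules as follows (a schematic illustration with assumed dimensions, not the reference code):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm encoder block: LN -> MHSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, d=256, heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))

    def forward(self, z):                                   # z: (B, N, d)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # MHSA + residual connection
        z = z + self.mlp(self.norm2(z))                     # MLP + residual connection
        return z
```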

For sequence aggregation and classification, there is no [CLS] token. Instead, CCT typically applies sequence pooling (either attention pooling or global average pooling across token representations) before passing the pooled representation into an output head (e.g., classification, regression, or segmentation).

2. Mathematical Formulation and Module Functions

The convolutional tokenizer for an input $X$ (batch size $B$):

$$\mathbf{X}_p = \mathrm{Conv2D}(X; k, s) + \mathbf{P}$$

where $\mathbf{P}$ is a learnable positional embedding. Post-tokenization, the tokenized batch becomes $\mathbf{X}_p \in \mathbb{R}^{B \times N \times d}$.
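As an illustrative calculation (assuming the convolution preserves resolution via padding and is followed by $3 \times 3$ max pooling with stride 2 and padding 1): for a $32 \times 32$ input, one conv–pool tokenizer block reduces the spatial grid to $16 \times 16$, giving $N = 256$ tokens, while a second such block reduces it to $8 \times 8$, giving $N = 64$.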

Each transformer block, for layer $l$, processes its input $Z^{(l-1)}$ as:

$$\begin{aligned} Y^{(l)} &= Z^{(l-1)} + \mathrm{MHSA}(\mathrm{LN}(Z^{(l-1)})), \\ Z^{(l)} &= Y^{(l)} + \mathrm{MLP}(\mathrm{LN}(Y^{(l)})) \end{aligned}$$

The MHSA splits the embedding into $h$ heads, each of dimension $d_k = d/h$, computes scaled dot-product attention, concatenates all heads, and applies an output projection:

$$Q_i = Z W^Q_i, \quad K_i = Z W^K_i, \quad V_i = Z W^V_i$$

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

$$\mathrm{MHSA}(Z) = [\mathrm{head}_1, \dots, \mathrm{head}_h]\, W^O$$

The feed-forward MLP applies $\mathrm{MLP}(x) = W_2\, \mathrm{GELU}(W_1 x + b_1) + b_2$.
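Written out explicitly, these attention equations correspond to the following from-scratch sketch (tensor shapes and weight initialization are illustrative assumptions, not taken from the cited papers):

```python
import torch

def mhsa(Z, Wq, Wk, Wv, Wo, h):
    """Multi-head self-attention for Z of shape (B, N, d)."""
    B, N, d = Z.shape
    dk = d // h
    # Project and split into h heads of width dk: (B, h, N, dk)
    Q = (Z @ Wq).view(B, N, h, dk).transpose(1, 2)
    K = (Z @ Wk).view(B, N, h, dk).transpose(1, 2)
    V = (Z @ Wv).view(B, N, h, dk).transpose(1, 2)
    scores = Q @ K.transpose(-2, -1) / dk ** 0.5       # (B, h, N, N) scaled dot products
    heads = torch.softmax(scores, dim=-1) @ V          # (B, h, N, dk) per-head outputs
    heads = heads.transpose(1, 2).reshape(B, N, d)     # concatenate heads along the feature dim
    return heads @ Wo                                  # output projection

d, h = 256, 4
Z = torch.rand(2, 64, d)
Wq, Wk, Wv, Wo = (torch.randn(d, d) * d ** -0.5 for _ in range(4))
out = mhsa(Z, Wq, Wk, Wv, Wo, h)                       # (2, 64, 256)
```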

After all encoder layers, sequence pooling is applied (e.g., SeqPool or attention pooling), and a final linear head produces logits or continuous outputs.
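A minimal sketch of such a pooling-plus-head stage (an attention-style SeqPool followed by a linear classifier; the class name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Learned softmax-weighted pooling over tokens, used in place of a [CLS] token."""
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, z):                        # z: (B, N, d)
        w = torch.softmax(self.score(z), dim=1)  # (B, N, 1) attention weights over the N tokens
        return (w * z).sum(dim=1)                # (B, d) pooled sequence representation

d, num_classes = 256, 10
pool, head = SeqPool(d), nn.Linear(d, num_classes)
logits = head(pool(torch.rand(8, 64, d)))        # (8, 10)
```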

3. Parameter Efficiency and Inductive Bias

The convolutional tokenizer introduces translation-equivariance, weight sharing, and locality, reducing the number of learnable parameters compared to patch-based tokenization in ViT. Empirically, replacing convolutional tokenizers with linear patch projection can double parameter count with no advantage on small data (Hassani et al., 2021, Gao, 2023). The convolutional front-end enhances generalization in limited-sample regimes by enforcing spatial inductive bias, facilitating faster convergence and strong regularization.

Regularization techniques used in CCT include stochastic depth (layer-dropping), weight decay, and label smoothing. For example, a maximal stochastic depth of 0.1, weight decay of $1.2 \times 10^{-4}$, and label smoothing $\epsilon_{\mathrm{ls}} = 0.1$ are employed to reduce overfitting (Gao, 2023).
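A sketch of how these regularizers might be wired up in PyTorch is shown below; the AdamW optimizer, learning rate, and the simple `DropPath` module are assumptions for illustration, while the weight-decay, label-smoothing, and stochastic-depth values mirror those quoted above:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drops a residual branch per sample during training."""
    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):                         # x: residual branch output, (B, N, d)
        if not self.training or self.drop_prob == 0.0:
            return x
        keep = 1.0 - self.drop_prob
        mask = (torch.rand(x.shape[0], 1, 1, device=x.device) < keep).to(x.dtype)
        return x * mask / keep                    # rescale so the expected value is unchanged

model = nn.Linear(256, 8)                         # placeholder standing in for a full CCT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1.2e-4)  # lr is illustrative
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```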

Ablation studies show that:

  • Adding the convolutional tokenizer yields accuracy improvements of +2–3% and reduces reliance on positional embeddings (CCT variants lose only ~0.2% when dropping PE, ViT drops ~7%) (Hassani et al., 2021).
  • Sequence pooling outperforms fixed [CLS] tokens, with gains of +2–4% on CIFAR-10/100 (Hassani et al., 2021).

4. Empirical Performance and Application Domains

CCT demonstrates robust performance in diverse domains:

Image Classification: On BloodMNIST (MedMNIST v2, $28 \times 28$ images, 8 classes, $N \approx 17$K), CCT achieves 92.49% accuracy and micro-average ROC AUC of 0.9935, significantly exceeding baseline ViT performance under limited-data conditions (Gao, 2023).

Small-Scale Vision: On CIFAR-10, CCT-7/3×1 (3.76M parameters) achieves 98% accuracy, surpassing ResNet-56 (94.63%) and significantly outperforming ViT-12/16 (83.04%) while using roughly 20× fewer parameters than the ViT baseline (Hassani et al., 2021). On Flowers-102, CCT-14/7×2 achieves 99.76% top-1 accuracy with ImageNet pretraining.

Spectrogram Analysis and Speech Detection: For synthesized speech detection, CCT with two 3×3 convolutional layers and two transformer layers attains 92.13% accuracy (ROC AUC 0.9646), substantially outperforming both shallow and deep CNN baselines and classical machine learning models (Bartusiak et al., 2022).

Clinical Time-Series Representation: By employing an ImageNet-pretrained CCT-14/7×2 backbone with frozen convolutional tokenizer and fine-tuned transformer, CCT yields an AUROC of 0.9415 for clinical time-series mortality prediction on MIMIC-III, outperforming ResNet+StageNet, Tab Transformer, and FT-Transformer backbones (Yuhua, 2024).

Weakly Supervised Medical Segmentation: In OCT hyper-reflective foci segmentation, CCT enables full-resolution inference and yields Dice scores of 0.33 (vs. 0.11 for MIL), with CCT’s attention allowing global context and avoiding border artifacts (Morelle et al., 10 Jan 2025).

5. Comparative Analysis and Ablation

Table: CCT vs. Key Baselines on Small-Scale Vision Tasks

| Model | Params | CIFAR-10 (%) | Flowers-102 (%) | ImageNet-1k (%) |
|---|---|---|---|---|
| ViT-12/16 | 85M | 83.0 | — | — |
| ResNet-56 | 0.85M | 94.6 | — | — |
| CCT-7/3×1 | 3.76M | 98.0 | — | — |
| CCT-14/7×2 (pretrained) | 22.4M | — | 99.76 | 82.71 |
| ResNet50 | 25.6M | — | — | 77.2 |

CCT consistently reaches or surpasses CNN and ViT architectures with fewer parameters across tasks.

6. Extensions: Transfer Learning, Sequence Pooling, and Specialized Losses

CCT adapts effectively to transfer learning in low-data regimes. For clinical time-series, convolutional tokenizer and transformer parameters are initialized from ImageNet pretraining and then partially frozen during fine-tuning, preserving learned visual inductive biases (Yuhua, 2024). Specialized augmentation (PatchUp Soft) and custom losses (CamCenterLoss) further enhance performance by improving intra-class feature compactness and calibrating attention to critical regions.
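In code, this partial freezing amounts to something like the sketch below, assuming a hypothetical `cct` module that exposes `tokenizer` and `classifier` attributes (the attribute names and dimensions are illustrative):

```python
import torch.nn as nn

def prepare_for_finetuning(cct: nn.Module, num_outputs: int, d: int = 384) -> nn.Module:
    """Freeze the pretrained convolutional tokenizer and attach a new task-specific head."""
    for p in cct.tokenizer.parameters():          # preserve ImageNet-learned visual inductive biases
        p.requires_grad = False
    cct.classifier = nn.Linear(d, num_outputs)    # transformer blocks remain trainable
    return cct

# Only trainable parameters (transformer blocks + new head) are then handed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW((p for p in cct.parameters() if p.requires_grad), lr=1e-4)
```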

Sequence pooling (SeqPool) is the standard aggregation method in CCT. It applies a learned, softmax-weighted sum over tokens—empirically superior to fixed [CLS] tokens across a range of benchmarks (Hassani et al., 2021).

7. Impact, Limitations, and Prospects

CCT has enabled robust and scalable transformer performance across diverse domains on modest computational budgets and with limited data. Its efficacy for small medical images, full-resolution weakly supervised medical segmentation, spectrogram-based deepfake detection, and clinical time-series demonstrates broad applicability (Hassani et al., 2021, Gao, 2023, Bartusiak et al., 2022, Yuhua, 2024, Morelle et al., 10 Jan 2025). However, extension to more heterogeneous modalities, larger-scale datasets, and cross-modal integration requires further empirical validation.

Reported limitations include the dependence on fixed input preprocessing (e.g., spectrogram size), dataset-specific architectural tuning, and the need for further generalization studies to new domains and out-of-distribution data (Bartusiak et al., 2022). No ablation studies have been published specifically comparing CCT’s convolutional tokenizer versus linear projection on biomedical or clinical data, although prior work suggests replacing the convolutional tokenizer with linear projection increases parameter count by approximately 2× without benefit unless trained on very large datasets (Hassani et al., 2021, Gao, 2023).

CCT’s hybrid design and empirical performance directly address the challenge of data scarcity in scientific and biomedical domains, providing a template for resource-efficient and generalization-focused transformer architectures.
