Native Vision Transformer (ViT) Overview

Updated 26 November 2025
  • Native Vision Transformer (ViT) is a pure transformer-based model that tokenizes images into fixed-size patches to perform image recognition without convolutional inductive bias.
  • It processes patch embeddings with a standard transformer encoder architecture featuring multi-head self-attention, Pre-LayerNorm, and residual connections to capture both global and local features.
  • The model achieves state-of-the-art results on benchmarks by leveraging large-scale pre-training and scaling depth, width, and patch size for optimal accuracy and computational efficiency.

The Native Vision Transformer (ViT) is a pure Transformer encoder model that adapts the Transformer architecture, originally devised for natural language processing, to image recognition without any convolutional components. By tokenizing images into a sequence of fixed-size, non-overlapping patches and processing them directly with a standard Transformer encoder, ViT demonstrates that explicit convolutional inductive bias is unnecessary for high-accuracy image classification, provided sufficient large-scale pre-training is available. When pre-trained on extensive datasets such as ImageNet-21k or JFT-300M and transferred to benchmarks including ImageNet, CIFAR-100, and VTAB, ViT attains state-of-the-art results at lower computational training cost than high-capacity convolutional neural networks (CNNs) (Dosovitskiy et al., 2020).

1. Patch Embedding and Input Representation

ViT operates by partitioning an input image $x \in \mathbb{R}^{H \times W \times C}$ into a regular grid of non-overlapping patches of size $P \times P$, yielding

$$N = \frac{H \times W}{P^2}$$

distinct patches. Each patch $\mathrm{patch}_i(x) \in \mathbb{R}^{P^2 C}$ is linearly projected via $E \in \mathbb{R}^{(P^2 C) \times D}$ into a $D$-dimensional embedding:

$$x_p^i = E^{\top}\,\mathrm{patch}_i(x) \in \mathbb{R}^{D}.$$

A learnable classification token $[\mathtt{CLS}] \in \mathbb{R}^{D}$ is prepended, and learnable positional embeddings $E_{\mathrm{pos}} \in \mathbb{R}^{(N+1) \times D}$ are added, forming the input sequence:

$$z_0 = \bigl[\,[\mathtt{CLS}];\; x_p^1;\; \dots;\; x_p^N\,\bigr] + E_{\mathrm{pos}}.$$
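
As a concrete illustration of the tokenization above, the following is a minimal PyTorch sketch. The module name `PatchEmbedding`, the default hyperparameters, and the use of a stride-$P$ convolution as the patch projection are illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative sketch: image -> patch tokens + [CLS] + positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # N = HW / P^2
        # A stride-P convolution is equivalent to flattening each PxP patch
        # and projecting it with E in R^{(P^2 C) x D}.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # learnable [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                            # x: (B, C, H, W)
        x = self.proj(x)                                             # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)                             # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)              # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                               # (B, N+1, D)
        return x + self.pos_embed                                    # z_0

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))               # (2, 197, 768)
```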

2. Transformer Encoder Architecture

The Transformer encoder processes the $(N+1)$-token sequence through $L$ identical layers, each comprising:

  • Pre-LayerNorm
  • Multi-Head Self-Attention (MSA) with $H$ heads:

$$\mathrm{MSA}(X) = \bigl[\,\mathrm{head}_1;\,\dots;\,\mathrm{head}_H\,\bigr]\, W^{O}$$

where

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$

and $Q_i = X W_i^{Q}$, $K_i = X W_i^{K}$, $V_i = X W_i^{V}$, with $d_k = D/H$.

  • Residual connection
  • Pre-LayerNorm
  • Two-layer MLP:

$$\mathrm{MLP}(X) = W_2\,\mathrm{GELU}\bigl(W_1 X + b_1\bigr) + b_2, \qquad W_1 \in \mathbb{R}^{rD \times D},\quad W_2 \in \mathbb{R}^{D \times rD},$$

where $r$ is the MLP expansion ratio ($r = 4$ in the standard configurations).

  • Residual connection

The stack updates as

$$
\begin{aligned}
z'_{\ell} &= z_{\ell-1} + \mathrm{MSA}\bigl(\mathrm{LN}(z_{\ell-1})\bigr), \\
z_{\ell} &= z'_{\ell} + \mathrm{MLP}\bigl(\mathrm{LN}(z'_{\ell})\bigr).
\end{aligned}
$$

After $L$ layers, the representation of the $[\mathtt{CLS}]$ token is used for image-level classification, following a final LayerNorm and a linear classification head.
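
The pre-LN update equations above can be sketched in PyTorch as follows. The class name `EncoderBlock`, the dropout placement, and the use of `nn.MultiheadAttention` are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Illustrative pre-LN encoder layer: z' = z + MSA(LN(z)); z = z' + MLP(LN(z'))."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                   # two-layer MLP with GELU, width r*D
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, z):                           # z: (B, N+1, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # residual around MSA
        return z + self.mlp(self.norm2(z))                  # residual around MLP

encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])  # e.g. L = 12 (ViT-B-like)
out = encoder(torch.randn(2, 197, 768))                        # (2, 197, 768)
```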

3. Model Variants and Scaling Dimensions

ViT explores the scaling of depth, width, and patch size. The canonical configurations, using the “ViT-X/$P$” notation ($X \in \{\mathrm{B}, \mathrm{L}, \mathrm{H}\}$ for Base, Large, Huge; $P$ is the patch size), are summarized below.

Model      Layers (L)   Dim (D)   Heads (H)   Params   Patch size (P)
ViT-B/16   12           768       12          86M      16
ViT-L/16   24           1024      16          307M     16
ViT-H/14   32           1280      16          632M     14

Smaller patches (lower $P$) lead to longer token sequences (larger $N$) and higher spatial resolution, increasing computational cost. Depth (number of layers) is the most effective axis for scaling model capacity, followed by increasing the width $D$. Reducing $P$ also yields accuracy gains without adding parameters, as the arithmetic below illustrates.
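
For concreteness, the sequence length $N$ at a 224×224 input (the usual pre-training resolution, an assumption not stated in the table above) works out as in this small illustrative snippet:

```python
# N = H*W / P^2 at a 224x224 input; self-attention cost grows roughly as O(N^2).
for name, p in [("ViT-B/16", 16), ("ViT-L/16", 16), ("ViT-H/14", 14)]:
    n = (224 // p) ** 2
    print(f"{name}: {n} patch tokens (+1 [CLS])")   # 196, 196, 256
```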

4. Training and Transfer Protocols

ViT models are pre-trained on large datasets: ImageNet-21k (14M images, 21k classes) and JFT-300M (303M images, 18k classes, de-duplicated against downstream test sets). The optimization strategy employs Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$), weight decay of ≈0.1, and a large batch size (4096). Training uses a warm-up period followed by linear or cosine learning-rate decay, and dropout of ≈0.1 on the MLPs (optionally none for the largest models).
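
A minimal sketch of this pre-training schedule in PyTorch, assuming placeholder values for the base learning rate, warm-up length, and total steps (the paper sweeps these per dataset):

```python
import math
import torch

model = torch.nn.Linear(768, 1000)            # stand-in for a ViT model
# Note: torch's Adam applies weight decay as an L2 term; the exact decay
# formulation used in the paper may differ.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), weight_decay=0.1)

warmup_steps, total_steps = 10_000, 300_000   # placeholder step counts

def lr_lambda(step):
    if step < warmup_steps:                   # linear warm-up
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# training loop: optimizer.step(); scheduler.step() once per batch
```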

Fine-tuning is performed by discarding the pre-trained classification head and adding a zero-initialized linear layer ($D \times K$, for $K$ target classes). Fine-tuning typically operates at higher input resolution (384–512 px), with the pre-trained positional embeddings 2D-interpolated to the new patch grid. It uses SGD with momentum 0.9, typical batch sizes of 512, cosine learning-rate decay, and no weight decay; a small grid search over learning rates is applied.
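
A sketch of the positional-embedding interpolation used when fine-tuning at higher resolution; the function name `resize_pos_embed` and the choice of bicubic mode are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Illustrative 2D interpolation of ViT positional embeddings.

    pos_embed: (1, 1 + old_grid**2, D) -- [CLS] embedding first, then the patch grid.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)  # (1, D, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. 224 px / P=16 -> 384 px / P=16: grid 14x14 -> 24x24
new_pe = resize_pos_embed(torch.randn(1, 197, 768), old_grid=14, new_grid=24)  # (1, 577, 768)
```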

ImageNet fine-tuning occurs over 20,000 steps; smaller target sets use 500–10,000 steps. When pre-trained only on ImageNet (1.3M images), ViT-Large models underperform relative to CNNs. Pre-training on larger datasets (IN-21k, JFT-300M) is critical for strong performance (Dosovitskiy et al., 2020).

5. Performance and Computational Characteristics

Empirical evaluations demonstrate:

  • ViT-H/14 achieves 88.55% top-1 on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the 19-task VTAB suite after JFT-300M pre-training.
  • ViT-L/16 achieves 87.76% on ImageNet, surpassing ResNet152×4 (BiT-L, 87.54%) while utilizing only 0.68k TPUv3-core-days (vs. 9.9k for BiT-L).
  • ViT-L/16 pre-trained only on ImageNet-21k achieves 85.30% top-1 in 0.23k core-days.

ViT models require 2–4× fewer pre-training FLOPs than comparably accurate ResNets (R50–R200×3). Hybrid ViT+CNN architectures offer advantages only at smaller compute budgets. Inference speed on TPUv3 is comparable to ResNets, with higher memory efficiency, permitting larger per-core batch sizes (Dosovitskiy et al., 2020).

6. Key Insights: Data, Inductive Bias, and Attention

The data scale used for pre-training is critical. On large datasets (14M–300M images), ViT models rapidly surpass CNNs, indicating that scale outweighs explicit convolutional inductive bias. Overfitting arises on small datasets, confirmed by few-shot linear probe experiments.

Model performance correlates more with total pre-training compute than with parameter count. Depth scaling (up to 32 layers, with diminishing returns beyond 64) is especially effective. Visualizations of self-attention show that some heads operate globally from early layers, while others are local; the receptive field size grows systematically with depth. Attention rollout analyses reveal that the $[\mathtt{CLS}]$-to-patch attention localizes on semantically salient regions.
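
A minimal sketch of the attention-rollout computation referenced above: head-averaged attention maps are augmented with the identity (to account for the residual path) and composed across layers. The function name and the row re-normalization are illustrative choices.

```python
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer attention tensors, each (B, H, N+1, N+1)."""
    rollout = None
    for attn in attentions:
        a = attn.mean(dim=1)                             # average over heads
        a = a + torch.eye(a.shape[-1])                   # add identity for the residual path
        a = a / a.sum(dim=-1, keepdim=True)              # re-normalize rows
        rollout = a if rollout is None else a @ rollout  # compose across layers
    return rollout                                       # (B, N+1, N+1)

# The [CLS] row (index 0, columns 1:) gives a saliency map over patch tokens:
layers = [torch.softmax(torch.randn(1, 12, 197, 197), dim=-1) for _ in range(12)]
cls_to_patches = attention_rollout(layers)[:, 0, 1:]     # (1, 196)
```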

Positional embeddings are critical: simple learnable 1D embeddings suffice, with no observed benefit for 2D or relative positional embeddings. All positional encoding variants outperform models without positional bias by a wide margin.

Hybrid models, combining a shallow ResNet stem with ViT, are beneficial primarily at low compute but offer no significant advantage when scaling to large data and model sizes.

Preliminary masked-patch prediction (BERT-style) self-supervision achieves 79.9% top-1 for ViT-B/16 on ImageNet, roughly 2% above training from scratch but still ∼4% below supervised pre-training (Dosovitskiy et al., 2020).

7. Significance and Future Directions

ViT demonstrates that a pure Transformer architecture with minimal vision-specific inductive bias can match and surpass state-of-the-art accuracy in image classification, provided access to large-scale training data and compute. This finding challenges the necessity of convolutional priors in computer vision and opens the door to Transformer-only architectures for tasks beyond classification, such as detection, segmentation, and self-supervised learning. A plausible implication is that, with sufficient scale, Transformer-based architectures may continue to subsume traditionally convolutional approaches as resources and datasets grow (Dosovitskiy et al., 2020).

References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.