Vision Transformers (ViT) Overview
- Vision Transformers (ViT) are neural network architectures that reinterpret images as sequences of patch embeddings combined with positional encodings.
- They use multiheaded self-attention within Transformer encoder blocks to capture global spatial relationships, offering scalability on large datasets.
- ViT models can be fine-tuned at new input resolutions via positional encoding interpolation, though their performance often relies on more extensive pre-training data than CNNs require.
Vision Transformers (ViT) are a class of neural network architectures that adapt the standard Transformer model—originally introduced for natural language processing—to visual data, notably by reinterpreting images as sequences of patch embeddings. Unlike conventional convolutional neural networks (CNNs), which utilize local connectivity and translation equivariance through convolutional kernels, ViTs process images as flat sequences of tokens corresponding to non-overlapping image patches. This approach replaces hand-crafted spatial inductive bias with data-driven learning of spatial and contextual relationships via self-attention, enabling superior scalability and flexibility for large-scale visual tasks.
1. Image Patch Embedding and Input Representation
ViT models begin by partitioning an input image $x \in \mathbb{R}^{H \times W \times C}$ (height $H$, width $W$, channels $C$) into $N$ non-overlapping patches of size $P \times P$, where $N = HW / P^2$. Each flattened patch $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ is linearly projected into a $D$-dimensional embedding via a matrix $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$. The resulting sequence of patch embeddings is concatenated with a learnable classification token ($x_{\text{class}}$) and combined with a positional encoding $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ to preserve spatial structure:

$$z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{\text{pos}}$$
The patch embedding stage is crucial as it adapts the Transformer to visual input and establishes the sequence of tokens for subsequent processing.
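As an illustration, a minimal PyTorch sketch of this stage follows; the module name `PatchEmbed` and the default dimensions (224-pixel inputs, 16-pixel patches, 768-dimensional embeddings, as in ViT-B/16) are illustrative assumptions rather than details taken from the text above.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one.

    A Conv2d whose kernel size equals its stride (both equal to the patch size)
    is mathematically equivalent to flattening each patch and multiplying by E.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and positional embeddings (N + 1 positions).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend class token -> (B, N+1, D)
        return x + self.pos_embed              # add positional encoding
```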
2. Transformer Encoder Architecture
The sequence of embeddings is input to a stack of $L$ identical Transformer encoder blocks, each consisting of multiheaded self-attention (MSA) and multilayer perceptron (MLP) sub-blocks, both preceded by layer normalization (LN) and coupled via residual connections. The computations in layer $\ell$ are:

$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \dots, L$$

Classification is performed using the representation corresponding to the class token after $L$ layers:

$$y = \mathrm{LN}(z_L^0)$$

This architecture enables global, content-dependent modeling, eschewing spatial locality constraints.
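A pre-norm encoder block matching the equations above can be sketched in PyTorch as follows; the class name `EncoderBlock`, the use of `nn.MultiheadAttention`, and the MLP expansion ratio of 4 are assumptions for illustration, not details from the text.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-LayerNorm Transformer encoder block: MSA + MLP with residuals."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, dim), nn.Dropout(dropout),
        )

    def forward(self, z):
        # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]
        # z_l = MLP(LN(z'_l)) + z'_l
        return z + self.mlp(self.norm2(z))
```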
3. Multiheaded Self-Attention Mechanism
The fundamental computational primitive of ViT is multiheaded self-attention (MSA). For a sequence $z \in \mathbb{R}^{N \times D}$, learned projections produce queries, keys, and values, $[q, k, v] = z\, U_{qkv}$ with $U_{qkv} \in \mathbb{R}^{D \times 3 D_h}$.
For $k$ heads (with head dimension $D_h = D / k$), each attention matrix is computed as:

$$A = \mathrm{softmax}\!\left(\frac{q k^\top}{\sqrt{D_h}}\right), \qquad \mathrm{SA}(z) = A\, v$$

The final MSA output is a concatenation of the individual head outputs, projected via $U_{\mathrm{msa}} \in \mathbb{R}^{k \cdot D_h \times D}$:

$$\mathrm{MSA}(z) = [\mathrm{SA}_1(z);\; \mathrm{SA}_2(z);\; \dots;\; \mathrm{SA}_k(z)]\, U_{\mathrm{msa}}$$

This mechanism provides long-range, dynamic receptive fields: any patch can attend to any other, facilitating flexible spatial reasoning without explicit local bias.
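To make the mechanism concrete, the sketch below computes the per-head scaled dot-product attention explicitly rather than relying on a library attention layer; all names (`MultiHeadSelfAttention`, `qkv`, `proj`) are illustrative.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Explicit MSA: softmax(q k^T / sqrt(D_h)) v per head, then output projection."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)    # plays the role of U_qkv
        self.proj = nn.Linear(dim, dim)       # plays the role of U_msa

    def forward(self, z):                      # z: (B, N, D)
        B, N, D = z.shape
        qkv = self.qkv(z)                      # (B, N, 3D)
        qkv = qkv.reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B, heads, N, D_h)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)            # A: (B, heads, N, N)
        out = attn @ v                         # (B, heads, N, D_h)
        out = out.transpose(1, 2).reshape(B, N, D)  # concatenate head outputs
        return self.proj(out)                  # final projection
```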
4. Training, Fine-Tuning, and Resolution Adjustment
ViTs are typically pre-trained with supervised objectives on massive datasets (e.g., ImageNet-21k, JFT-300M) and fine-tuned on specific downstream tasks. Adaptation to different image resolutions during fine-tuning is performed by interpolating the learned positional embeddings to match new grid sizes, since increased input resolution increases patch sequence length. The prediction head (e.g., a two-layer MLP) used during pre-training is replaced by a classifier adapted to the number of classes in the downstream dataset. The only explicit spatial inductive bias incorporated is the use of positional encodings and their 2D interpolation during transfer.
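One common way to realize this 2D interpolation is a bilinear resize of the grid portion of the positional-embedding table; the sketch below assumes a square patch grid with the class-token position stored at index 0, and the function name `interpolate_pos_embed` is illustrative.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid_size):
    """Resize learned positional embeddings to a new patch grid.

    pos_embed: (1, 1 + N_old, D), with the class-token position at index 0.
    new_grid_size: (H_new // P, W_new // P) for the fine-tuning resolution.
    """
    cls_pos, grid_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_size = int(grid_pos.shape[1] ** 0.5)           # assume a square grid
    dim = grid_pos.shape[-1]
    grid_pos = grid_pos.reshape(1, old_size, old_size, dim).permute(0, 3, 1, 2)
    grid_pos = F.interpolate(grid_pos, size=new_grid_size,
                             mode="bilinear", align_corners=False)
    grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([cls_pos, grid_pos], dim=1)        # (1, 1 + N_new, D)
```

For example, moving from 224×224 to 384×384 inputs with 16-pixel patches changes the patch grid from 14×14 to 24×24, so the call would pass `new_grid_size=(24, 24)`.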
5. Empirical Results and Comparative Analysis
ViT demonstrates state-of-the-art or competitive performance when trained or pre-trained on sufficiently large data:

| Benchmark | Model/Variant | Top-1 Accuracy (%) |
|-----------|---------------|--------------------|
| ImageNet  | ViT-H/14      | ~88.55             |
| CIFAR-100 | Not specified | ~94.55             |
ViT models outperform comparably sized CNNs (such as ResNet) on large-scale benchmarks, particularly as dataset size increases, and are more computationally efficient in pre-training (e.g., 2.5k TPUv3-core-days for ViT-H/14 vs. 9.9k for large ResNet-based models). When training data is limited, CNNs' strong local inductive biases aid performance, but ViTs surpass them as the scale grows. ViTs also transfer and generalize well to mid-sized and small benchmarks when pre-trained at scale.
6. Inductive Bias, Scalability, and Design Impact
ViT's methodology forgoes architectural inductive biases typical of CNNs (locality, translation equivariance), relying on data and model capacity to learn all spatial relationships. The only exceptions are the positional encodings and, optionally, 2D interpolation for fine-tuning. With access to large pre-training datasets, ViT models show that locality and other hand-crafted priors are not strictly necessary for high computer vision performance. Scalability—a property inherited from the original Transformer—enables straightforward model expansion and data parallelism, with efficiency gains in terms of resource use (e.g., core-days).
7. Practical Implementation and Limitations
In practice, deploying ViT involves:
- Preprocessing input images into the required patch grid, applying the linear patch embedding, and adding the class token and positional embeddings.
- Implementing the Transformer encoder stack as outlined above, ensuring constant latent dimensionality across layers.
- Carefully designing positional encoding interpolation for fine-tuning at higher image resolutions.
- Adopting large-scale supervised pre-training, or leveraging existing pre-trained ViT models for efficient transfer.
The principal caveat is that performance is highly dependent on pre-training data volume—without massive datasets, CNNs can still dominate on tasks with limited data. Another consideration is that while ViTs are computationally efficient in pre-training relative to comparable CNNs, the attention mechanism incurs quadratic complexity with respect to sequence length; model and patch size selection must balance sequence length, resource constraints, and desired resolution.
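To make the sequence-length trade-off concrete, the short sketch below (with assumed example resolutions and patch sizes, not figures from the text above) computes the token count and the size of the per-head attention matrix for a few configurations:

```python
def attention_cost(img_size, patch_size):
    """Sequence length and per-head attention-matrix size for square inputs."""
    n = (img_size // patch_size) ** 2 + 1   # patches plus the class token
    return n, n * n                          # tokens, entries in the N x N matrix

for img, p in [(224, 16), (384, 16), (224, 32)]:
    tokens, attn_entries = attention_cost(img, p)
    print(f"{img}px / {p}px patches -> {tokens} tokens, "
          f"{attn_entries:,} attention entries per head")
# 224/16 -> 197 tokens; 384/16 -> 577 tokens: ~2.9x the tokens, ~8.6x the attention cost.
```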
ViT marks a paradigm shift, demonstrating that direct application of the transformer architecture—without convolution or hierarchical design—can match or exceed convolutional approaches in vision when provided sufficient data and capacity. This has opened research avenues toward minimal-bias, highly scalable architectures for a variety of computer vision problems.