Vision Transformer (ViT-B/14) Overview

Updated 30 June 2026

ViT-B/14 is a transformer-based vision architecture that processes images as sequential patch tokens and leverages multi-head self-attention for efficient image recognition.
It segments images into fixed-size patches, projecting them into an embedding space with learnable position encodings to capture spatial relationships.
Enforcing orthogonality in self-attention projections enhances stability, accelerates convergence, and boosts accuracy by preserving the feature geometry.

A Vision Transformer (ViT-B/14) is a pure transformer architecture for image recognition that processes images as sequences of fixed-size patches, where each patch is a token analogous to a word in natural language processing. ViT-B/14 specifically employs a 12-layer transformer encoder (base variant) with a hidden dimension of 768, using input patches of size 14×14 pixels. In its canonical form, ViT-B/14 achieves state-of-the-art results in large data regimes and can be further optimized with geometric constraints, such as orthogonality in projection matrices, as demonstrated by O-ViT-B/14 (Dosovitskiy et al., 2020, Fei et al., 2022).

1. Patch Embedding and Input Construction

ViT-B/14 segments a color image $x \in \mathbb{R}^{H \times W \times 3}$ into non-overlapping 14×14 patches. Each patch is flattened to a vector in $\mathbb{R}^{588}$ (since $14 \cdot 14 \cdot 3 = 588$) and projected into a 768-dimensional embedding space via a learnable matrix $E \in \mathbb{R}^{588 \times 768}$. This yields $N = (H \cdot W)/(14^2)$ patch tokens, assuming $H$ and $W$ are multiples of 14. A learnable class token $x_{\text{cls}} \in \mathbb{R}^{768}$ is prepended to the patch sequence, and a learnable 1D position embedding $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times 768}$ encodes spatial information. The input sequence is

$z_0 = [x_{\text{cls}}; x_1 E; \ldots; x_N E] + E_{\text{pos}},$

where $z_0 \in \mathbb{R}^{(N+1) \times 768}$ (Dosovitskiy et al., 2020).
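The input construction above can be sketched in NumPy as follows. This is a minimal illustration, assuming a 224×224 input and randomly initialized parameters; the function name `patch_embed` and the initialization scales are hypothetical and not taken from the cited papers.

```python
import numpy as np

def patch_embed(image, E, x_cls, E_pos, patch=14):
    """Build z_0 from an (H, W, 3) image; H and W must be multiples of `patch`."""
    H, W, C = image.shape
    if H % patch or W % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = H // patch, W // patch
    # Cut the image into a (gh, gw) grid of patches, then flatten each
    # patch to 14 * 14 * 3 = 588 values.
    patches = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * C)    # (N, 588)
    tokens = patches @ E                                     # (N, 768)
    z0 = np.concatenate([x_cls[None, :], tokens], axis=0)    # (N + 1, 768)
    return z0 + E_pos

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                 # assumed input resolution
E = rng.normal(scale=0.02, size=(588, 768))       # patch projection
x_cls = np.zeros(768)                             # class token
E_pos = rng.normal(scale=0.02, size=(257, 768))   # N + 1 = 257 positions
z0 = patch_embed(image, E, x_cls, E_pos)
print(z0.shape)  # (257, 768)
```

With a 224×224 image the sequence holds 256 patch tokens plus the class token, so $z_0$ has 257 rows.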

2. Transformer Encoder Architecture

ViT-B/14 uses $L=12$ identical transformer blocks, each comprising:

Multi-Head Self-Attention: $n_h=12$ attention heads, each operating on 64-dimensional subspaces ($d_h = 768/12 = 64$).
Feedforward MLP: Two linear layers with a GELU nonlinearity, expanding the hidden dimension by a factor of 4 ($4 \times 768 = 3072$ dimensions).
Pre-LayerNorm and Residual Connections: LayerNorm is applied before each sublayer, with residual connections around multi-head self-attention (MHSA) and the MLP (Dosovitskiy et al., 2020, Fei et al., 2022).

The forward computations at block $\ell$ are:

$z'_\ell = \text{MHSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_\ell = \text{MLP}(\text{LN}(z'_\ell)) + z'_\ell,$

with the final output being the LayerNorm of the class token after $L$ blocks.
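The pre-LayerNorm residual structure described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the attention sublayer is replaced by a placeholder linear map (`mhsa_stub`), LayerNorm omits its learnable scale and shift, biases are dropped, and the same weights are reused across all 12 blocks purely to keep the example short. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature axis (learnable scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def encoder_block(z, mhsa, mlp):
    # Pre-LN: normalize, apply the sublayer, then add the residual.
    z_mid = mhsa(layer_norm(z)) + z
    return mlp(layer_norm(z_mid)) + z_mid

rng = np.random.default_rng(0)
d, d_mlp, L = 768, 3072, 12
W1 = rng.normal(scale=0.02, size=(d, d_mlp))
W2 = rng.normal(scale=0.02, size=(d_mlp, d))
W_mix = rng.normal(scale=0.02, size=(d, d))

def mlp(x):
    # Two linear layers with GELU: 768 -> 3072 -> 768.
    return gelu(x @ W1) @ W2

def mhsa_stub(x):
    # Placeholder for the attention sublayer; not real attention.
    return x @ W_mix

z = rng.normal(size=(17, d))       # toy sequence: 1 class token + 16 patch tokens
for _ in range(L):                 # weights shared across blocks only for brevity
    z = encoder_block(z, mhsa_stub, mlp)
cls_out = layer_norm(z)[0]         # LayerNorm of the class token
print(cls_out.shape)               # (768,)
```

Passing the sublayers in as callables keeps the residual wiring separate from the attention and MLP details, so a real attention function can be dropped in without changing `encoder_block`.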

3. Self-Attention Mechanism

For each transformer block, the self-attention equations for a single head are:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_h}}\right) V,$

where $Q = z W_Q$, $K = z W_K$, $V = z W_V$ and $d_h = 64$. The outputs of all heads are concatenated and mapped to the original hidden size via $W_O \in \mathbb{R}^{768 \times 768}$ (Dosovitskiy et al., 2020).
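A compact NumPy sketch of multi-head scaled dot-product attention, as described above, may help make the shapes concrete. It assumes the per-head query, key, and value projections are stacked into single 768×768 matrices (a common implementation choice, not something the source specifies), omits biases, and uses the hypothetical names `mhsa`, `W_Q`, `W_K`, `W_V`, and `W_O`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(z, W_Q, W_K, W_V, W_O, n_heads=12):
    """Multi-head self-attention on a (T, d) token sequence."""
    T, d = z.shape
    d_h = d // n_heads                            # 768 / 12 = 64

    def split(x):                                 # (T, d) -> (n_heads, T, d_h)
        return x.reshape(T, n_heads, d_h).transpose(1, 0, 2)

    Q, K, V = split(z @ W_Q), split(z @ W_K), split(z @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)    # (n_heads, T, T)
    attn = softmax(scores, axis=-1)                     # each row sums to 1
    heads = attn @ V                                    # (n_heads, T, d_h)
    concat = heads.transpose(1, 0, 2).reshape(T, d)     # concatenate the heads
    return concat @ W_O, attn

rng = np.random.default_rng(0)
d = 768
z = rng.normal(size=(17, d))                      # toy sequence of 17 tokens
W_Q, W_K, W_V, W_O = (rng.normal(scale=0.02, size=(d, d)) for _ in range(4))
out, attn = mhsa(z, W_Q, W_K, W_V, W_O)
print(out.shape, attn.shape)                      # (17, 768) (12, 17, 17)
```

The attention tensor has one T×T map per head, which is where the quadratic dependence on sequence length comes from.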

4. Effect of Patch Size and Computational Complexity

Adopting a patch size of 14×14 increases the patch count $N$ for a fixed image size (e.g., for $224 \times 224$ images, $N = 256$ rather than $N = 196$ for 16×16 patches). This yields finer spatial granularity but increases computational complexity, as attention map size and computation scale as $O(N^2)$; reducing the patch size from 16 to 14 therefore increases the memory and compute cost of attention by ∼70% ($256^2/196^2 \approx 1.71$). The patch-projection matrix $E$ adapts its input dimension accordingly: $E \in \mathbb{R}^{588 \times 768}$ for $14 \cdot 14 \cdot 3 = 588$ (Dosovitskiy et al., 2020).
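The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes a square 224×224 RGB input, which is consistent with the ∼70% figure quoted above; `patch_stats` is a hypothetical helper name.

```python
def patch_stats(image_size, patch_size, channels=3):
    """Patch count and flattened patch dimension for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    n = (image_size // patch_size) ** 2
    return n, patch_size * patch_size * channels

n14, dim14 = patch_stats(224, 14)
n16, dim16 = patch_stats(224, 16)
print(n14, dim14)                    # 256 588
print(n16, dim16)                    # 196 768
print(round(n14 / n16, 3))           # 1.306 (token count, linear-cost terms)
print(round(n14**2 / n16**2, 3))     # 1.706 (attention map, quadratic term)
```

The ∼70% increase applies to the quadratic attention term; costs that grow linearly with the number of tokens, such as the MLP, rise by about 31% under the same assumption.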

5. Orthogonality in Self-Attention: O-ViT-B/14

Standard ViT uses unconstrained linear projections for $W_Q$, $W_K$, and $W_V$, which can introduce scale ambiguity and distort the intrinsic geometry of the patch embeddings. The O-ViT-B/14 variant constrains $W_Q$, $W_K$, $W_V$ to the orthogonal group $O(d)$ via a Cayley-type map $W = (I - A)(I + A)^{-1}$, where $A$ is a skew-symmetric matrix ($A^\top = -A$) [Avron–Golub '53]. These orthogonal projections preserve inner products: for any $x, y \in \mathbb{R}^{d}$,

$\langle x W, y W \rangle = x W W^\top y^\top = \langle x, y \rangle,$

ensuring that distances and angles between embedded patches remain unchanged. Orthogonality is enforced efficiently: optimization occurs over unconstrained skew-symmetric matrices $A_Q$, $A_K$, $A_V$, with the Cayley map serving as a smooth parametrization. Note that the classical Cayley transform reaches every orthogonal matrix with determinant $+1$ and no eigenvalue equal to $-1$, rather than the full orthogonal group. No explicit manifold optimization or retraction steps are required; standard optimizers such as Adam are sufficient (Fei et al., 2022).
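A minimal NumPy sketch of such a parametrization follows, assuming the classical Cayley form $W = (I - A)(I + A)^{-1}$ with $A = M - M^\top$ built from an unconstrained matrix $M$. The names `cayley` and `M` and the toy dimension 64 are illustrative assumptions; the exact construction used by Fei et al. (2022) may differ.

```python
import numpy as np

def cayley(M):
    """Map an unconstrained square matrix M to an orthogonal matrix."""
    A = M - M.T                      # skew-symmetric by construction: A.T == -A
    I = np.eye(M.shape[0])
    # (I - A) and (I + A)^{-1} commute, so solving (I + A) W = (I - A)
    # yields W = (I - A)(I + A)^{-1} without forming the inverse explicitly.
    return np.linalg.solve(I + A, I - A)

rng = np.random.default_rng(0)
d = 64                               # toy dimension for the check
W = cayley(rng.normal(size=(d, d)))
x, y = rng.normal(size=d), rng.normal(size=d)

print(np.allclose(W.T @ W, np.eye(d)))        # True: W is orthogonal
print(np.isclose((x @ W) @ (y @ W), x @ y))   # True: inner products are preserved
```

Because $I + A$ is always invertible for a real skew-symmetric $A$, the map is defined for every $M$, which is why an unconstrained optimizer can update $M$ directly.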

6. Empirical Performance and Training Considerations

ViT-B/14 demonstrates high accuracy in large-scale data regimes, outperforming convolutional networks at scale. However, standard ViT-B/14 is under-regularized when data is limited (e.g., ImageNet-1K). The O-ViT-B/14 variant yields consistent absolute top-1 accuracy improvements of 1–4 percentage points across CIFAR-10, CIFAR-100, ImageNet-1K (fine-tuned), SVHN, and face recognition subsets. Notably, O-ViT-B/14 raises CIFAR-10 accuracy from 92.1% to 95.7% (+3.6 points) and SVHN accuracy from 96.5% to 98.2% (+1.7 points). O-ViT-B/14 reaches 90% of its final accuracy in 10–20% fewer epochs than vanilla ViT-B/14, with negligible extra compute; the matrix inversion introduces only 1–2% runtime overhead per block (Fei et al., 2022).

Model	Top-1 Accuracy Gain (absolute)	Epochs to 90% of Final Accuracy	Runtime Overhead (per block)
ViT-B/14 (baseline)	—	—	—
O-ViT-B/14	+1–4%	10–20% fewer	+1–2%

7. Geometric and Optimization Implications

Enforcing orthogonality preserves the relational geometry of patch-token feature spaces. In standard ViTs, the learned projections can arbitrarily scale or distort the space, introducing scale ambiguity that can lead to softmax saturation and instabilities in gradient backpropagation. By constraining the $W_Q$/$W_K$/$W_V$ projections to be orthogonal, O-ViT-B/14 eliminates such ambiguity: attention is determined solely by angular similarity, and the feature manifold's geometry is preserved. Empirically, these geometric constraints yield more stable gradients, faster convergence, and higher accuracy, particularly in small-data or noisy settings where conventional ViT geometry would be more fragile (Fei et al., 2022).
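The link between projection scale and softmax saturation can be illustrated with a toy NumPy experiment. This is a sketch under assumptions, not a reproduction of any result in Fei et al. (2022): the tokens are random, the orthogonal matrices are drawn with a QR decomposition for brevity (not the Cayley map), and the scale factor 5 is arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_attention_entropy(z, W_q, W_k):
    """Average entropy of the attention rows; values near 0 indicate saturation."""
    d = z.shape[1]
    logits = (z @ W_q) @ (z @ W_k).T / np.sqrt(d)
    p = softmax(logits)
    return float(-(p * np.log(np.clip(p, 1e-300, None))).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, d = 16, 64
z = rng.normal(size=(T, d))
W_q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
W_k, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix

h_orth = mean_attention_entropy(z, W_q, W_k)
h_scaled = mean_attention_entropy(z, 5.0 * W_q, 5.0 * W_k)   # logits grow 25x
print(h_orth > h_scaled)   # True: scaling the projections sharpens attention
```

Scaling both projections by a factor $c$ multiplies every logit by $c^2$, which acts like lowering the softmax temperature. An orthogonal projection leaves token norms unchanged, so this degree of freedom is removed.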

In summary, ViT-B/14 is a scalable transformer-based vision architecture characterized by image-to-patch tokenization, multi-head self-attention, and deep MLP blocks, whose performance can be substantially enhanced by imposing orthogonality constraints on attention projections. This geometric regularization eliminates scale ambiguities, preserves the feature space, and leads to measurable empirical improvements with minimal computational cost (Dosovitskiy et al., 2020, Fei et al., 2022).

References

Dosovitskiy et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

Fei et al. (2022). O-ViT: Orthogonal Vision Transformer.
