
Vision Transformer (ViT-B/14) Overview

Updated 30 June 2026
  • ViT-B/14 is a transformer-based vision architecture that processes images as sequential patch tokens and leverages multi-head self-attention for efficient image recognition.
  • It segments images into fixed-size patches, projecting them into an embedding space with learnable position encodings to capture spatial relationships.
  • Enforcing orthogonality in self-attention projections enhances stability, accelerates convergence, and boosts accuracy by preserving the feature geometry.

A Vision Transformer (ViT-B/14) is a pure transformer architecture for image recognition that processes images as sequences of fixed-size patches, where each patch is a token analogous to a word in natural language processing. ViT-B/14 specifically employs a 12-layer transformer encoder (base variant) with a hidden dimension of 768, using input patches of size 14×14 pixels. In its canonical form, ViT-B/14 achieves state-of-the-art results in large data regimes and can be further optimized with geometric constraints, such as orthogonality in projection matrices, as demonstrated by O-ViT-B/14 (Dosovitskiy et al., 2020, Fei et al., 2022).

1. Patch Embedding and Input Construction

ViT-B/14 segments a color image $x \in \mathbb{R}^{H \times W \times 3}$ into non-overlapping 14×14 patches. Each patch is flattened to a vector in $\mathbb{R}^{588}$ and projected into a 768-dimensional embedding space via a learnable matrix $E \in \mathbb{R}^{588 \times 768}$. A learnable 1D position embedding $E_{\text{pos}} \in \mathbb{R}^{(N+1) \times 768}$ encodes spatial information, where $N = (H \cdot W)/14^2$ is the number of patches, and an additional class token $x_{\text{cls}} \in \mathbb{R}^{768}$ is prepended to the patch sequence. The input sequence is

$$z_0 = [x_{\text{cls}}; x_1 E; \ldots; x_N E] + E_{\text{pos}},$$

where $z_0 \in \mathbb{R}^{(N+1) \times 768}$ (Dosovitskiy et al., 2020).
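The construction of $z_0$ can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `patch_embed`, the random parameter values, and the 224×224 input size are assumptions made for the example.

```python
import numpy as np

def patch_embed(image, E, x_cls, E_pos, patch=14):
    """Build z_0 from an (H, W, 3) image; H and W must be multiples of `patch`."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly into patches"
    gh, gw = H // patch, W // patch
    # Split rows and columns into (grid, within-patch) axes, then group by patch.
    patches = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * C)   # (N, 588)
    tokens = patches @ E                                     # (N, 768)
    z0 = np.concatenate([x_cls[None, :], tokens], axis=0)    # (N+1, 768)
    return z0 + E_pos

rng = np.random.default_rng(0)
H = W = 224                                  # assumed input resolution
N = (H // 14) * (W // 14)                    # 256 patches
image = rng.standard_normal((H, W, 3))
E = rng.standard_normal((588, 768)) * 0.02   # illustrative random values
x_cls = np.zeros(768)
E_pos = rng.standard_normal((N + 1, 768)) * 0.02

z0 = patch_embed(image, E, x_cls, E_pos)
print(z0.shape)  # (257, 768)
```

Row 0 of `z0` holds the class token plus its position embedding; rows 1 to N hold the projected patches in row-major grid order.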

2. Transformer Encoder Architecture

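ViT-B/14 uses $L=12$ identical transformer blocks, each comprising a multi-head self-attention (MSA) sublayer and an MLP sublayer, each preceded by LayerNorm (LN) and wrapped in a residual connection.

The forward computations at block $\ell$ are:

$$z'_\ell = \text{MSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_\ell = \text{MLP}(\text{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \ldots, L,$$

with the final output being the LayerNorm of the class token after $L$ blocks, $y = \text{LN}(z_L^0)$.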
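The block structure (pre-LayerNorm sublayers with residual connections, followed by a final LayerNorm on the class token) can be sketched as follows. This is a structural illustration under stated assumptions: the attention and MLP sublayers are passed in as callables, the demo uses toy stand-ins for them (uniform token averaging and a ReLU MLP) with a small hidden size of 32, and LayerNorm's learnable scale and shift are omitted.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    # Normalize each token over its feature axis (learnable scale/shift omitted).
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def encoder_block(z, msa, mlp):
    # Pre-LN residual block: attention sublayer, then MLP sublayer.
    z = z + msa(layer_norm(z))
    z = z + mlp(layer_norm(z))
    return z

def encode(z0, blocks):
    z = z0
    for msa, mlp in blocks:
        z = encoder_block(z, msa, mlp)
    return layer_norm(z[0])          # LayerNorm of the class token (row 0)

# Toy stand-ins so the sketch runs; a real model would use learned MSA and GELU MLP.
rng = np.random.default_rng(0)
D, T, L = 32, 5, 12                  # small hidden size and token count for the demo

def make_block():
    W1 = rng.standard_normal((D, 4 * D)) * 0.02
    W2 = rng.standard_normal((4 * D, D)) * 0.02
    msa = lambda z: np.broadcast_to(z.mean(axis=0), z.shape)  # uniform-attention stand-in
    mlp = lambda z: np.maximum(z @ W1, 0.0) @ W2              # ReLU stand-in for GELU
    return msa, mlp

blocks = [make_block() for _ in range(L)]
y = encode(rng.standard_normal((T, D)), blocks)
print(y.shape)  # (32,)
```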

3. Self-Attention Mechanism

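For each transformer block, the self-attention equations are:

$$Q = zW_Q, \quad K = zW_K, \quad V = zW_V, \qquad \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $z$ is the layer-normalized block input, $W_Q$, $W_K$, $W_V$ are learnable projection matrices, and $d_k$ is the per-head key dimension. The outputs of all heads are concatenated and mapped to the original hidden size via an output projection $W_O$ (Dosovitskiy et al., 2020).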
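A compact NumPy sketch of multi-head scaled dot-product self-attention follows. It assumes 12 heads (the usual ViT-Base setting, so each head has dimension 768/12 = 64), square 768×768 projection matrices that are split across heads, and random weights; the names `W_q`, `W_k`, `W_v`, `W_o` are illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(z, W_q, W_k, W_v, W_o, num_heads):
    T, D = z.shape
    d_k = D // num_heads

    def split(x):
        # (T, D) -> (heads, T, d_k)
        return x.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(z @ W_q), split(z @ W_k), split(z @ W_v)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (heads, T, T)
    heads = A @ V                                           # (heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, D)         # concatenate heads
    return concat @ W_o, A

rng = np.random.default_rng(0)
T, D, h = 257, 768, 12                       # 256 patches + class token
z = rng.standard_normal((T, D))
W_q, W_k, W_v, W_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

out, attn = multi_head_self_attention(z, W_q, W_k, W_v, W_o, num_heads=h)
print(out.shape, attn.shape)  # (257, 768) (12, 257, 257)
```

Each row of every head's attention map sums to 1, and the output has the same shape as the input, which is what lets the residual connection be applied.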

4. Effect of Patch Size and Computational Complexity

Adopting a patch size of 14×14 increases the patch count $N$ for a fixed image size (e.g., for $224 \times 224$ images, $N = 256$ rather than $N = 196$ for 16×16 patches). This yields finer spatial granularity but increases computational complexity, as attention map size and computation scale as $O(N^2)$; reducing patch size from 16 to 14 increases the memory and compute cost of attention by ∼70%, since $(256/196)^2 \approx 1.71$. The patch-projection matrix $E$ adapts its input dimension accordingly: $E \in \mathbb{R}^{588 \times 768}$ for $14 \cdot 14 \cdot 3 = 588$ (Dosovitskiy et al., 2020).
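The arithmetic behind these figures is easy to check. The short sketch below assumes a 224×224 RGB input and counts only the quadratic attention term (class token ignored); the helper names are illustrative.

```python
def num_patches(image_size, patch_size):
    # Number of non-overlapping square patches in a square image.
    grid = image_size // patch_size
    return grid * grid

def patch_dim(patch_size, channels=3):
    # Length of one flattened patch, i.e. the input dimension of E.
    return patch_size * patch_size * channels

n14 = num_patches(224, 14)
n16 = num_patches(224, 16)
ratio = (n14 / n16) ** 2          # attention cost scales as N^2

print(n14, n16, round(ratio, 2))  # 256 196 1.71
print(patch_dim(14), patch_dim(16))  # 588 768
```

A ratio of about 1.71 corresponds to the roughly 70% increase in attention cost quoted above. The MLP and projection costs grow only linearly in $N$, so the increase for the whole model is smaller than for attention alone.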

5. Orthogonality in Self-Attention: O-ViT-B/14

Standard ViT uses unconstrained linear projections for $W_Q$, $W_K$, and $W_V$, which can introduce scale ambiguity and distort the intrinsic geometry of the patch embeddings. The O-ViT-B/14 variant constrains $W_Q$, $W_K$, $W_V$ to the orthogonal group $O(d)$ via a Cayley-type map $W = (I - A)(I + A)^{-1}$, where $A$ is a skew-symmetric matrix ($A^\top = -A$) [Avron–Golub '53]. These orthogonal projections preserve inner products: for any $x, y \in \mathbb{R}^d$,

$$\langle Wx, Wy \rangle = x^\top W^\top W y = x^\top y = \langle x, y \rangle,$$

ensuring that distances and angles between embedded patches remain unchanged. Orthogonality is enforced efficiently: optimization occurs over unconstrained skew-symmetric matrices $A_Q$, $A_K$, $A_V$, with the Cayley map serving as a smooth and surjective parametrization. No explicit manifold optimization or retraction steps are required; standard optimizers such as Adam are sufficient (Fei et al., 2022).
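The parametrization can be illustrated with the classical Cayley transform. The sketch below is an assumption about the general technique, not the O-ViT code: it builds a skew-symmetric $A$ from an unconstrained matrix `M` (as `M - M.T`, a hypothetical choice), forms $W = (I - A)(I + A)^{-1}$ with a linear solve, and checks orthogonality and inner-product preservation numerically. Note that the classical transform produces only orthogonal matrices with determinant +1 and no eigenvalue −1.

```python
import numpy as np

def cayley(M):
    """Map an unconstrained square matrix M to an orthogonal matrix W."""
    A = M - M.T                          # skew-symmetric: A.T == -A
    I = np.eye(M.shape[0])
    # (I - A) and (I + A)^{-1} commute, so W = (I + A)^{-1} (I - A) as well.
    # I + A is always invertible for skew-symmetric A, so the solve is safe.
    return np.linalg.solve(I + A, I - A)

rng = np.random.default_rng(0)
d = 16
M = rng.standard_normal((d, d))          # free parameters a standard optimizer would update
W = cayley(M)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.allclose(W.T @ W, np.eye(d)))           # True: W is orthogonal
print(np.isclose((W @ x) @ (W @ y), x @ y))      # True: inner products preserved
```

Because `M` is unconstrained, gradients can flow through `cayley` to `M` and be applied by an ordinary optimizer; the single linear solve is the matrix-inversion cost mentioned in the text.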

6. Empirical Performance and Training Considerations

ViT-B/14 demonstrates high accuracy in large-scale data regimes, outperforming convolutional networks at scale. However, standard ViT-B/14 is under-regularized with limited data (e.g., ImageNet-1K). The O-ViT-B/14 variant yields consistent 1–4% absolute top-1 accuracy improvements across CIFAR-10, CIFAR-100, ImageNet-1K (fine-tuned), SVHN, and face recognition subsets. Notably, O-ViT-B/14 raises CIFAR-10 accuracy from 92.1% to 95.7% (+3.6%), and SVHN accuracy from 96.5% to 98.2% (+1.7%). O-ViT-B/14 reaches 90% of its final accuracy in 10–20% fewer epochs than vanilla ViT-B/14, with negligible extra compute; the matrix inversion introduces only 1–2% runtime overhead per block (Fei et al., 2022).

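| Model | Accuracy Gain (absolute top-1) | Convergence Speed (epochs to 90% of final accuracy) | Runtime Overhead (per block) |
|---|---|---|---|
| O-ViT-B/14 | +1–4% | 10–20% faster | +1–2% |
| ViT-B/14 | - | - | - |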

7. Geometric and Optimization Implications

Enforcing orthogonality preserves the relational geometry of patch-token feature spaces. In standard ViTs, the learned projections can arbitrarily scale or distort the space, introducing scale ambiguity that can lead to softmax saturation and instabilities in gradient backpropagation. By constraining the $Q$/$K$/$V$ projections to be orthogonal, O-ViT-B/14 eliminates such ambiguity: attention is determined solely by angular similarity, and the feature manifold's geometry is preserved. Empirically, these geometric constraints yield more stable gradients, faster convergence, and higher accuracy, particularly in small-data or noisy settings where conventional ViT geometry would be more fragile (Fei et al., 2022).
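The link between projection scale and softmax saturation can be seen in a toy experiment. The sketch below is illustrative only (random tokens, a single head, hypothetical sizes): multiplying unconstrained query and key projections by a factor $c$ multiplies the attention logits by $c^2$, which drives the attention rows toward one-hot distributions, whereas an orthogonal projection leaves token norms unchanged and so cannot inflate the logits in this way.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mean_attention_entropy(z, W_q, W_k):
    # Average entropy of the attention rows; values near 0 mean saturated softmax.
    logits = (z @ W_q) @ (z @ W_k).T / np.sqrt(W_q.shape[1])
    p = softmax(logits)
    return float(-(p * np.log(np.clip(p, 1e-300, None))).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, d = 16, 32                                    # toy sizes, not ViT-B/14 dimensions
z = rng.standard_normal((T, d))
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)

# Unconstrained projections: rescaling by c sharpens attention.
entropies = [mean_attention_entropy(z, c * W_q, c * W_k) for c in (1, 4, 16)]
print(entropies)   # decreasing toward 0 as c grows

# Orthogonal projection (from a QR factorization): token norms are preserved.
Q_orth, _ = np.linalg.qr(rng.standard_normal((d, d)))
print(np.allclose(np.linalg.norm(z @ Q_orth, axis=1), np.linalg.norm(z, axis=1)))  # True
```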


In summary, ViT-B/14 is a scalable transformer-based vision architecture characterized by image-to-patch tokenization, multi-head self-attention, and deep MLP blocks, whose performance can be substantially enhanced by imposing orthogonality constraints on attention projections. This geometric regularization eliminates scale ambiguities, preserves the feature space, and leads to measurable empirical improvements with minimal computational cost (Dosovitskiy et al., 2020, Fei et al., 2022).

References (2)
