ViT-Small/Tiny: Compact Vision Transformers

Updated 30 January 2026
  • ViT-Small/Tiny are compact Vision Transformer variants defined by reduced embedding dimensions, fewer layers, and lower computational costs compared to larger models.
  • They employ advanced training protocols including learned augmentations, self-supervised pretraining, and distillation to boost performance on limited datasets.
  • Empirical results show that these models achieve competitive accuracy on benchmarks like TinyImageNet and CIFAR while operating with lower latency and FLOPs.

A "ViT-Small" or "ViT-Tiny" refers to scaled-down variants of the Vision Transformer (ViT) architecture that are optimized for resource-constrained environments, small datasets, or latency-critical inference. These models are explicitly characterized by reductions in embedding dimension, number of heads, network depth, or stage widths relative to the canonical ViT-Base/16. Such configurations are now widely used across small- and medium-scale vision tasks, with a variety of empirical design strategies and advanced training protocols developed to maximize the efficiency–accuracy trade-off. Representative architectures include: (1) width/depth scaled vanilla ViTs, (2) tiny windowed/hierarchical ViTs such as Swin-Tiny/Small and TinyViT, (3) NAS-derived ViTs, (4) plug-in modules for channel/memory efficiency, and (5) distilled and self-supervised variants enabling training on small or tiny datasets.

1. Architectural Definitions and Scaling Parameters

ViT-Tiny and ViT-Small configurations are strictly defined by adjustments to core ViT hyperparameters: embedding dimension D, number of encoder layers L, head count H, and MLP hidden size. The canonical DeiT family (and by extension, most literature) defines:

| Variant | Layers (L) | Embedding (D) | Heads (H) | MLP Hidden | Params | FLOPs (G) |
|---|---|---|---|---|---|---|
| ViT-Tiny | 12 | 192 | 3 | 768 | 5.7 M | 1.3 |
| ViT-Small | 12 | 384 | 6 | 1536 | 22 M | 4.6 |
| ViT-Base | 12 | 768 | 12 | 3072 | 86 M | 17.6 |

For hybrid/hierarchical tiny ViTs (e.g., TinyViT, Swin-Tiny), additional reductions in window/patch size, stage width, and MBConv expansions are performed. For instance, TinyViT-5M uses four stages with D = [64, 128, 160, 320] and L = [2, 2, 6, 2] for a total of 5.4M parameters (Wu et al., 2022).
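
These configurations can be sanity-checked directly from the hyperparameters. The short Python sketch below is our own back-of-the-envelope count, assuming 224×224 inputs, 16×16 patches, a CLS token, learned positional embeddings, and a 1000-class head; it reproduces the parameter totals in the table:

```python
# Approximate ViT parameter counts from (layers L, embed dim D, MLP hidden M).
# Assumes standard DeiT/ViT settings: 224x224 input, 16x16 patches, CLS token,
# learned positional embeddings, and a 1000-class classification head.

def vit_param_count(L, D, M, num_patches=196, num_classes=1000, in_chans=3, patch=16):
    patch_embed = in_chans * patch * patch * D + D      # conv projection + bias
    pos_embed   = (num_patches + 1) * D                 # includes CLS position
    cls_token   = D
    attn  = 3 * D * D + 3 * D + D * D + D               # QKV + output projection
    mlp   = D * M + M + M * D + D                       # two linear layers
    norms = 2 * 2 * D                                   # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * D
    head  = D * num_classes + num_classes
    return patch_embed + pos_embed + cls_token + L * block + final_norm + head

for name, (L, D, M) in {"ViT-Tiny": (12, 192, 768),
                        "ViT-Small": (12, 384, 1536),
                        "ViT-Base": (12, 768, 3072)}.items():
    print(f"{name:10s} ~{vit_param_count(L, D, M) / 1e6:.1f} M params")
# -> roughly 5.7 M, 22.1 M, and 86.6 M, matching the table to within rounding.
```

Because the D² terms dominate, halving the embedding dimension roughly quarters the per-block parameter count, which is why ViT-Tiny sits near 5.7M while ViT-Small sits near 22M at the same depth.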

2. Training Protocols and Specialized Optimization for Small/Tiny ViTs

Small ViTs risk optimization instability and overfitting when trained on "tiny" datasets or with weak supervision, due to the lack of convolutional inductive bias and limited representational headroom. Modern training regimes for ViT-Tiny/Small therefore incorporate:

  • strong, learned augmentations (e.g., AutoAugment, CutMix);
  • self-supervised pretraining on the target data itself (DINO-style SSL, MAE with high mask ratios);
  • knowledge and token-relation distillation from larger pretrained teachers (e.g., TinyViT, TinyMIM).
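
As a concrete illustration of the augmentation-heavy supervised recipe, the sketch below assembles a DeiT-style training setup in PyTorch. The specific transform choices, hyperparameter values, and the timm model name are illustrative, common defaults rather than the exact recipe of any paper cited here:

```python
# Illustrative supervised training setup for a tiny ViT with strong, learned
# augmentations. Values are common DeiT-style defaults, not paper-exact settings.
import torch
import torch.nn as nn
import timm
from torchvision import transforms

model = timm.create_model("vit_tiny_patch16_224", num_classes=1000)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),  # learned policy
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),
])
# CutMix/MixUp are typically applied per batch (e.g., via timm.data.Mixup)
# rather than inside the per-image transform pipeline.

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # soft targets help small ViTs
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
```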

3. Empirical Efficiency and Performance Trade-offs

Performance metrics for ViT-Small/Tiny variants are benchmarked over a range of vision tasks and datasets: ImageNet-1K, CIFAR-10/100, TinyImageNet, medical datasets (DermaMNIST), and application-specific tasks (semiconductor defect classification, digital microscopy autofocusing). A simple parameter/latency measurement sketch follows the list below.

  • On TinyImageNet, ViT-Tiny (≈5.5M params, 1.1G FLOPs) achieves 86.1% top-1, while ViT-Small (≈21.7M, 4.25G) reaches 88.7%, both with sub-5 ms inference latency (Amangeldi et al., 13 May 2025).
  • For wafer map defect detection, ViT-Tiny (5.7M, 1.2G) delivers 98.4% F1, outperforming both ViT-Base and heavy CNN models with vastly reduced compute (Mohammad et al., 3 Apr 2025).
  • On CIFAR-10/100, MAE-pretrained ViT-Tiny (3.64M, 0.26G) achieves 96.4% / 78.2%, outperforming equivalently sized CNNs and prior "light" transformers at the same MAC count (Tan, 2024).
  • Mobile NAS-derived variants (ElasticViT-Tiny/Small) achieve 61.1–77.2% top-1 on ImageNet at 37–218 MFLOPs with real-time Pixel-device inference (Tang et al., 2023).
  • Plug-and-play enhancements such as Channel Shuffle can add +2–3% top-1 to vanilla tiny ViTs (e.g., 72.2% → 74.4% for DeiT-Tiny at 1.3G MACs) (Xu et al., 2023).
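
The latency numbers quoted above are hardware-, batch-size-, and resolution-dependent. A quick way to reproduce parameter counts and per-image latency on one's own hardware is sketched below; the timm model names are assumed, and absolute timings will differ from the cited figures:

```python
# Rough per-image latency and parameter count for ViT-Tiny vs. ViT-Small.
# Assumes the timm model names below; absolute numbers depend on hardware.
import time
import torch
import timm

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 3, 224, 224, device=device)

for name in ["vit_tiny_patch16_224", "vit_small_patch16_224"]:
    model = timm.create_model(name, pretrained=False).to(device).eval()
    with torch.no_grad():
        for _ in range(10):                      # warm-up
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        ms = (time.perf_counter() - start) * 1000 / 100
    params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {params:.1f} M params, {ms:.2f} ms / image")
```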

4. Methodological Extensions: NAS, Modularization, and Flexible Slicing

Tiny and Small ViTs are effective testbeds for methodological advances in vision transformer research, owing to stringent parameter and efficiency budgets:

  • NAS for ViT (ElasticViT) defines “tiny”/“small” search spaces over width, depth, MLP-ratio, patch size, and operator types, using conflict-aware sampling to optimize mobile-aware supernets (Tang et al., 2023).
  • Modularization: The Channel Shuffle module enriches local feature mixing with negligible cost, partitioning the channels into attended and idle groups and using interleaved shuffling for information flow (Xu et al., 2023); a minimal shuffling sketch follows this list.
  • Mixture-of-Experts (Mobile V-MoE) architectures use image-level super-class routers to activate sparse expert MLPs at minimal overhead; dense ViT-Tiny (12×192) is outperformed by its MoE variant by +3.39% on ImageNet (59.5→62.9%) at comparable FLOPs (Daxberger et al., 2023).
  • Slicing/Scala: A full ViT can be "sliced" into Tiny and Small variants via width selection (e.g., r = 0.25 or r = 0.5), with subnets activated on demand for flexible dynamic inference. Isolated activation and scale-coordination losses ensure that subnets trained jointly match the performance of independently trained models (ViT-Tiny at r = 0.25 achieves 58.7% top-1 at 0.4G FLOPs; ViT-Small at r = 0.5 gives 68.3% at 1.3G) (Zhang et al., 2024).
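
The interleaved shuffle from the modularization bullet can be captured in a few lines. The sketch below is a generic channel shuffle over token features, under our own assumptions about group count and placement within the block; it is not the exact module of Xu et al. (2023):

```python
# Minimal sketch of interleaved channel shuffling for token features, in the
# spirit of the Channel Shuffle module described above. The two-group split
# and the usage pattern are illustrative assumptions.
import torch
import torch.nn as nn

class ChannelShuffle(nn.Module):
    def __init__(self, groups: int = 2):
        super().__init__()
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels). Interleave channels across groups so
        # that "attended" and "idle" channel groups exchange information.
        B, N, C = x.shape
        g = self.groups
        return x.reshape(B, N, g, C // g).transpose(2, 3).reshape(B, N, C)

x = torch.randn(8, 197, 192)        # ViT-Tiny token features (batch, tokens, dim)
y = ChannelShuffle(groups=2)(x)
print(y.shape)                      # torch.Size([8, 197, 192])
```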

5. Training on Small and Tiny Datasets

Unlike standard ViT/DeiT pipelines, which rely on large-scale pretraining (e.g., ImageNet-1K or JFT-300M), several papers have demonstrated robust training of ViT-Small/Tiny directly on "tiny" datasets using self-distillation, minimal scaling, or advanced augmentation.

  • Gani et al. show that a DINO-style SSL scheme on the same small dataset yields stable, high-performing ViT-Tiny models (e.g., CIFAR-10 96.41%, CIFAR-100 79.15%, TinyImageNet 63.36%, all at ≈2.8M params) (Gani et al., 2022).
  • MAE pretraining with a high mask ratio (75%) on minimally upscaled CIFAR enables strong low-data generalization for ViT-Tiny (0.26G MACs) without requiring external pretraining (Tan, 2024); a minimal masking sketch follows this list.
  • TinyMIM demonstrates that token-relation distillation from large MIM-pretrained teachers is essential: ViT-Tiny/16 reaches 75.8% top-1 (+4.2 pp over direct MAE pretraining) or 79.6% (with deeper distillation) on ImageNet-1K (Ren et al., 2023).
  • Semiconductor defect classification with ViT-Tiny illustrates the suitability of such models for industrial-scale tasks with limited or skewed data, maintaining performance and F1 under subsampled training (Mohammad et al., 3 Apr 2025).
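
For the MAE-style recipe referenced above, the core ingredient is random patch masking at a high ratio before the encoder. Below is a minimal sketch with a 75% mask ratio; shapes, function names, and variable names are illustrative rather than taken from a specific codebase:

```python
# Minimal sketch of MAE-style random patch masking at a 75% mask ratio.
# Only the visible 25% of patch tokens would be passed to the ViT encoder.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) patch embeddings (no CLS token)."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)   # random score per patch
    ids_shuffle = noise.argsort(dim=1)                # random permutation
    ids_keep = ids_shuffle[:, :len_keep]              # indices of visible patches
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)    # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_shuffle

tokens = torch.randn(4, 196, 192)                     # ViT-Tiny patch embeddings
visible, mask, _ = random_masking(tokens)
print(visible.shape, mask.sum(dim=1))                 # (4, 49, 192); 147 masked each
```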

6. Practical Guidelines, Limitations, and Applicability

Major practical design advice for small/tiny ViTs includes:

  • Strong, learned augmentations (AutoAugment, CutMix) are mandatory; disabling any single component costs roughly 1% top-1 accuracy (Wu, 7 Jan 2025).
  • Use learnable positional encodings and, when compressing further, multi-CLS tokens to preserve global context (Wu, 7 Jan 2025).
  • Low-rank compression in attention, applied only to the query projection, can conserve memory without a significant accuracy drop (–0.3% for Q alone vs. –1.9% for full QKV) (Wu, 7 Jan 2025); a minimal sketch follows this list.
  • Pretraining distillation unlocks higher representational capacity (e.g., TinyViT, TinyMIM, Mobile V-MoE), yielding top-1 gains of 2–4pp over non-distilled tiny ViTs (Wu et al., 2022, Ren et al., 2023, Daxberger et al., 2023).
  • Trade-off curves: ViT-Tiny is best where extreme model/cost constraints dominate, at a modest accuracy cost; ViT-Small offers an optimal balance of accuracy and efficiency for many applied domains (Amangeldi et al., 13 May 2025).
  • The gains of channel shuffle or MoE modules diminish as network size increases; applicability is maximized in ≤1.5G MACs regimes (Xu et al., 2023, Daxberger et al., 2023).
  • Shared-weight slicing (Scala) enables a single “supernet” to host Tiny and Small variants for flexible resource scaling, with matched accuracy to independently trained models (Zhang et al., 2024).
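
The low-rank-Q guideline can be illustrated with a small attention module in which only the query projection is factored through a bottleneck. The rank, head count, and layout below are illustrative assumptions, not the cited paper's exact design:

```python
# Sketch of multi-head attention where only the Q projection is low-rank
# factored (D -> rank -> D); K and V keep full-rank projections.
import torch
import torch.nn as nn

class LowRankQAttention(nn.Module):
    def __init__(self, dim: int = 192, num_heads: int = 3, rank: int = 48):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Sequential(nn.Linear(dim, rank, bias=False),
                               nn.Linear(rank, dim))
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        h = self.num_heads
        q = self.q(x).reshape(B, N, h, D // h).transpose(1, 2)   # (B, h, N, d)
        k = self.k(x).reshape(B, N, h, D // h).transpose(1, 2)
        v = self.v(x).reshape(B, N, h, D // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

x = torch.randn(2, 197, 192)
print(LowRankQAttention()(x).shape)   # torch.Size([2, 197, 192])
# Full-rank Q: 192*192 + 192 ≈ 37k params; factored: 192*48 + 48*192 + 192 ≈ 18.6k.
```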

Key limitations are the inherent fragility of tiny ViTs to overfitting, volatile convergence absent careful initialization or distillation, and the saturation of improvements when further downsized (below r=0.25 slicing or under ≥90% parameter pruning) (Ren et al., 2023, Zhang et al., 2024). In scenarios where compute and memory are severely constrained, compact CNNs or MobileNetV3 may still outperform in latency-critical applications, though tiny ViTs now regularly match or exceed accuracy for the same parameter budget (Tang et al., 2023, Mohammad et al., 3 Apr 2025).

7. Applications and Future Directions

ViT-Tiny/Small models have become common baselines in:

  • Mobile and embedded computer vision (industrial inspection, microscopy autofocus, resource-limited detection) (Cuenat et al., 2022, Mohammad et al., 3 Apr 2025)
  • Medical image classification with low data regimes (DermaMNIST: ViT-Tiny 79.86%, ViT-Small 81.56%) (Amangeldi et al., 13 May 2025)
  • Automated NAS for device-specific deployment—ElasticViT-Tiny/Small variants achieve SOTA accuracy/latency across heterogeneous mobile CPUs/GPUs (Tang et al., 2023)
  • Flexible, dynamic inference and model compression pipelines via one-shot slimmable ViTs (Zhang et al., 2024)

Ongoing research focuses on closing the accuracy gap to much larger ViTs under low-parameter constraints, improving robustness via advanced distillation (e.g., TinyMIM), modular attention/locality inductive bias learning, and universal, shared-weight model architectures with sliceable subnets for “anytime” vision deployment.


References:

(Wu et al., 2022, Amangeldi et al., 13 May 2025, Tan, 2024, Xu et al., 2023, Wu, 7 Jan 2025, Ren et al., 2023, Gani et al., 2022, Tang et al., 2023, Daxberger et al., 2023, Mohammad et al., 3 Apr 2025, Zhang et al., 2024, Cuenat et al., 2022)
