Patch/Token-based Transformers

Updated 8 June 2026

Patch/Token-based Transformers are methods that discretize structured data into tokens, enabling efficient self-attention for modalities like images, time series, audio, and geometry.
Adaptive tokenization approaches, including dynamic patching and learned pattern tokens, adjust token sizes based on local content complexity to optimize model efficiency.
Techniques such as token pruning, hierarchical merging, and CLS specialization enhance scalability, robustness, and performance across various domains.

Patch/token-based Transformers represent a fundamental approach for adapting Transformer architectures to modalities such as images, time series, audio, geometric data, and beyond, by discretizing the input into a set of fixed, learned, or adaptively sized tokens. This discretization serves to convert high-dimensional, structured data into a sequence amenable to self-attention. The design and implementation of patch-based tokenization are critical determinants of model efficiency, representational fidelity, scalability, and robustness, and have stimulated diverse research directions across vision, time series, audio, geometry, and mixed-modal benchmarks.

1. Canonical Patch Tokenization: Fixed Grids and Their Limitations

The canonical Vision Transformer (ViT) tokenizes an image $X\in\mathbb{R}^{H\times W\times C}$ into $N=(H/p)\cdot(W/p)$ non-overlapping $p\times p$ patches. Each patch $x_i\in\mathbb{R}^{p^2\cdot C}$ is flattened and projected via a learned linear embedding $E\in\mathbb{R}^{D\times p^2\cdot C}$ to yield the token $z_i^0=E x_i + e_{pos}(i)$, where $D$ is the model width and $e_{pos}(i)$ is a positional embedding. A special class token $[CLS]$ is typically prepended. The resulting token sequence of length $N+1$ serves as input to $L$ layers of multi-head self-attention and MLPs, with [CLS] features used for classification and patch tokens optionally used for dense prediction (Qin et al., 2021).
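The tokenization step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `patchify` and `embed`, the random initialization, and the choice to give the class token its own positional embedding (so `pos` has $N+1$ rows) are assumptions made for the example.

```python
import numpy as np

def patchify(image, p):
    """Split an (H, W, C) image into N = (H/p)*(W/p) flattened p x p patches."""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by p"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

def embed(image, p, E, pos, cls):
    """Build the (N+1, D) input sequence: [CLS], then z_i^0 = E x_i + e_pos(i)."""
    patches = patchify(image, p)             # (N, p^2 * C)
    tokens = patches @ E.T                   # (N, D)
    seq = np.vstack([cls[None, :], tokens])  # prepend the class token
    return seq + pos                         # assumption: pos also covers [CLS]

# Illustrative sizes only; none of these values come from the text.
rng = np.random.default_rng(0)
H, W, C, p, D = 32, 32, 3, 8, 16
N = (H // p) * (W // p)
E = rng.normal(size=(D, p * p * C)) * 0.02
pos = rng.normal(size=(N + 1, D)) * 0.02
cls = np.zeros(D)
image = rng.normal(size=(H, W, C))
seq = embed(image, p, E, pos, cls)
```

Row 0 of `seq` is the class token and rows 1 to $N$ follow the patch grid in row-major order, which is the ordering the positional embedding is assumed to index.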

While effective for classification, fixed-grid patching introduces several representational bottlenecks:

Patch boundaries may bisect semantically or structurally coherent regions, degrading spatial or object continuity (Li et al., 2023).
For a fixed patch size $p$, the number of tokens $N$ grows quadratically with image side length, and the $\mathcal{O}(N^2)$ cost of self-attention directly inflates compute and memory (Choudhury et al., 20 Oct 2025).
Rigid token counts cannot adapt to areas of varying complexity or redundancy, leading to wasted computation in smooth regions and insufficient resolution in complex ones.
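The scaling limitation in the list above can be made concrete with a little arithmetic. The sketch below counts patch tokens and pairwise attention interactions for a square image; the patch size of 16 and the three resolutions are illustrative choices, and the pair count ignores the class token, attention heads, and constant factors.

```python
def token_count(H, W, p):
    """Number of non-overlapping p x p patches, N = (H/p) * (W/p)."""
    return (H // p) * (W // p)

def attention_pairs(n_tokens):
    """Token-to-token interactions in one full self-attention map."""
    return n_tokens ** 2

for side in (224, 448, 896):
    n = token_count(side, side, 16)
    print(f"side={side}  tokens={n}  attention pairs={attention_pairs(n)}")
```

Each doubling of the side length quadruples the token count and multiplies the number of attention pairs by 16, which is the pressure that motivates adaptive token budgets.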

Imposing constant-size patches in other modalities—such as fixed-length temporal patches for time series or fixed 2D spectrogram tiles for audio—similarly faces fundamental trade-offs in fidelity, efficiency, and downstream model quality (Ankireddy et al., 11 Mar 2026, Lee et al., 2 Apr 2025).

2. Adaptive and Content-aware Tokenization

To address the rigidity of uniform patchification, multiple approaches introduce adaptive, heterogeneous, or learned tokenization schemes.

Dynamic or Content-aware Patching: In time series, TimeSqueeze applies a pointwise state-space encoder (Mamba) to extract full-resolution features, then segments sequences based on local signal complexity using a relative deviation criterion. Short, information-dense segments and long, redundant intervals yield variable-length patch tokens, each anchored to boundary embeddings and paired with original time indices. This dynamic patching preserves critical structure while delivering faster pretraining, higher data efficiency, and lower GPU memory use than pointwise tokenization (Ankireddy et al., 11 Mar 2026).
Adaptive Patch Sizes in Vision: Adaptive Patch Transformers (APT) recursively partition images using multiple discrete patch sizes determined by local content entropy. Homogeneous regions are assigned larger tokens, while complex regions retain smaller tokens. This strategy reduces token counts by 20–45%, accelerates both training and inference by 40–86% on large ViT models, and keeps downstream accuracy close to the baseline after a single epoch of fine-tuning (Choudhury et al., 20 Oct 2025).
Learned Pattern Tokens: Patternformer replaces regular grids with adaptive pattern extraction from a CNN backbone (e.g., ResNet). Each output channel forms a pattern map (interpretable as a soft, learned token), which is projected to token space and then processed by a shallow Transformer. This preserves semantic continuity and shortens the token sequence, yielding state-of-the-art accuracy on CIFAR and competitive ImageNet results with fewer, more informative tokens (Li et al., 2023).
Differentiable Hierarchical Visual Tokenization: dHT performs hierarchical, pixel-level, differentiable superpixel merging guided by an information criterion balancing fit and model complexity. Resulting superpixels are mean-injected into region features and rasterized into ViT-compatible tokens, offering seamless retrofitting for pretrained models and improved accuracy/efficiency trade-offs in classification and dense prediction (Aasan et al., 4 Nov 2025).
Dynamic Patch Scheduling at Inference: In diffusion-based image generation, DDiT dynamically chooses patch sizes at each denoising timestep, using coarse tokens while the latent evolves slowly and fine tokens near completion. This schedule, determined by third-order finite differences of latent features, achieves substantial acceleration with negligible perceptual degradation (Kim et al., 19 Feb 2026).
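To illustrate the content-aware idea behind adaptive patch sizes, the following sketch recursively splits a grayscale image into square patches, keeping a large patch where the intensity histogram has low entropy and subdividing where it is high. It is a toy quadtree under stated assumptions, not the APT algorithm itself: the function names, the 16-bin histogram, the entropy threshold of 1 bit, and the 32/8 pixel size limits are all hypothetical choices.

```python
import numpy as np

def patch_entropy(block, bins=16):
    """Shannon entropy (bits) of the intensity histogram of a block in [0, 1]."""
    hist, _ = np.histogram(block, bins=bins, range=(0.0, 1.0))
    prob = hist[hist > 0] / hist.sum()
    return float(-(prob * np.log2(prob)).sum())

def adaptive_patches(img, max_size=32, min_size=8, threshold=1.0):
    """Return (row, col, size) patches that tile a 2D image without overlap."""
    H, W = img.shape
    assert H % max_size == 0 and W % max_size == 0
    out = []

    def split(r, c, s):
        block = img[r:r + s, c:c + s]
        if s > min_size and patch_entropy(block) > threshold:
            h = s // 2  # complex region: recurse into four quadrants
            for dr in (0, h):
                for dc in (0, h):
                    split(r + dr, c + dc, h)
        else:
            out.append((r, c, s))  # homogeneous region or minimum size reached

    for r in range(0, H, max_size):
        for c in range(0, W, max_size):
            split(r, c, max_size)
    return out

# Toy image: flat everywhere except one textured quadrant.
rng = np.random.default_rng(0)
img = np.full((64, 64), 0.5)
img[:32, :32] = rng.random((32, 32))
patches = adaptive_patches(img)
uniform_count = (64 // 8) ** 2  # tokens a fixed 8 x 8 grid would produce
```

On this toy input the three flat quadrants each stay a single 32 x 32 patch while the textured quadrant is split down to 8 x 8, so the sequence is much shorter than the fixed 8 x 8 grid. A real system would additionally need to resize or re-embed the different patch sizes into a common token width.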

3. Multimodal and Structured Tokenization

Patch/token-based Transformers generalize beyond vision:

Time Series: Both fixed and dynamic segment patching have been explored; CNN-based patch tokenizers extract local dynamics, followed by a Transformer for long-range inter-patch dependency modeling. Decoupling local representation from global modeling leads to robust and stable forecasting even in the presence of structured dynamic/static variations (Ankireddy et al., 11 Mar 2026, Nagrath, 18 Jan 2026).
Audio: ViT-based audio classification uses 2D Mel-spectrograms split into patches; token pruning selects the $k$ most salient tokens (using attention- or energy-based metrics) at intermediate layers. Pruning tokens in this way reduces computation with only a small loss in classification accuracy; both high- and low-intensity tokens are needed, with patch importance differing markedly between speech and general audio (Lee et al., 2 Apr 2025).
Geometric Data and PDEs: Patch tokenization on 3D meshes is achieved by geometry-aware spectral coarsening (e.g., algebraic multigrid), forming variable-sized patches (clusters) aligned with Laplace–Beltrami eigenfunctions. Per-patch isometry-invariant heat kernel signatures serve as embeddings, and self-attention is masked by geodesic distance to encode local/global context. MeshTok extends this paradigm to PDE surrogates, allocating tokens via AMR-inspired refinement by local gradient/Laplacian energy, combining multi-scale tokens in a unified attention sequence. Both yield superior accuracy–efficiency trade-offs (Farazi et al., 2024, Zhao et al., 3 Jun 2026).
Medical Imaging/Registration: In unsupervised echocardiography registration, patch-based MLPs and Transformers, using standard tokenization, outperform CNN baselines, with the act of patchification itself driving gains in the physiologically plausible deformation field structure (Wang et al., 2022).
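The top-$k$ token selection mentioned for audio spectrograms can be sketched as follows. This is an illustrative helper, not the cited method's code: the name `topk_prune`, the convention that row 0 holds the class token and is always kept, and the random saliency scores standing in for attention- or energy-based metrics are assumptions.

```python
import numpy as np

def topk_prune(tokens, scores, k):
    """Keep the class token (row 0) plus the k highest-scoring patch tokens.

    tokens: (1 + N, D) sequence; scores: (1 + N,) saliency per token.
    Kept patch tokens stay in their original order.
    """
    patch_scores = scores[1:]
    assert 1 <= k <= patch_scores.shape[0]
    top = np.argpartition(patch_scores, -k)[-k:]  # indices of the k largest
    keep = np.sort(top) + 1                       # shift past the class token
    idx = np.concatenate([[0], keep])
    return tokens[idx], idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1 + 12, 4))
saliency = rng.random(1 + 12)  # stand-in for e.g. class-token attention weights
pruned, idx = topk_prune(tokens, saliency, k=6)
```

Because the kept indices are returned alongside the tokens, a later layer can still recover each surviving token's position in the spectrogram grid.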

4. Token Reduction, Merging, and Specialization

Token overhead from patchification motivates several reduction and merging techniques:

Token Pruning: TopK-based pruning at intermediate Transformer blocks in ViT (vision/audio) retains the most informative tokens, substantially reducing computation and memory; simple statistics (intensity/variance) correlate with attention-based importance but do not fully substitute at deeper pruning rates (Lee et al., 2 Apr 2025).
PatchMerger: A learned bottom-up soft-attention module aggregates $N$ tokens into $M<N$ tokens partway through the network, with each new token computed as a weighted sum of all input tokens. This yields a notable speedup with negligible accuracy loss on both upstream and downstream vision tasks (Renggli et al., 2022).
Differentiable or Hierarchical Pruning: dHT adaptively merges superpixel tokens in a differentiable fashion and supports retrofitting to pretrained backbones without additional architectural components (Aasan et al., 4 Nov 2025).
CLS/Patch Specialization: Explicit separation of [CLS] and patch token computation paths (specialized LayerNorm and early QKV projections) improves segmentation mIoU at little cost in classification accuracy and with only a small increase in parameters. The separation reduces friction between global and local feature learning; block-wise ablations reveal that specialization in the early blocks alone suffices for maximal dense performance improvement (Marouani et al., 9 Feb 2026).
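The soft merging idea described for PatchMerger, where each output token is a weighted sum of all input tokens, can be sketched as a single matrix expression. The version below is a simplified reading under assumptions: the weight matrix `W` of shape (M, D), the softmax taken over the input-token axis, and the omission of normalization layers are choices made for the example rather than details confirmed by the text.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def patch_merge(X, W):
    """Merge N input tokens into M output tokens.

    X: (N, D) token matrix; W: (M, D) learned scoring weights (assumed shape).
    Each output row is a convex combination of all N input rows.
    """
    weights = softmax(W @ X.T, axis=-1)  # (M, N); every row sums to 1
    return weights @ X, weights          # merged tokens have shape (M, D)

rng = np.random.default_rng(0)
N_in, M_out, D = 196, 8, 32
X = rng.normal(size=(N_in, D))
W = rng.normal(size=(M_out, D)) * 0.1
Y, weights = patch_merge(X, W)
```

Since every later layer now attends over `M_out` tokens instead of `N_in`, the saving compounds across the remaining depth of the network.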

5. Robustness, Selectivity, and Architectural Inductive Bias

Adversarial Robustness: Patch-based ViTs are highly sensitive to sparse, block-based adversarial token attacks: corrupting a single patch token can halve robust accuracy, whereas CNNs retain higher robustness under similar attacks. PatchCensor provides a certified defense by exhaustively mutating attention masks to exclude any possible patch, achieving certified accuracies up to 69.4% on ImageNet at 2%-pixel patch size without retraining (Joshi et al., 2021, Huang et al., 2021).
Negative Patch Augmentation: Introducing patch-based negative samples (e.g., patch shuffling, rotation) during training penalizes reliance on non-robust, patch-surviving features, boosting ViT robustness by 1–2% in OOD benchmarks without harming in-distribution accuracy (Qin et al., 2021).
Token Interaction and Semantic Diffusion: Research identifies 'semantic diffusion' (excessive global mixing of class/scene semantics into all patch tokens) as a central cause of dense prediction degradation. Replacing softmax attention with entmax-1.5 introduces sparsity in token interactions, improving segmentation mIoU by 1–6 points with no accuracy drop and forming sharper, more localized patch features (Su, 22 May 2026).
Jumbo CLS Tokens and Extended Global Representation: Widening the CLS token (Jumbo) and providing it with a dedicated FFN (split/reassemble) significantly improves low-width ViT accuracy and downstream performance, outperforming the register trick or specialized mobile ViTs. Jumbo enables support for masked autoencoding and token dropping, extending token-based approaches to SSL and time series (Fuller et al., 20 Feb 2025).
Mixture-of-Experts Decoders: Specialized decoders take in multi-scale tokens and mix them per-pixel via learned gating, promoting expert specialization and mitigating interference typical in classical ViT encoders (Ou et al., 2022).
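To show how entmax-1.5 produces sparse token interactions where softmax does not, the sketch below evaluates it numerically. It uses the closed form $p_i = [z_i/2 - \tau]_+^2$ with the threshold $\tau$ found by bisection so that the outputs sum to one. The bisection approach is an assumption chosen for clarity; it is not the exact sorting-based algorithm and not necessarily what the cited work uses.

```python
import numpy as np

def entmax15(z, n_iter=60):
    """Approximate entmax-1.5 over the last axis.

    Solves sum_i max(z_i / 2 - tau, 0)**2 = 1 for tau by bisection.
    """
    x = np.asarray(z, dtype=float) / 2.0
    hi = x.max(axis=-1, keepdims=True)  # at tau = hi the sum is 0 (below 1)
    lo = hi - 1.0                       # at tau = lo the sum is at least 1
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        total = (np.clip(x - mid, 0.0, None) ** 2).sum(axis=-1, keepdims=True)
        lo = np.where(total >= 1.0, mid, lo)
        hi = np.where(total >= 1.0, hi, mid)
    p = np.clip(x - lo, 0.0, None) ** 2
    return p / p.sum(axis=-1, keepdims=True)  # remove residual bisection error

def softmax(z):
    """Reference dense normalization for comparison."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[4.0, 0.0], [0.0, 0.0]])  # toy attention logits
sparse = entmax15(scores)
dense = softmax(scores)
```

For the first row the low-scoring token receives exactly zero weight under entmax-1.5 but a small positive weight under softmax, while the tied second row is split evenly by both.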

6. Applications, Generalizations, and Transfer

Patch/token-based Transformers are now established not only in image recognition, segmentation, and detection, but also in medical imaging (registration, lesion detection), time series forecasting, audio classification, molecular property prediction, and neural PDE surrogates. Transfer protocols include adaptively fitting tokenizers (dHT) to pretrained backbones, fine-tuning with adaptive patching schemes (APT), and parameter-efficient fine-tuning via token-parameterized rank-1 patches (e.g., LoRA, prompt tuning) (Goldwaser et al., 22 Nov 2025). The modular design of patch/token-based approaches—comprising tokenization, embedding, attention, reduction, and specialized decoding—supports rapid adaptation across domains and model scales.

A plausible implication is that future research will further close the gap between tokenization granularity, semantic content, and task requirements, using heterogeneity, sparsity, and learned region shapes as first-class architectural ingredients for efficiency and fidelity. Patch/token-based schemes thus provide a central design axis for scalable, adaptable, and robust Transformer models.