Patch-Based Tokenization
- Patch-based tokenization is a method that divides structured data into contiguous patches, enabling discrete token formation for transformer models in vision, video, and time series.
- Adaptive strategies like superpixel and hierarchical tokenization enhance semantic alignment and reduce computational overhead by tailoring patch size and resolution to content.
- Emerging applications extend this approach to language and temporal signals, with ongoing research focused on end-to-end learnable, multi-resolution tokenization frameworks.
Patch-based tokenization is a class of strategies for converting structured input data—most notably images, video, time series, and even language—into discrete token sequences suitable for transformer-based models. Its central principle is to subdivide the input into contiguous, fixed- or variable-size patches, each treated as an atomic unit for embedding, attention, or further processing. Patch-based tokenization underpins Vision Transformers (ViTs) and many recent multimodal and temporal models, and has been generalized into content-adaptive, multi-resolution, and hierarchical schemes to bridge the gap between local pattern efficiency and semantic alignment across data domains (Lew et al., 2024, Schmidt et al., 10 Jun 2025, Jang et al., 2024, Dolga et al., 17 Oct 2025, Ronen et al., 2023, Aasan et al., 2024, Aasan et al., 4 Nov 2025, Chen et al., 2024, Zhang et al., 2024, Nagrath, 18 Jan 2026, Bumb et al., 15 Jun 2025, Kim et al., 19 Feb 2026). Below is an in-depth survey of the methodologies, implications, and variants of patch-based tokenization.
1. Canonical Patch-Based Tokenization: Foundations and Limitations
Patch-based tokenization was popularized in visual modeling through ViTs, where an input image is divided into non-overlapping square patches of fixed spatial dimensions $P \times P$. Each patch $x_i$ is flattened and projected into a $D$-dimensional embedding via a learnable linear operator:

$$z_i = E\,\mathrm{vec}(x_i) + p_i,$$

where $E \in \mathbb{R}^{D \times P^2 C}$ is the patch embedding matrix, $p_i$ is the positional encoding, and $C$ is the number of input channels. The resulting token sequence, along with (optionally) a global [CLS] token, is passed through the transformer encoder.
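As a concrete illustration, the grid tokenization above can be sketched in a few lines of NumPy. All sizes here are arbitrary stand-ins, and the embedding matrix and positional encodings are random placeholders rather than learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from any specific paper): 32x32 RGB image,
# 8x8 patches, embedding dimension D = 64.
H, W, C, P, D = 32, 32, 3, 8, 64
image = rng.standard_normal((H, W, C))

# Cut the image into non-overlapping PxP patches and flatten each one.
patches = (
    image.reshape(H // P, P, W // P, P, C)
         .transpose(0, 2, 1, 3, 4)
         .reshape(-1, P * P * C)          # (N, P*P*C) with N = 16 patches
)

# Learnable parameters in a real ViT; random stand-ins here.
E = rng.standard_normal((P * P * C, D))           # patch embedding matrix
pos = rng.standard_normal((patches.shape[0], D))  # positional encodings

tokens = patches @ E + pos    # token sequence fed to the encoder
print(tokens.shape)           # (16, 64)
```

A [CLS] token, when used, is simply one extra learned row prepended to `tokens` before the encoder.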
Although grid-based patch tokenization is efficient for capturing local structure and enables computational scaling (since self-attention cost is $O(N^2)$ in the number of tokens $N$), it enforces strict regularity. Individual patches may straddle object boundaries, mix heterogeneous semantics, or underrepresent small salient structures, a departure from the atomic-concept alignment achieved by NLP tokenizers (Lew et al., 2024, Aasan et al., 2024, Chen et al., 2024). This heterogeneous mixing undermines interpretability and may limit precision in dense or structured tasks.
2. Content-Adaptive and Hierarchical Patch Tokenization
To address the semantic rigidity of fixed patches, recent methods substitute uniform grid partitioning with content-adaptive strategies, notably superpixel tokenization and hierarchical segmentations.
- Superpixel Tokenization: Using pixel-affinity clustering (e.g., SLIC or graph-based merges), an image is partitioned into variable-shape, content-aligned superpixels. Token embeddings are computed as pooled statistics (mean, max) of learned local and positional features within each superpixel (Lew et al., 2024, Aasan et al., 2024, Aasan et al., 4 Nov 2025). The resulting tokens show improved semantic purity, enabling ViT models (SuiT, SPiT, ∂HT) to achieve equal or better accuracy and robustness versus grid-based ViTs, along with finer-grained attributions and improved zero-shot segmentation:
| Model    | Patch Type     | IN1k Top-1 | Salient Seg. | Explained Variance |
|----------|----------------|------------|--------------|--------------------|
| ViT-B16  | Square patches | 0.854      | 0.803        | -                  |
| SPiT-B16 | Superpixels    | 0.858      | 0.903        | 0.914              |
(Lew et al., 2024, Aasan et al., 2024, Aasan et al., 4 Nov 2025)
- Differentiable Hierarchical Tokenization: ∂HT constructs a full partition tree of the image using learnable, feature-driven region merging, then prunes to an optimal slice via information criteria (AIC/BIC). Each region becomes a differentiable token by mean-injection and feature-based cropping, directly feeding standard ViT layers. This approach supports both image-level and dense tasks, as well as out-of-the-box raster-to-vector conversion via region boundaries (Aasan et al., 4 Nov 2025).
- Subobject-Level Tokens: Techniques inspired by linguistic subword segmentation adapt boundary-based or watershed algorithms to produce variable, morphology-aligned "subobject" tokens, yielding both improved monosemanticity and sample efficiency in downstream vision-LLMs (Chen et al., 2024).
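The superpixel pooling step described above can be sketched directly. This assumes a label map has already been produced by an off-the-shelf algorithm such as SLIC; here a random toy map stands in for it, and mean/max pooling forms one token per region:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a real superpixel map (e.g. from SLIC): 4 region ids
# scattered over an 8x8 grid, plus learned per-pixel features.
labels = rng.integers(0, 4, size=(8, 8))      # superpixel id per pixel
features = rng.standard_normal((8, 8, 16))    # per-pixel feature vectors

def superpixel_tokens(features, labels):
    """Pool per-pixel features into one token per superpixel (mean + max)."""
    flat_f = features.reshape(-1, features.shape[-1])
    flat_l = labels.ravel()
    tokens = []
    for s in np.unique(flat_l):
        region = flat_f[flat_l == s]
        # Concatenate mean- and max-pooled statistics within the region.
        tokens.append(np.concatenate([region.mean(0), region.max(0)]))
    return np.stack(tokens)                   # (num_superpixels, 2*d)

tokens = superpixel_tokens(features, labels)
print(tokens.shape)
```

Because each token corresponds to one content-aligned region, the sequence length now varies with image content rather than being fixed by a grid.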
3. Multi-Resolution and Dynamic Patch Schemes
Patch-based tokenization has evolved to better allocate model capacity to informative regions through multi-scale and dynamically scheduled patch extraction:
- Foveated/Variable-Resolution Tokenization: Segment This Thing (STT) extracts concentric grids of patches at resolutions decaying with distance from a user-specified point prompt. This reduces the number of tokens from 4096 (uniform grid) to as low as 172 for a 1024×1024 image, cutting compute by roughly two orders of magnitude without compromising segmentation accuracy near the prompt. FLOP and latency benchmarks demonstrate clear efficiency:
| Model | Tokens | GFLOPs | Latency (ms) |
|-------|--------|--------|--------------|
| SAM-H | 4096   | 6533.7 | 572.7        |
| STT-B | 172    | 30.9   | 7.3          |
- Quadtree/Mixed-Resolution Tokenization: Variable-size patches are constructed by recursively subdividing the most salient regions (via MSE or feature-based scorers), as in Quadformer. This strategy maintains or improves ImageNet accuracy at matched computation, gaining +0.8% accuracy at low GMACs compared to a fixed-grid ViT (Ronen et al., 2023).
- Dynamic Patch Scheduling: DDiT applies test-time adaptive patching in denoising diffusion transformers, estimating the latent "acceleration" to select coarser patches in early steps (global structure) and finer patches in later steps (local detail). This yields a $2\times$ or greater inference acceleration without significant FID or CLIP degradation (Kim et al., 19 Feb 2026).
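The resolution-decay idea behind foveated tokenization can be sketched as a toy approximation (this is not the STT implementation): crop concentric squares around the prompt point, each twice as large as the previous, and pool every crop down to the same fixed token size so that effective resolution falls with distance from the prompt.

```python
import numpy as np

rng = np.random.default_rng(2)

image = rng.standard_normal((256, 256, 3))  # toy input
cy = cx = 128                               # user-specified prompt point
P = 16                                      # fixed token resolution

def avg_pool(x, f):
    """Average-pool an (h, w, c) array by a factor f in each spatial dim."""
    h, w, c = x.shape
    return x.reshape(h // f, f, w // f, f, c).mean((1, 3))

tokens = []
for level in range(3):                      # crop sizes 16, 32, 64
    size = P * 2 ** level
    crop = image[cy - size // 2: cy + size // 2,
                 cx - size // 2: cx + size // 2]
    tokens.append(avg_pool(crop, 2 ** level))   # always PxP after pooling

print(len(tokens), tokens[0].shape)
```

In the actual method each pooled region is further divided into fixed-size patches arranged in concentric rings; the sketch keeps only the decaying-resolution aspect, which is where the token savings come from.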
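A quadtree scheme along the lines of Quadformer can be sketched with a recursive variance-based saliency scorer. The threshold and block sizes below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def quadtree_patches(img, y, x, size, thresh, min_size):
    """Recursively split patches whose pixel variance exceeds `thresh`.

    Returns (y, x, size) triples: coarse patches for flat regions,
    fine patches where the saliency score (here, variance) is high.
    """
    block = img[y:y + size, x:x + size]
    if size <= min_size or block.var() < thresh:
        return [(y, x, size)]
    half = size // 2
    out = []
    for dy in (0, half):
        for dx in (0, half):
            out += quadtree_patches(img, y + dy, x + dx, half,
                                    thresh, min_size)
    return out

# Toy 64x64 image: flat background with one noisy (salient) quadrant.
img = np.zeros((64, 64))
img[:32, :32] = rng.standard_normal((32, 32))

patches = quadtree_patches(img, 0, 0, 64, thresh=0.1, min_size=8)
print(len(patches))  # 19: three coarse 32x32 patches + sixteen 8x8 ones
```

The flat quadrants each survive as a single coarse token, while the noisy quadrant is refined to the minimum patch size, which is exactly the capacity allocation the mixed-resolution scheme aims for.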
4. Patch Tokenization Beyond Vision: Sequences and Time Series
The patch paradigm generalizes to language and temporal signals, manifesting as dynamic character grouping and temporal windowing:
- Hierarchical BPE Patch Grouping: In language modeling, dynamic grouping by hierarchical BPE introduces explicit end-of-patch markers and a second BPE compression layer, constraining patch granularity without sacrificing token efficiency or cross-linguistic generality. This yields improved bits-per-byte (BPB) metrics (1.11 BPB) and handles scripts without whitespace (e.g., Chinese) (Dolga et al., 17 Oct 2025).
- Patch Tokenization for Time Series: For a multivariate time series $X \in \mathbb{R}^{T \times C}$, patch-based tokenization creates fixed-length windows $x_p \in \mathbb{R}^{L \times C}$, encodes them via CNNs or raw prompt formatting (for LLMs), and reduces the sequence length fed to the model. Empirical results in PatchInstruct show a marked reduction in MAE compared to point-wise tokenization (Nagrath, 18 Jan 2026, Bumb et al., 15 Jun 2025). In forecasting frameworks that pair a CNN encoder with a transformer, explicit patch representation improves memory efficiency and decouples local from global dynamics.
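A toy version of the two-layer grouping idea for language: fixed-size character patches terminated by an explicit end-of-patch marker, followed by a second BPE compression pass. The marker symbol, patch size, and merge count are all illustrative choices, not details from the cited work:

```python
from collections import Counter

EOP = "¶"  # explicit end-of-patch marker (symbol chosen for illustration)

def to_patches(text, size=4):
    """Group characters into fixed-size patches terminated by EOP."""
    return [text[i:i + size] + EOP for i in range(0, len(text), size)]

def bpe_merges(seq, n_merges):
    """Minimal second-layer BPE: repeatedly merge the most frequent
    adjacent symbol pair in `seq` (a list of string symbols)."""
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(a + b)   # fuse the pair into one symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

symbols = [ch for patch in to_patches("tokenization by patches")
           for ch in patch]
compressed = bpe_merges(symbols, n_merges=10)
print(len(symbols), len(compressed))   # sequence shrinks after merging
```

Because the markers are ordinary symbols, the second BPE layer can merge across them only deliberately, which is how patch granularity stays constrained without sacrificing compression.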
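The temporal windowing step for time series can be sketched directly; the window length and stride below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative multivariate series: T = 96 steps, C = 3 channels.
series = rng.standard_normal((96, 3))

def patchify(series, length, stride):
    """Slice a (T, C) series into fixed-length windows of shape (length, C)."""
    T = series.shape[0]
    starts = range(0, T - length + 1, stride)
    return np.stack([series[s:s + length] for s in starts])

patches = patchify(series, length=16, stride=16)   # non-overlapping windows
print(patches.shape)                               # (6, 16, 3)
```

Each window is then flattened or passed through a small encoder to form one token, so the transformer sees 6 tokens here instead of 96 time steps.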
5. Applications to Video, Semantics, and Open Vocabulary Segmentation
- Coordinate-Based Video Patch Tokenization: CoordTok factorizes a long video clip into coordinate-aligned triplane tokens, reconstructing patches only as needed for training. For 128-frame videos, CoordTok achieves fidelity comparable to baselines using $1280$ tokens (vs. $6144$–$8192$), enabling efficient training and large receptive fields in diffusion transformers (Jang et al., 2024).
- Feature Pyramid and Open-Vocabulary Tokenization: Feature Pyramid Tokenization (PAT) creates multi-scale token sequences by clustering CLIP-ViT features via VQ codebooks in a spatial pyramid. Hierarchical codebooks and pixel/semantic branches support pretrained VLMs in open vocabulary segmentation and reconstructive learning (Zhang et al., 2024).
6. Comparative Analysis and Trade-Offs
Patch-based tokenization admits important trade-offs:
- Semantic Purity vs. Efficiency: Adaptive superpixel and hierarchical schemes approach the NLP ideal of atomic-concept tokens, at the expense of implementation complexity and, occasionally, minor computational overhead in tokenization (offset by token count reduction in transformers) (Lew et al., 2024, Aasan et al., 2024).
- Computational Scaling: Sequence length reduction by patching (spatially or temporally) is fundamentally advantageous, since self-attention and its associated memory/compute scale as $O(N^2)$ in the token count $N$.
- Flexibility and Compatibility: Modern superpixel/hierarchical tokenizers (SuiT, SPiT, ∂HT) preserve full compatibility with standard ViT architectures—only the input embedding changes.
- Limitations: Dynamic patch strategies must balance token granularity to avoid loss of critical detail at coarse scales. Character-level patching in language and adaptive patching in vision both require careful ablation to avoid under- or over-segmentation of meaningful units (Dolga et al., 17 Oct 2025, Lew et al., 2024).
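The quadratic-scaling argument above can be made concrete with simple arithmetic, here for a 224×224 image at two common patch sizes:

```python
# Self-attention compute grows quadratically in sequence length, so
# coarser patches shrink the N x N attention score matrix dramatically.
costs = {}
for patch in (16, 32):
    n_tokens = (224 // patch) ** 2   # tokens for a 224x224 image
    costs[patch] = n_tokens ** 2     # entries in the attention matrix
    print(patch, n_tokens, costs[patch])

# Doubling the patch side (16 -> 32) quarters the token count and
# therefore cuts attention-matrix entries by 16x.
```

This is why even modest token-count reductions from adaptive tokenizers translate into outsized savings in the transformer itself.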
7. Future Directions and Open Challenges
Emerging research is focused on:
- End-to-end learnable tokenization, as in differentiable hierarchical strategies, to jointly optimize patch formation and downstream loss (Aasan et al., 4 Nov 2025).
- Extension to spatiotemporal domains: video modeling, 3D scenes, and point cloud transformers with patch-based volumetric tokens (Jang et al., 2024, Schmidt et al., 10 Jun 2025).
- Open-world settings: subobject and adaptive-token VLMs for compositional, cross-modal reasoning (Chen et al., 2024, Zhang et al., 2024).
- Adaptive and hybrid token selection: spatial and task-driven patch merging/splitting (Kim et al., 19 Feb 2026).
- Joint optimization across domains (language, vision, time series) via hierarchical patch tokenization schemes (Dolga et al., 17 Oct 2025).
Patch-based tokenization remains foundational for efficient, scalable, and semantically meaningful transformer modeling across modalities. Ongoing work seeks to unify adaptive, hierarchical, and task-driven partitioning with universally compatible embedding strategies.
Refer to: (Lew et al., 2024, Jang et al., 2024, Ronen et al., 2023, Dolga et al., 17 Oct 2025, Aasan et al., 2024, Aasan et al., 4 Nov 2025, Zhang et al., 2024, Schmidt et al., 10 Jun 2025, Kim et al., 19 Feb 2026, Nagrath, 18 Jan 2026, Bumb et al., 15 Jun 2025, Chen et al., 2024) for methodology and empirical results.