Patch Embedding and Tokenization
- Patch embedding and tokenization are methods for converting structured data into vector representations, originally using fixed grid patches but now evolving to incorporate adaptive, semantically-driven token boundaries.
- Recent advancements include entropy-based dynamic patchification, reinforcement learning for optimal token placement, and content-aware schemes that improve computational efficiency and model accuracy.
- These techniques have practical applications across computer vision, time-series analysis, and code diffs, offering significant efficiency gains and improved downstream performance in Transformer-based architectures.
Patch embedding and tokenization constitute the foundational process by which complex structured data—such as images, time-series, or code changes—are converted into discrete, fixed- or variable-length vector representations suitable for downstream modeling with Transformer or autoregressive architectures. Driven by advances in Vision Transformers (ViTs), LLMs, and multimodal architectures, research in this area has revealed the limitations of fixed, regular-grid patchification and inspired a multitude of adaptive, hierarchical, and semantically-driven tokenization paradigms.
1. Classical Patch Embedding: Uniform Grid and Linear Projection
The canonical approach in Vision Transformers splits the input image I ∈ ℝ^{H×W×3} into N = HW/P² non-overlapping, fixed-size P×P patches, each flattened to produce a vector p_i ∈ ℝ^{P²·3}. A learned linear projection E ∈ ℝ^{D×(P²·3)} maps each patch to a D-dimensional embedding:

z_i = E·p_i ∈ ℝ^D,  i = 1, …, N.
Positional embeddings and, optionally, a class token are added before processing with Transformer encoder blocks (Renggli et al., 2022). This grid-based uniform tokenization is computationally convenient but is agnostic to semantic structure, object boundaries, and spatial heterogeneity present in the data.
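This grid patchification reduces to a reshape followed by a matrix multiply. A minimal NumPy sketch (function name is illustrative, and the random projection stands in for the learned matrix E):

```python
import numpy as np

def patch_embed(image, patch_size, embed_dim, rng=np.random.default_rng(0)):
    """Split an H×W×C image into non-overlapping P×P patches and
    linearly project each flattened patch to embed_dim dimensions."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by P"
    # Reshape into a (H/P, W/P) grid of P×P×C patches, flatten each to P²·C
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, P * P * C))
    # Projection E ∈ ℝ^{D×(P²·C)}; random here as a stand-in for a learned E
    E = rng.standard_normal((embed_dim, P * P * C)) / np.sqrt(P * P * C)
    return patches @ E.T  # (N, D) token embeddings

tokens = patch_embed(np.zeros((224, 224, 3)), patch_size=16, embed_dim=768)
print(tokens.shape)  # (196, 768)
```

For a 224×224 image and P = 16 this yields the familiar N = 196 tokens of ViT-Base.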
2. Adaptive and Dynamic Patchification: Data-Driven Token Grouping
Recent work has focused on overcoming the inefficiency and lack of semantic alignment inherent in fixed patching by introducing adaptive tokenization schemes:
- Dynamic Patchification via Entropy or Saliency: DPAR (Srivastava et al., 26 Dec 2025) aggregates VQ-VAE tokens into variable-length patches using a next-token prediction entropy computed by a lightweight autoregressive model. Patch boundaries are dynamically placed when the predictive entropy exceeds a threshold or a maximum patch length is reached. This mechanism allows computational resources to be allocated preferentially to information-rich regions, yielding up to 1.8–2.1× token count reduction and 40% training FLOP savings, with improved FID and convergence.
- Reinforcement-Learned Patching: ReinPatch (Wu et al., 27 Mar 2026) casts variable-length patching as a discrete MDP. A Transformer-based policy π_θ decides patch boundaries to minimize downstream task loss. Training utilizes a Group Relative Policy Gradient that enforces a hard compression rate constraint and directly optimizes token placement for the end-task. Empirically, this yields state-of-the-art performance in time-series aggregation and interpretable, data-adaptive segmentation.
- Mixed-Resolution and Content-Aware Tokenization: The Quadformer (Ronen et al., 2023) uses a quadtree splitting strategy guided by a saliency scorer (pixel-blur, feature-based, or Grad-CAM) to construct a non-uniform patch mosaic. High-saliency regions are tokenized at higher resolution, while homogeneous areas are allocated larger, coarser patches. This mixed-resolution paradigm achieves 0.5–0.9 pp absolute accuracy gains under fixed compute over uniform ViT baselines.
- Dynamic Patch Schedules for Generative Models: DDiT (Kim et al., 19 Feb 2026) dynamically adjusts the patch size at every step of the diffusion process in image/video generation, leveraging the third-order finite difference in latent evolution to choose between coarse and fine tokenizations. This schedule achieves up to 3.5× speedup with negligible FID degradation.
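The entropy-thresholding rule behind DPAR-style dynamic patchification can be illustrated in a few lines (function and parameter names are hypothetical, and the paper's boundary rule may differ in detail):

```python
def entropy_patch_boundaries(entropies, threshold, max_len):
    """Group a 1D token stream into variable-length patches: start a new
    patch when the next-token predictive entropy exceeds `threshold` or
    the current patch reaches `max_len` tokens. Returns patch start indices."""
    boundaries, cur_len = [0], 0
    for i, h in enumerate(entropies):
        cur_len += 1
        if (h > threshold or cur_len >= max_len) and i + 1 < len(entropies):
            boundaries.append(i + 1)  # high surprise → cut a new patch here
            cur_len = 0
    return boundaries

# Low-entropy runs are absorbed into long patches; entropy spikes cut new ones
print(entropy_patch_boundaries(
    [0.1, 0.2, 1.5, 0.1, 0.1, 0.1, 2.0, 0.3], threshold=1.0, max_len=4))
# → [0, 3, 7]
```

The effect is exactly the resource allocation described above: predictable (low-entropy) regions are covered by few long patches, while information-rich regions receive fine-grained boundaries.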
3. Semantically-Aware and Irregular Tokenization
To address the semantic mismatch between grid patches and perceptually meaningful objects or parts, several schemas have emerged:
- Subobject and Superpixel-Based Tokenization: The "Subobject-level Image Tokenization" approach employs boundary detection (DirectSAM) and connected-component segmentation to produce tokens that tightly align with semantic regions in the image (Chen et al., 2024). Tokens are embedded via a sequence-to-sequence autoencoder (SeqAE). In synthetic vision–language tasks, this yields improved training speed, greater per-attribute accuracy, and semantically coherent segmentations.
- Homogeneous and SIR-Based Tokens: HOOK (Shao et al., 2024) defines "semantically independent regions" (SIRs) as maximal connected pixel sets aligned with true object/label boundaries. Local and global self-attention is used for perceptual grouping, and a cross-attention-based "Object Vectorization Module" aggregates seeds into adaptive numbers of object-aligned tokens. This framework achieves up to 10% absolute accuracy improvements and 1.5–2.8× efficiency gains versus standard patching.
- Superpixel Modularization: SPiT (Aasan et al., 2024) generalizes tokenization by applying a content-aware graph-based superpixel hierarchy, with region-adaptive, scale- and shape-invariant feature projections. Decoupling tokenization from feature embedding, SPiT demonstrates increased attribution faithfulness, improved zero-shot dense prediction, and modular compatibility with standard ViTs.
- Hierarchical Tokenization with Information Criteria: Differentiable Hierarchical Tokenization (dHT) (Aasan et al., 4 Nov 2025) trains a featurewise pixel-embedding CNN, merges pixels or regions by similarity, and uses information-criterion pruning to select the optimal partition scale. Each region is projected into a ViT-compatible embedding via mean-injection and masked sampling. dHT yields up to 1.3 pp ImageNet gain compared to fixed-grid ViTs and supports seamless retrofitting.
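The similarity-driven merging these region-based tokenizers share can be sketched in one dimension (a toy agglomerative merge; the cited methods operate on 2D pixel graphs with learned features and information-criterion stopping rather than a fixed region count):

```python
import numpy as np

def greedy_region_merge(features, num_regions):
    """Agglomeratively merge adjacent 1D segments by mean-feature
    distance until `num_regions` remain; a toy stand-in for the
    similarity-driven region merging used by hierarchical tokenizers."""
    segments = [[i] for i in range(len(features))]
    feats = [np.asarray(f, dtype=float) for f in features]
    while len(segments) > num_regions:
        # Find the adjacent pair with the smallest mean-feature distance
        dists = [np.linalg.norm(feats[i] - feats[i + 1])
                 for i in range(len(segments) - 1)]
        j = int(np.argmin(dists))
        # Merge segment j+1 into j, updating the size-weighted mean feature
        n_l, n_r = len(segments[j]), len(segments[j + 1])
        feats[j] = (n_l * feats[j] + n_r * feats[j + 1]) / (n_l + n_r)
        segments[j] += segments.pop(j + 1)
        feats.pop(j + 1)
    return segments

# Two visually homogeneous runs collapse into two region tokens
print(greedy_region_merge([[0.0], [0.1], [5.0], [5.1]], num_regions=2))
# → [[0, 1], [2, 3]]
```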
4. Specialized Patch Embedding for Non-Image Domains
Patch tokenization principles generalize to non-image structured data:
- Time-Series: Hybrid CNN-patching (Nagrath, 18 Jan 2026) extracts fixed-length temporal patches, encodes local structure with a 1D CNN + attention pooling, and refines patch embeddings with token-level self-attention before global Transformer processing. This separation of local and global dynamics improves forecasting stability and accuracy.
- Code Changes (Software Patches): Methods such as Patcherizer (Tang et al., 2023) and MultiSEM (Tang et al., 2023) process code diffs and unified patches by extracting line-level, word-level, and contextual tokens, applying cross-attention between added/removed lines, and fusing representations across sequence and hierarchical graph (AST) structures. Similarly, hierarchical BPE-based tokenization (Dolga et al., 17 Oct 2025) uses explicit end-of-patch markers and a controlled patch length for language-agnostic, efficient translation from characters to tokens with lower model parameterization.
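The first stage of patch-based time-series modeling, fixed-length temporal patch extraction, is straightforward to sketch (names are illustrative; the hybrid CNN/attention encoder described above would then consume these patches):

```python
import numpy as np

def extract_patches(series, patch_len, stride):
    """Slice a univariate series into fixed-length, possibly overlapping
    temporal patches; each row becomes one token for local encoding."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

# 100 timesteps → 11 overlapping patches of length 16 (stride 8)
patches = extract_patches(np.arange(100.0), patch_len=16, stride=8)
print(patches.shape)  # (11, 16)
```

The stride controls the overlap/compression tradeoff: stride = patch_len gives the non-overlapping analogue of image grid patching.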
5. Transformer Architectural Innovations for Patch Tokens
Patchification strategy directly impacts model architecture:
- Hierarchical and Merging Operations: PatchMerger (Renggli et al., 2022) introduces mid-network token merging, where soft-attention-based pooling reduces the number of tokens after intermediate transformer blocks, yielding over 50% FLOP reductions and Pareto-optimal accuracy/compute tradeoffs. Merging is differentiable and compatible with standard ViT backbones.
- Multimodal Unification via Patch-As-Token: PaDT (Su et al., 2 Oct 2025) repurposes patch embeddings as "Visual Reference Tokens" to be directly generated/synthesized interleaved with text, enabling MLLMs to jointly address detection, segmentation, and text output. This involves dynamic expansion of the embedding table with patch-derived slots and a specialized task-decoder.
- Efficient Integration and Discriminative Extraction: EPIR (Wang et al., 9 Apr 2026) introduces Dual Norm Shifted Patch Tokenization (DNSPT), intra-block token integration, and dynamic discriminative token selection to robustly handle micro-expression recognition via spatial context encoding and progressive token reduction.
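The soft-attention pooling at the heart of PatchMerger-style token merging admits a compact sketch (random weights stand in for the learned queries, and the exact scaling/normalization in the paper may differ):

```python
import numpy as np

def patch_merger(X, W):
    """Soft-attention token merging: M query vectors W (M×D) attend over
    N input tokens X (N×D) and pool them into M output tokens via a
    row-wise softmax over the inputs."""
    scores = W @ X.T / np.sqrt(X.shape[1])        # (M, N) attention logits
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over the N tokens
    return attn @ X                               # (M, D) merged tokens

rng = np.random.default_rng(0)
X = rng.standard_normal((196, 64))   # 196 input tokens
W = rng.standard_normal((8, 64))     # 8 queries (random stand-in for learned)
merged = patch_merger(X, W)
print(merged.shape)  # (8, 64)
```

Because the pooling is a differentiable weighted average, it can be dropped between Transformer blocks and trained end-to-end, which is what enables the mid-network FLOP reductions cited above.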
6. Quantitative Impact and Empirical Results
Adaptive, semantic, and hierarchical patch tokenization yields substantial gains:
| Approach | Reduction in Token Count | Efficiency/FLOP Gain | Accuracy Lift | Tasks/Domains |
|---|---|---|---|---|
| DPAR (Srivastava et al., 26 Dec 2025) | 1.8–2.1× | 40% | +5pp (linear-probe) | Gen. image AR, ImNet |
| HOOK (Shao et al., 2024) | ~25× (vs. 196→8 tokens) | 1.5–2.8× | +6–10pp | Remote sensing, segm. |
| Quadformer (Ronen et al., 2023) | Flexible | Slight cost add | +0.5–0.9pp | ViT classification |
| dHT (Aasan et al., 4 Nov 2025) | Variable | Adaptive | +1.3pp (ViT-B/DEiT3) | Classification, seg. |
| SPiT (Aasan et al., 2024) | Variable | Comparable | Robustness, faithfn. | Attribution, segment. |
| ReinPatch (Wu et al., 27 Mar 2026) | Task-controlled | Variable | –7.7% MSE | Time-series |
| CoordTok (Jang et al., 2024) | 6–8× fewer | Order-of-mag. | +2–7 PSNR, –0.05 LPIPS | Video, diffusion |
This demonstrates that informed patch embedding, whether via entropy, semantic region, reinforcement learning, or hierarchical graph clustering, can drastically reduce token sequence lengths, lower computational demand, and improve accuracy and convergence (Srivastava et al., 26 Dec 2025, Shao et al., 2024, Ronen et al., 2023, Aasan et al., 4 Nov 2025, Aasan et al., 2024, Wu et al., 27 Mar 2026, Jang et al., 2024).
7. Open Challenges and Future Directions
While significant progress has been achieved in adaptive and semantic patch tokenization, several open research directions remain:
- Semantic Alignment vs. Efficiency: There exists a tradeoff between fine-grained semantic boundary adherence and computational regularity. It remains an open question how best to balance these for particular downstream tasks, especially in real-time or large-scale inference scenarios (Chen et al., 2024, Shao et al., 2024).
- End-to-End Learnable Tokenizers: While differentiable tokenizers (dHT (Aasan et al., 4 Nov 2025), learnable superpixels (Aasan et al., 2024)) have shown promise, their integration with foundation models and support for full pretraining/fine-tuning cycles require further exploration.
- Tokenization for Multimodal and Hierarchical Data: Universal patch/token models capable of flexibly handling image, video, language, and highly-structured data (e.g., graphs, code diffs) in a unified architecture represent a major area for future integration (Su et al., 2 Oct 2025, Dolga et al., 17 Oct 2025).
- Token Adaptivity at Inference: Dynamic scheduling of patch granularity at inference, as in DDiT (Kim et al., 19 Feb 2026) and DPAR (Srivastava et al., 26 Dec 2025), raises questions of distribution shift and optimality vis-à-vis training tokenization schedules.
Overall, patch embedding and tokenization are transitioning from a simple preprocessing step toward a data-adaptive, learning-integrated, and semantically meaningful component essential for scalable, efficient, and robust sequence modeling in both vision and non-vision domains.