Shifted Patch Tokenization (SPT) in Vision Transformers

Updated 15 March 2026

Shifted Patch Tokenization (SPT) is an embedding technique that aggregates overlapping spatial information from shifted image views to enhance locality in Vision Transformers.
It increases the effective receptive field without altering the self-attention mechanism, making it effective for low-level vision tasks and small-data regimes.
Empirical results show SPT boosts classification accuracy by up to 3.60% on Tiny-ImageNet and improves restoration metrics in tasks like denoising and inpainting.

Shifted Patch Tokenization (SPT) is an embedding technique designed to augment the locality bias and receptive field of patch-based Transformer architectures, especially Vision Transformers (ViT), by aggregating overlapping spatial information from shifted versions of the input. SPT addresses inherent deficiencies of standard non-overlapping patch tokenization—namely, the limited context captured by each token and the resulting inductive bias mismatch for low-level vision or small-data regimes. The approach enhances both model data efficiency and downstream task performance by directly integrating cross-patch spatial structures at the embedding stage, without modifying the self-attention mechanism or overall Transformer architecture (Lee et al., 2021, Verma et al., 2023).

1. Motivation and Underlying Principles

Standard ViT tokenization slices an image $I \in \mathbb{R}^{H \times W \times C}$ into non-overlapping $P \times P$ patches, each of which is flattened and linearly projected. This arrangement provides each token with information from a strictly local $P \times P$ region without access to spatially adjacent pixels, leading to the following limitations:

Limited local receptive field per token and no shared pixels across tokens, resulting in weak inductive locality bias.
The burden of learning local structures is transferred entirely to the Transformer’s self-attention, which is inefficient for small datasets or low-level vision tasks.
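
The first limitation can be made concrete with a small NumPy sketch of standard, non-overlapping patch extraction. The helper name `patchify` and the toy 8×8 single-channel image are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into flattened non-overlapping patches."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    grid = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, P * P * C)  # (N, P*P*C)

# Label every pixel with a unique integer so overlap is easy to detect.
image = np.arange(8 * 8).reshape(8, 8, 1)
tokens = patchify(image, patch_size=4)

print(tokens.shape)  # (4, 16)
# Each pixel id lands in exactly one token: nothing is shared across patch borders.
print(len(np.unique(tokens)) == tokens.size)  # True
```

Because every pixel belongs to exactly one token, any relation between pixels on opposite sides of a patch border has to be recovered later by self-attention.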

SPT addresses these deficits by introducing redundancy and spatial overlap at the tokenization step. By generating and concatenating several shifted versions of the input image (each displaced by a pre-defined offset), SPT allows each patch token to aggregate information from overlapping spatial neighborhoods. This approach raises the effective receptive field of each token and breaks the rigid partitioning boundaries present in vanilla ViT, enabling local continuity in the learned representations (Lee et al., 2021, Verma et al., 2023).

2. Mathematical Formulation

Let $I \in \mathbb{R}^{H \times W \times C}$ denote the input. Define a set of $N_s$ shifts with offsets $\{\delta_i\}_{i=1}^{N_s}$, such as diagonal or cardinal directions, with shift size typically $P/2$. For each offset $\delta$:

$I^{(\delta)} = \mathrm{CropPad}\bigl(\mathrm{Shift}(I,\,\delta)\bigr)$

All shifted images, along with the unshifted $I$, are concatenated channel-wise:

$I_{\mathrm{concat}} = [I; I^{(\delta_1)}; \ldots; I^{(\delta_{N_s})}] \in \mathbb{R}^{H \times W \times C (N_s+1)}$
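
A minimal NumPy sketch of the shift and concatenation steps follows. It assumes zero padding, diagonal offsets of $P/2$, and a channels-last layout; the helper names `shift_zero_pad` and `shifted_concat` are hypothetical and stand in for the $\mathrm{CropPad}(\mathrm{Shift}(\cdot))$ operator above.

```python
import numpy as np

def shift_zero_pad(image, dy, dx):
    """Shift an H x W x C image by (dy, dx) pixels, filling vacated pixels with zeros."""
    H, W, _ = image.shape
    m = max(abs(dy), abs(dx))
    padded = np.pad(image, ((m, m), (m, m), (0, 0)))
    # Cropping a displaced H x W window moves the content by (dy, dx).
    return padded[m - dy:m - dy + H, m - dx:m - dx + W]

def shifted_concat(image, offsets):
    """Stack the original image and its shifted views along the channel axis."""
    views = [image] + [shift_zero_pad(image, dy, dx) for dy, dx in offsets]
    return np.concatenate(views, axis=-1)

P = 4
s = P // 2
diagonal = [(-s, -s), (-s, s), (s, -s), (s, s)]  # N_s = 4
image = np.random.default_rng(0).random((8, 8, 3))
I_concat = shifted_concat(image, diagonal)
print(I_concat.shape)  # (8, 8, 15), i.e. C * (N_s + 1) channels
```

The first $C$ channels of the result are the unshifted image, so the standard tokenization remains a special case of the concatenated input.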

This augmented tensor is partitioned into non-overlapping $P\times P$ patches, each flattened to a vector $u_i \in \mathbb{R}^{P^2 C(N_s+1)}$. Each vector passes through LayerNorm and a learned projection $W_e \in \mathbb{R}^{D \times (P^2 C(N_s+1))}$:

$z_i = \mathrm{LayerNorm}(u_i)$

$t_i = W_e\,z_i \in \mathbb{R}^D$

The set $\{t_i\}$ forms the token sequence fed into the Transformer encoder. This procedure generalizes to internal pooling/token-merging layers by reshaping token maps and reapplying SPT (Lee et al., 2021, Verma et al., 2023).

3. Implementation and Hyperparameters

Typical SPT settings, empirically validated in multiple studies, include:

Number of Shifts $N_s$: Four, in either diagonal or cardinal directions.
Shift Magnitude: $\pm P/2$ (half the patch size).
Padding Mode: Zero-padding out-of-bounds pixels after shifting.
Channel-Concatenation Factor: $N_s + 1$, e.g., a factor of $5$ for four shifts.
Projection Dimension $D$: Matches the ViT hidden size, e.g., $192$ or $768$.
Patch Size $P$: Depends on downstream task; $P=16$ for standard ViT, $P=8$ for small datasets.
Shift Operator: Either circular or zero-padding, depending on experiment.

The following table summarizes these core parameters:

Hyperparameter	Typical Value	Description
Number of shifts	4	Directions: Diagonal or Cardinal
Shift magnitude	$P/2$	Pixels per shift
Channel Factor	5	$N_s+1$ for four shifts
Padding	Zero	Out-of-bounds handling
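
To see what these settings imply for the embedding layer, the short sketch below computes the flattened token dimension $P^2 C (N_s+1)$ and the number of weights in $W_e$. The pairing of patch sizes with hidden sizes and the choice $C=3$ are illustrative assumptions, and the function name is hypothetical.

```python
def spt_embedding_size(P, C, D, num_shifts=4):
    """Return (token input dimension, projection weight count) for SPT."""
    in_dim = P * P * C * (num_shifts + 1)
    return in_dim, D * in_dim

# Assumed pairings of patch size and hidden size, for illustration only.
for P, D in [(16, 768), (8, 192)]:
    in_dim, n_weights = spt_embedding_size(P, C=3, D=D)
    print(f"P={P}, D={D}: input dim {in_dim}, W_e has {n_weights} weights")
# P=16, D=768: input dim 3840, W_e has 2949120 weights
# P=8, D=192: input dim 960, W_e has 184320 weights
```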

4. Algorithmic Steps and Pseudocode

The essential algorithm is as follows (Verma et al., 2023, Lee et al., 2021):

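import numpy as np

def SPT_Embed(I, P, W_e, eps=1e-5):
    # I: H×W×C input image (NumPy array); H and W are assumed divisible by P
    # P: patch size (e.g., 16)
    # W_e: D×(5P²C) projection matrix
    H, W, C = I.shape
    s = P // 2  # shift magnitude in whole pixels
    offsets = [(s, 0), (-s, 0), (0, s), (0, -s)]  # cardinal; diagonals are also used
    padded = np.pad(I, ((s, s), (s, s), (0, 0)))  # zero padding on both spatial axes
    views = [I]
    for dy, dx in offsets:
        # Cropping a displaced H×W window shifts the content by (dy, dx)
        views.append(padded[s - dy:s - dy + H, s - dx:s - dx + W])
    I_concat = np.concatenate(views, axis=-1)  # H×W×5C
    # Split into N = (H/P)·(W/P) non-overlapping patches and flatten each one
    patches = I_concat.reshape(H // P, P, W // P, P, 5 * C).transpose(0, 2, 1, 3, 4)
    u = patches.reshape(-1, 5 * C * P * P)  # shape: (N, 5C·P²)
    # LayerNorm over each token; learned scale and bias are omitted in this sketch
    z = (u - u.mean(-1, keepdims=True)) / np.sqrt(u.var(-1, keepdims=True) + eps)
    return z @ W_e.T  # shape: (N, D)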

The same procedure extends to any number of shifts $N_s$, patch size $P$, and embedding dimension $D$; only the offset list and the shape of $W_e$ change.
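
One way to check the receptive-field claim is to perturb a single pixel and observe which tokens change. The sketch below re-implements the pipeline compactly in NumPy under stated assumptions: cardinal shifts, zero padding, LayerNorm without learned scale or bias, and a random $W_e$. The name `spt_tokens` and the toy sizes are illustrative and not taken from the cited papers.

```python
import numpy as np

def spt_tokens(image, P, W_e, eps=1e-5):
    """Cardinal-shift SPT: shift, concatenate, patchify, LayerNorm, project."""
    H, W, _ = image.shape
    s = P // 2
    padded = np.pad(image, ((s, s), (s, s), (0, 0)))  # zero padding
    views = [image] + [padded[s - dy:s - dy + H, s - dx:s - dx + W]
                       for dy, dx in [(s, 0), (-s, 0), (0, s), (0, -s)]]
    x = np.concatenate(views, axis=-1)
    x = x.reshape(H // P, P, W // P, P, -1).transpose(0, 2, 1, 3, 4)
    u = x.reshape((H // P) * (W // P), -1)
    z = (u - u.mean(-1, keepdims=True)) / np.sqrt(u.var(-1, keepdims=True) + eps)
    return z @ W_e.T

rng = np.random.default_rng(0)
P, C, D = 4, 1, 8
image = rng.random((8, 8, C))
W_e = rng.standard_normal((D, 5 * P * P * C))

perturbed = image.copy()
perturbed[3, 1, 0] += 1.0  # pixel in the top-left patch, close to its bottom edge

before = spt_tokens(image, P, W_e)
after = spt_tokens(perturbed, P, W_e)
changed = np.flatnonzero(np.abs(after - before).max(axis=1) > 1e-9)
print(changed)  # [0 2]
```

With plain non-overlapping tokenization only token 0 would react to this pixel. Here token 2, the patch directly below, also changes, because the downward-shifted view carries the pixel across the patch border.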

5. Empirical Results and Ablations

SPT consistently delivers measurable improvements over baseline ViT across both low-level vision and small-scale classification datasets. Selected findings:

Tiny-ImageNet (64×64 grayscale), no discriminator (Verma et al., 2023):
- Denoising: PSNR improves from $27.10$ to $27.23$; SSIM from $78.95\%$ to $79.96\%$; NMSE falls from $1.04\%$ to $0.99\%$.
- Inpainting: PSNR improves from $24.69$ to $24.79$; SSIM from $79.18\%$ to $79.81\%$; NMSE falls from $2.30\%$ to $2.21\%$.
- With adversarial loss, PSNR is further improved by $+0.12\,\mathrm{dB}$ (denoising) and $+0.16\,\mathrm{dB}$ (inpainting) over ViT+GAN.
Classification on CIFAR-100 and Tiny-ImageNet (Lee et al., 2021):
- CIFAR-100: Baseline ViT $73.81\%$ top-1 $\rightarrow$ $76.29\%$ with SPT ($+2.48\%$).
- Tiny-ImageNet: Baseline ViT $57.07\%$ top-1 $\rightarrow$ $60.67\%$ with SPT ($+3.60\%$).
- SPT yields $+1.4$ to $1.6\%$ on ImageNet.
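
For readers unfamiliar with the restoration metrics quoted above, the sketch below shows commonly used definitions of PSNR and NMSE in NumPy. The exact normalization and data range used by Verma et al. are not stated here, so the `data_range=1.0` default and the energy-normalized NMSE are assumptions.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def nmse(reference, estimate):
    """Squared error normalized by the energy of the reference image."""
    return np.sum((reference - estimate) ** 2) / np.sum(reference ** 2)

reference = np.full((64, 64), 0.5)
estimate = reference + 0.05  # uniform error of 0.05 at every pixel
print(round(float(psnr(reference, estimate)), 2))  # 26.02
print(round(float(nmse(reference, estimate)), 4))  # 0.01
```

Because PSNR is logarithmic, a gain of about $0.1\,\mathrm{dB}$ corresponds to a reduction in mean squared error of roughly 2 to 3 percent.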

A plausible implication is that SPT mitigates the data inefficiency and weak locality bias of standard ViT tokenization without incurring substantial computational overhead at scale.

6. Computational Complexity and System Integration

Each shifted view increases the input channel dimension, expanding the pre-projection embedding by a factor of $N_s+1$. This results in:

Memory/Compute Overhead:
- Patch embedding layer’s input grows linearly with $N_s$, i.e., for $N_s=4$, a $5\times$ increase in pre-projection dimensionality.
- Wall-clock and FLOPs: Patch embedding’s increased cost remains a minor fraction compared to attention and feedforward layers (Verma et al., 2023, Lee et al., 2021).
- Temporary memory for concatenated feature maps.
Integration:
- SPT is a drop-in replacement for standard ViT’s patch embedding, requiring no changes to downstream positional encodings, self-attention, or head structure.
- The same procedure applies to internal pooling or merging layers in hierarchical transformers (e.g., Swin, PiT).
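
The claim that the embedding overhead is minor can be sanity-checked with a back-of-the-envelope count of multiply-accumulates (MACs). The sketch below uses a standard rough estimate for a Transformer block and an assumed ViT-Base-like configuration (224×224 input, $P=16$, $D=768$, 12 blocks, MLP ratio 4). It ignores the class token, softmax, and normalization layers, and the function names are hypothetical.

```python
def embedding_macs(H, W, P, C, D, num_shifts):
    """Multiply-accumulates of the linear patch projection."""
    n_tokens = (H // P) * (W // P)
    return n_tokens * D * P * P * C * (num_shifts + 1)

def encoder_block_macs(n_tokens, D, mlp_ratio=4):
    """Rough MACs of one Transformer block (projections, attention, MLP)."""
    projections = 4 * n_tokens * D * D       # Q, K, V and output projection
    attention = 2 * n_tokens * n_tokens * D  # QK^T and attention-weighted sum
    mlp = 2 * mlp_ratio * n_tokens * D * D   # two linear layers
    return projections + attention + mlp

# Assumed configuration, for illustration only.
H = W = 224
P, C, D, depth = 16, 3, 768, 12
n_tokens = (H // P) * (W // P)

plain = embedding_macs(H, W, P, C, D, num_shifts=0)
spt = embedding_macs(H, W, P, C, D, num_shifts=4)
encoder = depth * encoder_block_macs(n_tokens, D)

print(spt / plain)             # 5.0
print(f"{spt / encoder:.2%}")  # SPT embedding cost as a share of encoder cost
```

Under these assumptions the projection is five times more expensive than the plain one, yet it remains a few percent of the encoder's cost.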

7. Use Cases, Benefits, and Practical Recommendations

SPT is empirically shown to be most beneficial in these settings:

Transformers trained from scratch on small or medium datasets, where locality bias compensates for the lack of pretraining data.
Low-level, pixel-based vision tasks (denoising, inpainting) requiring preservation of fine spatial structures.
As a minimal architectural tweak, SPT is easily adoptable in any ViT family or patch-based model without requiring custom self-attention or nonstandard pooling.

Because SPT only modifies tokenization and marginally increases patch embedding cost, and because it yields top-1 gains of roughly $2.5$ to $3.6$ percentage points on CIFAR-100 and Tiny-ImageNet, about $1.4$ to $1.6$ on ImageNet, and measurable gains in PSNR/SSIM (restoration), it is recommended wherever locality bias is desirable but architectural simplicity and computational efficiency are priorities (Lee et al., 2021, Verma et al., 2023).

References

Lee et al. (2021). Vision Transformer for Small-Size Datasets.

Verma et al. (2023). Image Reconstruction using Enhanced Vision Transformer.
