Shifted Patch Tokenization (SPT) in Vision Transformers
- Shifted Patch Tokenization (SPT) is an embedding technique that aggregates overlapping spatial information from shifted image views to enhance locality in Vision Transformers.
- It increases the effective receptive field without altering the self-attention mechanism, making it effective for low-level vision tasks and small-data regimes.
- Empirical results show SPT boosts classification accuracy by up to 3.60% on Tiny-ImageNet and improves restoration metrics in tasks like denoising and inpainting.
Shifted Patch Tokenization (SPT) is an embedding technique designed to augment the locality bias and receptive field of patch-based Transformer architectures, especially Vision Transformers (ViT), by aggregating overlapping spatial information from shifted versions of the input. SPT addresses inherent deficiencies of standard non-overlapping patch tokenization—namely, the limited context captured by each token and the resulting inductive bias mismatch for low-level vision or small-data regimes. The approach enhances both model data efficiency and downstream task performance by directly integrating cross-patch spatial structures at the embedding stage, without modifying the self-attention mechanism or overall Transformer architecture (Lee et al., 2021, Verma et al., 2023).
1. Motivation and Underlying Principles
Standard ViT tokenization slices an image into non-overlapping patches, each of which is flattened and linearly projected. This arrangement provides each token with information from a strictly local region without access to spatially adjacent pixels, leading to the following limitations:
- Limited local receptive field per token and no shared pixels across tokens, resulting in weak inductive locality bias.
- The burden of learning local structures is transferred entirely to the Transformer’s self-attention, which is inefficient for small datasets or low-level vision tasks.
SPT addresses these deficits by introducing redundancy and spatial overlap at the tokenization step. By generating and concatenating several shifted versions of the input image (each displaced by a pre-defined offset), SPT allows each patch token to aggregate information from overlapping spatial neighborhoods. This approach raises the effective receptive field of each token and breaks the rigid partitioning boundaries present in vanilla ViT, enabling local continuity in the learned representations (Lee et al., 2021, Verma et al., 2023).
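To make the receptive-field claim concrete, the following sketch counts how many distinct input pixels can influence a single token, with and without shifting. It assumes diagonal shifts of half the patch size and zero padding; the helper names `shift_zero_pad` and `token_footprint` are illustrative and do not come from the cited papers.

```python
import numpy as np

def shift_zero_pad(img, dy, dx):
    """Move the content of a 2-D array by (dy, dx) pixels, filling vacated cells with 0."""
    H, W = img.shape
    pad = max(abs(dy), abs(dx))
    padded = np.pad(img, pad)  # constant zero padding on all sides
    return padded[pad - dy:pad - dy + H, pad - dx:pad - dx + W]

def token_footprint(H, W, P, offsets, row, col):
    """Count the distinct input pixels visible to the token at patch (row, col)."""
    ids = np.arange(1, H * W + 1).reshape(H, W)  # unique id per pixel; 0 marks padding
    views = [ids] + [shift_zero_pad(ids, dy, dx) for dy, dx in offsets]
    stacked = np.stack(views, axis=-1)           # H x W x (number of views)
    patch = stacked[row * P:(row + 1) * P, col * P:(col + 1) * P]
    seen = np.unique(patch)
    return int((seen > 0).sum())

P = 8
s = P // 2
diagonal = [(-s, -s), (-s, s), (s, -s), (s, s)]

plain = token_footprint(32, 32, P, [], row=1, col=1)       # vanilla tokenization
spt = token_footprint(32, 32, P, diagonal, row=1, col=1)   # four diagonal shifts
print(plain, spt)  # 64 256
```

For an interior token, the four diagonal views tile a $2P \times 2P$ window around the patch, so under these assumptions the footprint grows from $P^2$ to $4P^2$ pixels.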
2. Mathematical Formulation
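Let $I \in \mathbb{R}^{H \times W \times C}$ denote the input. Define a set of $K$ shifts with offsets $\delta_k$, $k = 1, \dots, K$, such as diagonal or cardinal directions, with shift size typically $P/2$ for patch size $P$. For each offset $\delta_k$:

$$I^{(k)} = \mathrm{Shift}(I, \delta_k)$$

All shifted images, along with the unshifted $I$, are concatenated channel-wise:

$$\tilde{I} = \mathrm{Concat}\big(I, I^{(1)}, \dots, I^{(K)}\big) \in \mathbb{R}^{H \times W \times (K+1)C}$$

This augmented tensor is partitioned into $N = HW/P^2$ non-overlapping $P \times P$ patches, each flattened to a vector $u_i \in \mathbb{R}^{(K+1)CP^2}$. Each vector passes through LayerNorm and a learned projection $W_e \in \mathbb{R}^{D \times (K+1)CP^2}$:

$$t_i = W_e \, \mathrm{LN}(u_i), \quad i = 1, \dots, N$$

The set $\{t_i\}_{i=1}^{N}$ forms the token sequence fed into the Transformer encoder. This procedure generalizes to internal pooling/token-merging layers by reshaping token maps and reapplying SPT (Lee et al., 2021, Verma et al., 2023).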
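As a worked shape check, assume a $64 \times 64$ RGB input (height $H = 64$, width $W = 64$, channels $C = 3$), patch size $P = 8$, $K = 4$ shifts, and embedding dimension $D = 192$. These values are illustrative choices, not settings taken from the cited papers.

$$
\begin{aligned}
\text{concatenated channels} &= (K+1)\,C = 5 \cdot 3 = 15, \\
\text{number of tokens} &= \frac{HW}{P^2} = \frac{64 \cdot 64}{8^2} = 64, \\
\text{flattened patch length} &= (K+1)\,C\,P^2 = 15 \cdot 64 = 960, \\
\text{projection matrix shape} &= D \times (K+1)\,C\,P^2 = 192 \times 960.
\end{aligned}
$$

Without shifting ($K = 0$), the flattened patch length would be $C P^2 = 192$, so the token count is unchanged and only the projection input grows.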
3. Implementation and Hyperparameters
Typical SPT settings, empirically validated in multiple studies, include:
- Number of Shifts $K$: Four, in either diagonal or cardinal directions.
- Shift Magnitude: $P/2$ (half the patch size).
- Padding Mode: Zero-padding out-of-bounds pixels after shifting.
- Channel-Concatenation Factor: $K + 1$, e.g., factor of $5$ for four shifts.
- Projection Dimension $D$: Matches the ViT hidden size, e.g., $192$, $768$.
- Patch Size $P$: Depends on downstream task; e.g., $P = 16$ for standard ViT, with smaller patches for small datasets.
- Shift Operator: Either circular or zero-padding, depending on experiment.
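The two shift operators mentioned in the list differ only at the image border. The sketch below contrasts them on a tiny array; the function names are illustrative, and which operator a given experiment used is not specified here.

```python
import numpy as np

def shift_circular(img, dy, dx):
    """Circular shift: pixels leaving one edge re-enter at the opposite edge."""
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

def shift_zero(img, dy, dx):
    """Zero-padded shift: vacated pixels are filled with zeros."""
    H, W = img.shape[:2]
    out = np.zeros_like(img)
    src_y, dst_y = slice(max(0, -dy), min(H, H - dy)), slice(max(0, dy), min(H, H + dy))
    src_x, dst_x = slice(max(0, -dx), min(W, W - dx)), slice(max(0, dx), min(W, W + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

img = np.arange(1, 17).reshape(4, 4)
circ = shift_circular(img, 1, 0)   # move content down by one row
zero = shift_zero(img, 1, 0)

print(circ[0])  # [13 14 15 16]  (last row wraps around to the top)
print(zero[0])  # [0 0 0 0]      (top row is padding)
```

Away from the border the two results are identical, so the choice mainly affects tokens along the image edges.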
The following table summarizes these core parameters:
| Hyperparameter | Typical Value | Description |
|---|---|---|
| Number of shifts | 4 | Directions: diagonal or cardinal |
| Shift magnitude | $P/2$ | Pixels per shift |
| Channel factor | 5 | $K + 1$ for four shifts ($K = 4$) |
| Padding | Zero | Out-of-bounds handling |
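These settings can be grouped into a small configuration object. The following is a minimal sketch; the class name `SPTConfig` and its field names are hypothetical, and the defaults simply mirror the example values listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SPTConfig:
    """Hypothetical container for the SPT hyperparameters listed above."""
    patch_size: int = 16          # P
    in_channels: int = 3          # C
    num_shifts: int = 4           # K
    embed_dim: int = 768          # D, matches the ViT hidden size
    directions: str = "diagonal"  # or "cardinal"
    padding: str = "zero"         # or "circular"

    @property
    def shift_magnitude(self) -> int:
        """Half the patch size, in pixels."""
        return self.patch_size // 2

    @property
    def channel_factor(self) -> int:
        """Original view plus one view per shift."""
        return self.num_shifts + 1

    @property
    def token_input_dim(self) -> int:
        """Length of each flattened patch before projection."""
        return self.channel_factor * self.in_channels * self.patch_size ** 2

cfg = SPTConfig()
print(cfg.shift_magnitude, cfg.channel_factor, cfg.token_input_dim)  # 8 5 3840
```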
4. Algorithmic Steps and Pseudocode
The essential algorithm is as follows (Verma et al., 2023, Lee et al., 2021):
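```python
import numpy as np

def shift_image(I, dy, dx):
    """Shift an H×W×C image by (dy, dx) pixels; vacated pixels are zero-padded."""
    H, W, _ = I.shape
    pad = max(abs(dy), abs(dx))
    padded = np.pad(I, ((pad, pad), (pad, pad), (0, 0)))
    return padded[pad - dy:pad - dy + H, pad - dx:pad - dx + W]  # cropped back to H×W

def layer_norm(u, eps=1e-6):
    """LayerNorm over the last axis (learnable scale and bias omitted for brevity)."""
    return (u - u.mean(-1, keepdims=True)) / np.sqrt(u.var(-1, keepdims=True) + eps)

def SPT_Embed(I, P, W_e):
    # I: H×W×C input image (H and W divisible by P)
    # P: patch size (e.g., 16)
    # W_e: D×(5P²C) projection matrix
    H, W, C = I.shape
    s = P // 2
    offsets = [(s, 0), (-s, 0), (0, s), (0, -s)]  # or diagonals
    shifts = [shift_image(I, dy, dx) for dy, dx in offsets]
    I_concat = np.concatenate([I] + shifts, axis=-1)  # H×W×5C
    # Split into N = (H/P)·(W/P) patches of shape P×P×5C and flatten each one
    patches = I_concat.reshape(H // P, P, W // P, P, 5 * C).transpose(0, 2, 1, 3, 4)
    u = patches.reshape(-1, 5 * C * P * P)  # shape: (N, 5C·P²)
    z = layer_norm(u)                       # same shape
    return z @ W_e.T                        # shape: (N, D)
```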
This function generalizes to any number of shifts $K$, patch size $P$, and embedding dimension $D$.
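One way to express that generality is to pass the offsets as an argument, so that the number of shifts is simply the length of the list. The sketch below is an assumption-laden illustration: `make_offsets` and `spt_tokens` are hypothetical names, LayerNorm is applied without learnable parameters, and the random weights stand in for a trained projection.

```python
import numpy as np

def make_offsets(P, mode="diagonal"):
    """Return four (dy, dx) offsets of magnitude P // 2 in the requested directions."""
    s = P // 2
    if mode == "diagonal":
        return [(-s, -s), (-s, s), (s, -s), (s, s)]
    if mode == "cardinal":
        return [(s, 0), (-s, 0), (0, s), (0, -s)]
    raise ValueError(f"unknown mode: {mode}")

def spt_tokens(image, P, offsets, W_e, eps=1e-6):
    """Tokenize an H x W x C image with an arbitrary list of zero-padded shifts."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    K = len(offsets)
    pad = max((max(abs(dy), abs(dx)) for dy, dx in offsets), default=0)
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    views = [image] + [padded[pad - dy:pad - dy + H, pad - dx:pad - dx + W]
                       for dy, dx in offsets]
    x = np.concatenate(views, axis=-1)                       # H x W x (K+1)C
    x = x.reshape(H // P, P, W // P, P, (K + 1) * C).transpose(0, 2, 1, 3, 4)
    u = x.reshape(-1, P * P * (K + 1) * C)                   # (N, (K+1)CP^2)
    z = (u - u.mean(axis=1, keepdims=True)) / np.sqrt(u.var(axis=1, keepdims=True) + eps)
    return z @ W_e.T                                         # (N, D)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
P, D = 4, 192
shapes = {}
for offsets in ([], make_offsets(P, "cardinal"),
                make_offsets(P, "diagonal") + make_offsets(P, "cardinal")):
    K = len(offsets)
    W_e = rng.standard_normal((D, (K + 1) * 3 * P * P)) * 0.02  # placeholder weights
    shapes[K] = (W_e.shape, spt_tokens(image, P, offsets, W_e).shape)
    print(K, *shapes[K])
```

With an empty offset list the function reduces to plain non-overlapping tokenization (plus LayerNorm); in every case the token count stays at $N = 64$ and only the projection's input width changes.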
5. Empirical Results and Ablations
SPT consistently delivers measurable improvements over baseline ViT across both low-level vision and small-scale classification datasets. Selected findings:
- Tiny-ImageNet (64×64 grayscale), no discriminator (Verma et al., 2023):
- Classification on CIFAR-100 and Tiny-ImageNet (Lee et al., 2021):
- CIFAR-100: Baseline ViT top-1 with SPT ().
- Tiny-ImageNet: Baseline ViT top-1 with SPT ().
- SPT yields – on ImageNet.
A plausible implication is that SPT mitigates the data inefficiency and weak locality bias of standard ViT without incurring substantial computational overhead at scale.
6. Computational Complexity and System Integration
Each shifted view increases the input channel dimension, expanding the pre-projection embedding by a factor of $K + 1$ for $K$ shifts. This results in:
- Memory/Compute Overhead:
  - Patch embedding layer’s input grows linearly with $K$, i.e., for $K = 4$, a $5\times$ increase in pre-projection dimensionality.
  - Wall-clock and FLOPs: Patch embedding’s increased cost remains a minor fraction compared to attention and feedforward layers (Verma et al., 2023, Lee et al., 2021).
  - Temporary memory for concatenated feature maps.
- Integration:
  - SPT is a drop-in replacement for standard ViT’s patch embedding, requiring no changes to downstream positional encodings, self-attention, or head structure.
  - The same operation can be applied to internal pooling or merging layers in hierarchical Transformers (e.g., Swin, PiT).
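A back-of-the-envelope count illustrates why the embedding overhead stays small. The sketch below uses common approximations for multiply-accumulate (MAC) counts and ignores the class token, LayerNorm, softmax, and biases; the ViT-Base/16-style configuration and the function names are assumptions chosen for illustration, not measurements from the cited papers.

```python
def patch_embed_macs(H, W, C, P, D, K):
    """MACs of the linear projection applied to every flattened patch."""
    N = (H // P) * (W // P)
    return N * D * (K + 1) * C * P * P

def encoder_macs(H, W, P, D, depth, mlp_ratio=4):
    """Approximate MACs of a stack of standard Transformer encoder layers."""
    N = (H // P) * (W // P)
    projections = 4 * N * D * D       # query, key, value, and output projections
    attention = 2 * N * N * D         # attention scores and weighted sum of values
    mlp = 2 * mlp_ratio * N * D * D   # two linear layers with hidden width mlp_ratio * D
    return depth * (projections + attention + mlp)

# Assumed configuration: 224x224 RGB input, P = 16, D = 768, 12 layers.
H = W = 224
C, P, D, depth = 3, 16, 768, 12

base_embed = patch_embed_macs(H, W, C, P, D, K=0)
spt_embed = patch_embed_macs(H, W, C, P, D, K=4)
encoder = encoder_macs(H, W, P, D, depth)

ratio = spt_embed / encoder
print(f"embedding grows {spt_embed // base_embed}x; SPT embedding / encoder = {ratio:.3f}")
```

Under these assumptions the SPT projection costs five times the standard one, yet it amounts to only about 3% of the encoder's MACs.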
7. Use Cases, Benefits, and Practical Recommendations
SPT is empirically shown to be most beneficial in these settings:
- Transformers trained from scratch on small or medium datasets, where locality bias compensates for the lack of pretraining data.
- Low-level, pixel-based vision tasks (denoising, inpainting) requiring preservation of fine spatial structures.
- ViT-family or other patch-based models where a minimal architectural change is preferred, since SPT requires no custom self-attention or nonstandard pooling.
Because SPT only modifies tokenization and marginally increases patch embedding cost, and because it yields consistent accuracy gains of $2$– (classification) and measurable gains in PSNR/SSIM (restoration), it is recommended wherever locality bias is desirable but architectural simplicity and computational efficiency are priorities (Lee et al., 2021, Verma et al., 2023).