Patch n’ Pack for Vision Transformers
- Patch n’ Pack is a technique for training Vision Transformers on images with arbitrary resolutions by packing patch tokens from multiple images into fixed-size sequences.
- It employs patch-wise tokenization, dynamic token dropping, and greedy packing with self-attention masking to efficiently process variable-length image data.
- This method improves throughput, model accuracy, and robustness across diverse tasks while requiring minimal changes to standard ViT pipelines.
Patch n’ Pack sequence packing is a method for efficiently training Vision Transformer (ViT) models on images of arbitrary resolutions and aspect ratios by organizing variable-length patch token sequences from multiple images into packed, fixed-size token sequences. This approach enables native resolution processing, increases throughput, and enhances model robustness while imposing minimal changes to standard ViT pipelines. Patch n’ Pack underpins NaViT (Native Resolution ViT), which supports diverse downstream vision tasks and facilitates superior trade-offs between compute cost and model accuracy (Dehghani et al., 2023).
1. Image Patching and Tokenization
Patch n’ Pack operates on the principle of patch-wise tokenization. Given an image with height $H$ and width $W$, non-overlapping patches of size $p \times p$ are extracted. The number of patches (tokens) per image is $L = (H/p)\cdot(W/p)$. Each patch is flattened to a vector and projected to the transformer’s model dimension $d$ via a linear layer, producing embedded tokens $x_1, \dots, x_L \in \mathbb{R}^d$. A positional embedding of dimension $d$ is added to each token, with the particular embedding scheme elaborated in Section 5.
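As a concrete illustration of this tokenization step, the NumPy sketch below (function name, toy dimensions, and the random projection are illustrative assumptions, not the NaViT implementation) extracts non-overlapping $p \times p$ patches from an image of arbitrary aspect ratio and projects them to dimension $d$:

```python
import numpy as np

def patchify(image: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p x p patches,
    flattened to vectors of length p*p*C; returns (L, p*p*C) with L = (H//p)*(W//p)."""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image sides must be multiples of the patch size"
    patches = image.reshape(H // p, p, W // p, p, C)   # split both spatial axes
    patches = patches.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)              # one flattened vector per patch

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 48, 3))                   # native (non-square) resolution
p, d = 16, 32
tokens = patchify(image, p)                            # (12, 768): L = 4 * 3 patches
W_proj = rng.normal(scale=0.02, size=(p * p * 3, d))   # stand-in for the learned projection
embedded = tokens @ W_proj                             # (12, 32) embedded tokens
print(embedded.shape)
```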
2. Mathematical Formulation of Sequence Packing
Patch n’ Pack packs tokens from multiple images into sequences of fixed maximum length $L_{\max}$, processing them jointly in a single forward pass. Let $L_i$ denote the number of tokens for image $i$ (potentially after patch dropping). Packing proceeds by concatenating the token sequences from $n$ images such that:

$$\sum_{i=1}^{n} L_i \le L_{\max}.$$
Unused positions in the packed sequence are filled with a special PAD token. In practice, $B$ such sequences per minibatch are constructed, shaping the input tensor as $(B, L_{\max}, d)$.
Token-dropping: To accelerate training, an image-specific random drop rate $d_i \in [0, 1)$ reduces the token count of image $i$ from $L_i^{(0)}$ to:

$$L_i = L_i^{(0)} - \lfloor d_i\, L_i^{(0)} \rfloor.$$
The drop rates may be drawn from constant values or Beta distributions, or adapted dynamically as training progresses.
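As a sketch of per-image token dropping (the Beta parameters, names, and use of NumPy are assumptions for illustration, not the reference implementation):

```python
import numpy as np

def drop_tokens(tokens: np.ndarray, rng: np.random.Generator,
                a: float = 2.0, b: float = 5.0) -> np.ndarray:
    """Randomly drop a Beta(a, b)-sampled fraction of one image's tokens.

    tokens: (L_i0, d) embedded patch tokens for one image.
    Returns the surviving (L_i, d) tokens with L_i = L_i0 - floor(d_i * L_i0).
    """
    L_i0 = tokens.shape[0]
    d_i = rng.beta(a, b)                           # per-image drop rate
    n_drop = int(np.floor(d_i * L_i0))
    keep = rng.permutation(L_i0)[: L_i0 - n_drop]  # indices of surviving tokens
    return tokens[np.sort(keep)]                   # keep original token order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 32))                # e.g. a 224x224 image at p = 16
print(drop_tokens(tokens, rng).shape)
```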
Self-attention masking: To preserve independence between images packed together, an additive attention mask $M$ is used:

$$M_{tu} = \begin{cases} 0 & \text{if tokens } t, u \text{ originate from the same image} \\ -\infty & \text{otherwise} \end{cases}$$
This mask is added to attention logits, ensuring self-attention occurs only within an image.
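A minimal sketch of building this mask from per-token image IDs (the ID convention and names are assumptions; packed pipelines may realize the mask differently):

```python
import numpy as np

def make_attention_mask(image_ids: np.ndarray) -> np.ndarray:
    """Build an additive (L_max, L_max) attention mask from per-token image IDs.

    image_ids: (L_max,) integer ID of each token's source image
               (a reserved ID, e.g. -1, marks PAD positions).
    Returns 0 where two tokens share an image, -inf elsewhere.
    """
    same_image = image_ids[:, None] == image_ids[None, :]
    return np.where(same_image, 0.0, -np.inf)

# Example: a packed sequence with 3 tokens of image 0, 2 of image 1, 1 PAD token.
ids = np.array([0, 0, 0, 1, 1, -1])
logits = np.zeros((6, 6)) + make_attention_mask(ids)  # added to attention logits
print(np.isfinite(logits).astype(int))                # block-diagonal attention pattern
```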
Pooling and loss: A masked [CLS]-style or attention-pooling head extracts one feature vector per image for downstream loss calculation (e.g., cross-entropy, distributed contrastive loss). Contrastive pretraining uses chunked loss computation to avoid quadratic scaling with the increased number of images per batch.
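The following sketch uses per-image mean pooling over the packed sequence as a simplified stand-in for NaViT's masked attention-pooling head (the function and the $-1$ PAD convention are assumptions for illustration):

```python
import numpy as np

def masked_mean_pool(tokens: np.ndarray, image_ids: np.ndarray, n_images: int) -> np.ndarray:
    """Pool a packed (L_max, d) token sequence into one (n_images, d) vector per image.

    Mean pooling is a simplified stand-in for attention pooling;
    PAD positions (image_ids == -1) are ignored.
    """
    pooled = np.zeros((n_images, tokens.shape[1]))
    for i in range(n_images):
        pooled[i] = tokens[image_ids == i].mean(axis=0)
    return pooled

tokens = np.random.default_rng(0).normal(size=(6, 4))
ids = np.array([0, 0, 0, 1, 1, -1])
print(masked_mean_pool(tokens, ids, n_images=2).shape)   # (2, 4): one vector per image
```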
3. Packing Algorithm and Pseudo-code
The Patch n’ Pack algorithm follows a two-phase process: preparing tokens with optional dropping and resolution sampling, followed by greedy packing and attention masking. The pseudo-code below summarizes the approach:
```
packed_sequences = []
current_seq, current_len = [], 0

for each image in minibatch:
    # Phase 1: per-image preparation
    R_i = R_sampler()                      # sample target resolution
    resize image (preserve aspect ratio, set area ≈ R_i^2)
    H_i, W_i = image shape
    L_i0 = (H_i / p) * (W_i / p)           # token count before dropping
    d_i = d_sampler()                      # sample drop rate
    drop floor(d_i * L_i0) patches at random; L_i = number remaining
    tokens_i = patch_embed(image)          # (L_i, d) surviving tokens

    # Phase 2: greedy packing
    if current_len + L_i <= L_max:
        append tokens_i to current_seq; record image boundaries
        current_len += L_i
    else:
        pad current_seq to L_max with PAD tokens
        packed_sequences.append(current_seq)
        current_seq, current_len = [tokens_i], L_i

if current_seq is non-empty:
    pad current_seq to L_max and append to packed_sequences
```
Unused positions are padded, and attention and pooling structures are adjusted according to the packed boundaries (Dehghani et al., 2023).
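For concreteness, the runnable NumPy sketch below implements just the greedy packing step under the assumption that every image fits in one sequence ($L_i \le L_{\max}$); the function name, PAD convention (zero vectors with image ID $-1$), and shapes are illustrative choices, not the reference implementation. It returns the packed $(B, L_{\max}, d)$ tensor together with the per-token image IDs that drive the attention mask and pooling head shown above:

```python
import numpy as np

def greedy_pack(token_seqs, L_max: int, d: int):
    """Greedily pack variable-length (L_i, d) token arrays into fixed-length sequences.

    Returns (packed, image_ids): packed has shape (B, L_max, d); image_ids has shape
    (B, L_max) holding the index of each token's source image, with -1 for PAD.
    """
    packed, image_ids = [], []
    cur_tokens, cur_ids, cur_len = [], [], 0

    def flush():
        pad = L_max - cur_len
        packed.append(np.concatenate(cur_tokens + [np.zeros((pad, d))], axis=0))
        image_ids.append(np.concatenate(cur_ids + [np.full(pad, -1)], axis=0))

    for img_idx, toks in enumerate(token_seqs):
        L_i = toks.shape[0]
        assert L_i <= L_max, "assumes each image fits within one packed sequence"
        if cur_len + L_i > L_max and cur_len > 0:   # sequence full: pad and start a new one
            flush()
            cur_tokens, cur_ids, cur_len = [], [], 0
        cur_tokens.append(toks)
        cur_ids.append(np.full(L_i, img_idx))
        cur_len += L_i
    if cur_len > 0:                                 # flush the last partial sequence
        flush()
    return np.stack(packed), np.stack(image_ids)

rng = np.random.default_rng(0)
seqs = [rng.normal(size=(L, 8)) for L in (20, 35, 50, 15)]   # four images, variable token counts
packed, ids = greedy_pack(seqs, L_max=64, d=8)
print(packed.shape, ids.shape)   # (3, 64, 8) (3, 64)
```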
4. Computational Complexity and Efficiency
Patch n’ Pack sequence packing introduces several efficiency and complexity considerations:
- Attention Overhead: Self-attention cost scales as $O(L_{\max}^2 \cdot d)$. Empirically, the compute increase over single-image ViT is less than 10% for large models, since the quadratic attention term accounts for a shrinking share of FLOPs relative to the growing MLP blocks (see the worked estimate after this list).
- Throughput: Standard ViT processes one image per forward pass ($L$ tokens). NaViT typically packs several images (up to 6) per sequence of $L_{\max}$ tokens while maintaining comparable wall-clock time per step. In JFT pretraining, this yields approximately five times greater image throughput per compute budget.
- Memory: The packed tensor avoids dynamic fragmentation. Additional overhead for attention masks is negligible. Padding is minimized (<2% of tokens) with appropriate choices of $L_{\max}$ and resolution sampling.
- Padding: Efficient packing and flexible image sizing result in minimal wasted compute for PAD tokens.
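To make the attention-overhead point concrete, the rough estimate below uses standard per-layer transformer FLOP approximations (the packed length and model widths are assumed values, not figures from the paper) to show that the $O(L_{\max}^2)$ attention terms account for a shrinking share of compute as model width grows:

```python
# Rough per-layer FLOP estimate (standard transformer approximations;
# multiply-accumulates counted as 2 FLOPs). Dimensions below are assumptions.
def attention_share(L: int, d: int, mlp_ratio: int = 4) -> float:
    """Fraction of per-layer FLOPs spent on the O(L^2) attention terms."""
    proj = 2 * 4 * L * d * d                 # Q, K, V and output projections
    mlp = 2 * 2 * mlp_ratio * L * d * d      # two MLP matmuls
    attn = 2 * 2 * L * L * d                 # QK^T logits and attention-weighted values
    return attn / (proj + mlp + attn)

L_max = 1024                                  # packed sequence length (assumed)
for d in (768, 1024, 1664):                   # ViT-B / ViT-L / ViT-g-like widths
    print(f"d={d}: quadratic attention is {attention_share(L_max, d):.1%} of layer FLOPs")
```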
Summary Table: Packing vs Standard ViT
| Aspect | Standard ViT | Patch n’ Pack (NaViT) |
|---|---|---|
| Sequence length | $L$ (fixed resolution) | $L_{\max}$ (packed) |
| Images per pass | $1$ | several (up to 6) |
| Attention cost | $O(L^2 d)$ | $O(L_{\max}^2 d)$ (~10% extra) |
| Padding overhead | $0$ | $<2\%$ of tokens |
5. Pipeline Modifications for Packed Sequences
Integrating Patch n’ Pack requires minor but essential changes to the ViT pipeline:
- Masked Self-Attention: The attention softmax incorporates the per-sequence masks described above to prevent attention between tokens of different images.
- Masked Pooling Head: Pooling heads (CLS or attention-pooling) must recognize sequence boundaries, producing one vector per image.
- Factorized Positional Embeddings: Single-image ViTs employ 1D positional embeddings over a fixed patch grid. For variable aspect ratios and resolutions, 2D factorized positional embeddings are used:

$$e_{\text{pos}}(x, y) = \phi_x(x) + \phi_y(y),$$

where $(x, y)$ are normalized token coordinates and $\phi_x, \phi_y$ are learnable or sinusoidal mappings into $\mathbb{R}^d$. This formulation generalizes to unseen resolutions and aspect ratios, supporting smooth scaling and extrapolation (a minimal sketch follows below).
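A minimal sketch of the learned factorized variant (NumPy; the bucketed lookup of normalized coordinates, table size, and names are simplifying assumptions rather than the NaViT parameterization):

```python
import numpy as np

class FactorizedPosEmbed:
    """Learned factorized 2D positional embeddings: e_pos(x, y) = phi_x(x) + phi_y(y).

    Normalized coordinates in [0, 1) are mapped to one of `n_buckets` learned
    embeddings per axis (a simplification; interpolation or sinusoidal phi are
    equally valid choices).
    """
    def __init__(self, d: int, n_buckets: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.phi_x = rng.normal(scale=0.02, size=(n_buckets, d))  # learnable in practice
        self.phi_y = rng.normal(scale=0.02, size=(n_buckets, d))
        self.n_buckets = n_buckets

    def __call__(self, h_patches: int, w_patches: int) -> np.ndarray:
        """Return (h_patches * w_patches, d) positional embeddings for one image."""
        ys, xs = np.meshgrid(np.arange(h_patches), np.arange(w_patches), indexing="ij")
        # Normalize coordinates to [0, 1) so any resolution maps into the same tables.
        x_idx = (xs.ravel() / w_patches * self.n_buckets).astype(int)
        y_idx = (ys.ravel() / h_patches * self.n_buckets).astype(int)
        return self.phi_x[x_idx] + self.phi_y[y_idx]

pos = FactorizedPosEmbed(d=32)
print(pos(4, 3).shape)   # a 4x3 patch grid -> (12, 32); works for any aspect ratio
```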
At inference, the framework supports arbitrary choice of image resolution and packing structure, enabling performance-cost trade-offs without architectural changes.
6. Empirical Evaluation and Trade-Offs
Experiments reveal several notable trade-offs and performance characteristics:
- Pre-training Efficiency: NaViT-B/16 reaches the peak JFT-pretraining performance of ViT-B/16 in substantially fewer TPU-hours than the ViT-B/16 baseline requires, mainly by seeing more images per unit compute via packing and token dropping.
- Variable Resolution Training: Sampling image resolutions while preserving aspect ratio outperforms fixed-resolution baselines across compute budgets. At inference, NaViT allows for smooth cost-accuracy calibration based on input resolution.
- Token-Dropping Strategies: Drawing per-image drop rates from a Beta distribution and decaying the drop rate over training yields accuracy gains of roughly $0.5$ points or more at constant compute.
- Downstream and OOD Robustness:
  - On zero-shot and linear-probe ImageNet, NaViT surpasses compute-matched ViT by roughly $2$ points or more.
  - On robustness datasets (ImageNet-A, ObjectNet), substantial gains are observed.
  - Calibration error remains stable as the number of tokens per image at inference varies from 128 to 1024.
  - For semantic segmentation (ADE20k), gains of up to $2$ mIoU are obtained at matched finetuning FLOPs.
  - For detection (LVIS rare classes), NaViT improves AP over a comparable ViT-L/14 backbone.
  - On FairFace and CelebA attribute prediction, high label accuracy is achieved with a frozen NaViT encoder.
  - In video classification (Kinetics400), NaViT-L matches ViViT-L (80.4%) with fewer training epochs.
- Inference Cascades: Using low-token (low-resolution) models to filter easy examples and cascading only the harder fraction to higher-token models yields Pareto-optimal latency/accuracy trade-offs (a sketch of such a cascade follows this list).
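As an illustration of such a cascade (a hypothetical sketch: the confidence-threshold rule, the stand-in models, and the threshold value are assumptions, not the paper's exact procedure):

```python
import numpy as np

def cascade_predict(images, cheap_model, expensive_model, threshold: float = 0.8):
    """Two-stage inference cascade.

    cheap_model / expensive_model: callables returning (n, n_classes) class probabilities,
    e.g. the same NaViT weights run at a low vs. high token count per image.
    Easy examples (max probability >= threshold) keep the cheap prediction; the rest
    are re-run through the expensive model.
    """
    probs = cheap_model(images)
    preds = probs.argmax(axis=1)
    hard = probs.max(axis=1) < threshold
    if hard.any():
        preds[hard] = expensive_model(images[hard]).argmax(axis=1)
    return preds, hard.mean()   # predictions and the fraction escalated

# Toy usage with stand-in "models" that return random probabilities.
rng = np.random.default_rng(0)
fake = lambda x: rng.dirichlet(np.ones(10), size=len(x))
images = np.zeros((32, 224, 224, 3))
preds, escalated = cascade_predict(images, fake, fake)
print(preds.shape, f"{escalated:.0%} escalated")
```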
The approach integrates with minimal effort into existing ViT pipelines, requiring only data-loader changes (for sequence packing, resolution sampling, and drop-rate sampling), attention masking, and factorized positional embeddings. The resulting framework delivers roughly $3\times$ or more images per compute hour, enhanced transfer effectiveness, improved robustness, and flexible test-time configuration (Dehghani et al., 2023).
7. Applications and Implications
Patch n’ Pack is compatible with supervised and contrastive image-text pretraining and applies to tasks including image and video classification, object detection, semantic segmentation, and fairness or robustness benchmarking. Its ability to handle native image resolutions and manage cost-accuracy trade-offs at inference provides a flexible tool for practical deployment. The architectural changes are limited in scope, enabling broad applicability for ViT-style models. This approach marks a shift from the fixed-resolution, CNN-inspired pipelines towards a more adaptive, sequence-based vision modeling paradigm. A plausible implication is the increasing relevance of sequence-packing and patch-level model design for scalable, heterogeneous vision workloads (Dehghani et al., 2023).