Patch n’ Pack for Vision Transformers
- Patch n’ Pack is a technique for training Vision Transformers on images with arbitrary resolutions by packing patch tokens from multiple images into fixed-size sequences.
- It employs patch-wise tokenization, dynamic token dropping, and greedy packing with self-attention masking to efficiently process variable-length image data.
- This method improves throughput, model accuracy, and robustness across diverse tasks while requiring minimal changes to standard ViT pipelines.
Patch n’ Pack sequence packing is a method for efficiently training Vision Transformer (ViT) models on images of arbitrary resolutions and aspect ratios by organizing variable-length patch token sequences from multiple images into packed, fixed-size token sequences. This approach enables native resolution processing, increases throughput, and enhances model robustness while imposing minimal changes to standard ViT pipelines. Patch n’ Pack underpins NaViT (Native Resolution ViT), which supports diverse downstream vision tasks and facilitates superior trade-offs between compute cost and model accuracy (Dehghani et al., 2023).
1. Image Patching and Tokenization
Patch n’ Pack operates on the principle of patch-wise tokenization. Given an image with height $H$ and width $W$, non-overlapping patches of size $p \times p$ are extracted. The number of patches (tokens) per image is $L = (H/p)\cdot(W/p)$. Each patch is flattened to a vector and projected to the transformer’s model dimension $d$ via a linear layer, producing embedded tokens $x_1, \dots, x_L \in \mathbb{R}^d$. A positional embedding of dimension $d$ is added to each token, with the particular embedding scheme elaborated in Section 5.
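As a concrete illustration of this tokenization step, the NumPy sketch below (function name, toy dimensions, and the random projection are illustrative assumptions, not the NaViT implementation) extracts non-overlapping $p \times p$ patches from an image of arbitrary aspect ratio and projects them to dimension $d$:

```python
import numpy as np

def patchify(image: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p x p patches,
    flattened to vectors of length p*p*C; returns (L, p*p*C) with L = (H//p)*(W//p)."""
    H, W, C = image.shape
    assert H % p == 0 and W % p == 0, "image sides must be multiples of the patch size"
    patches = image.reshape(H // p, p, W // p, p, C)   # split both spatial axes
    patches = patches.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)              # one flattened vector per patch

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 48, 3))                   # native (non-square) resolution
p, d = 16, 32
tokens = patchify(image, p)                            # (12, 768): L = 4 * 3 patches
W_proj = rng.normal(scale=0.02, size=(p * p * 3, d))   # stand-in for the learned projection
embedded = tokens @ W_proj                             # (12, 32) embedded tokens
print(embedded.shape)
```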
2. Mathematical Formulation of Sequence Packing
Patch n’ Pack packs tokens from multiple images into sequences of fixed maximum length $L_{\max}$, processing them jointly in a single forward pass. Let $L_i$ denote the number of tokens for image $i$ (potentially after patch dropping). Packing proceeds by concatenating the token sequences from $n$ images such that:

$$\sum_{i=1}^{n} L_i \le L_{\max}.$$
Unused positions in the packed sequence are filled with a special PAD token. In practice, $B$ such sequences per minibatch are constructed, shaping the input tensor as $(B, L_{\max}, d)$.
Token-dropping: To accelerate training, an image-specific random drop rate $d_i \in [0, 1)$ reduces the token count of image $i$ from $L_i^{(0)}$ to:

$$L_i = L_i^{(0)} - \lfloor d_i\, L_i^{(0)} \rfloor.$$
The drop rates may be drawn from constant values or Beta distributions, or adapted dynamically as training progresses.
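As a sketch of per-image token dropping (the Beta parameters, names, and use of NumPy are assumptions for illustration, not the reference implementation):

```python
import numpy as np

def drop_tokens(tokens: np.ndarray, rng: np.random.Generator,
                a: float = 2.0, b: float = 5.0) -> np.ndarray:
    """Randomly drop a Beta(a, b)-sampled fraction of one image's tokens.

    tokens: (L_i0, d) embedded patch tokens for one image.
    Returns the surviving (L_i, d) tokens with L_i = L_i0 - floor(d_i * L_i0).
    """
    L_i0 = tokens.shape[0]
    d_i = rng.beta(a, b)                           # per-image drop rate
    n_drop = int(np.floor(d_i * L_i0))
    keep = rng.permutation(L_i0)[: L_i0 - n_drop]  # indices of surviving tokens
    return tokens[np.sort(keep)]                   # keep original token order

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 32))                # e.g. a 224x224 image at p = 16
print(drop_tokens(tokens, rng).shape)
```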
Self-attention masking: To preserve independence between images packed together, an additive attention mask $M$ is used:

$$M_{tu} = \begin{cases} 0 & \text{if tokens } t, u \text{ originate from the same image} \\ -\infty & \text{otherwise} \end{cases}$$
This mask is added to attention logits, ensuring self-attention occurs only within an image.
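A minimal sketch of building this mask from per-token image IDs (the ID convention and names are assumptions; packed pipelines may realize the mask differently):

```python
import numpy as np

def make_attention_mask(image_ids: np.ndarray) -> np.ndarray:
    """Build an additive (L_max, L_max) attention mask from per-token image IDs.

    image_ids: (L_max,) integer ID of each token's source image
               (a reserved ID, e.g. -1, marks PAD positions).
    Returns 0 where two tokens share an image, -inf elsewhere.
    """
    same_image = image_ids[:, None] == image_ids[None, :]
    return np.where(same_image, 0.0, -np.inf)

# Example: a packed sequence with 3 tokens of image 0, 2 of image 1, 1 PAD token.
ids = np.array([0, 0, 0, 1, 1, -1])
logits = np.zeros((6, 6)) + make_attention_mask(ids)  # added to attention logits
print(np.isfinite(logits).astype(int))                # block-diagonal attention pattern
```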
Pooling and loss: A masked [CLS]-style or attention-pooling head extracts one feature vector per image for downstream loss calculation (e.g., cross-entropy, distributed contrastive loss). Contrastive pretraining uses chunked loss computation to avoid quadratic scaling with the increased number of images per batch.
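The following sketch uses per-image mean pooling over the packed sequence as a simplified stand-in for NaViT's masked attention-pooling head (the function and the $-1$ PAD convention are assumptions for illustration):

```python
import numpy as np

def masked_mean_pool(tokens: np.ndarray, image_ids: np.ndarray, n_images: int) -> np.ndarray:
    """Pool a packed (L_max, d) token sequence into one (n_images, d) vector per image.

    Mean pooling is a simplified stand-in for attention pooling;
    PAD positions (image_ids == -1) are ignored.
    """
    pooled = np.zeros((n_images, tokens.shape[1]))
    for i in range(n_images):
        pooled[i] = tokens[image_ids == i].mean(axis=0)
    return pooled

tokens = np.random.default_rng(0).normal(size=(6, 4))
ids = np.array([0, 0, 0, 1, 1, -1])
print(masked_mean_pool(tokens, ids, n_images=2).shape)   # (2, 4): one vector per image
```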
3. Packing Algorithm and Pseudo-code
The Patch n’ Pack algorithm follows a two-phase process: preparing tokens with optional dropping and resolution sampling, followed by greedy packing and attention masking. The pseudo-code below summarizes the approach:
```
packed_sequences = []
current_seq, current_len = [], 0

for each image in minibatch:
    # Phase 1: per-image preparation
    R_i = R_sampler()                      # sample target resolution
    resize image (preserve aspect ratio, set area ≈ R_i^2)
    H_i, W_i = image shape
    L_i0 = (H_i / p) * (W_i / p)           # token count before dropping
    d_i = d_sampler()                      # sample drop rate
    drop floor(d_i * L_i0) patches at random; L_i = number remaining
    tokens_i = patch_embed(image)          # (L_i, d) surviving tokens

    # Phase 2: greedy packing
    if current_len + L_i <= L_max:
        append tokens_i to current_seq; record image boundaries
        current_len += L_i
    else:
        pad current_seq to L_max with PAD tokens
        packed_sequences.append(current_seq)
        current_seq, current_len = [tokens_i], L_i

if current_seq is non-empty:
    pad current_seq to L_max and append to packed_sequences
```
Unused positions are padded, and attention and pooling structures are adjusted according to the packed boundaries (Dehghani et al., 2023).
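For concreteness, the runnable NumPy sketch below implements just the greedy packing step under the assumption that every image fits in one sequence ($L_i \le L_{\max}$); the function name, PAD convention (zero vectors with image ID $-1$), and shapes are illustrative choices, not the reference implementation. It returns the packed $(B, L_{\max}, d)$ tensor together with the per-token image IDs that drive the attention mask and pooling head shown above:

```python
import numpy as np

def greedy_pack(token_seqs, L_max: int, d: int):
    """Greedily pack variable-length (L_i, d) token arrays into fixed-length sequences.

    Returns (packed, image_ids): packed has shape (B, L_max, d); image_ids has shape
    (B, L_max) holding the index of each token's source image, with -1 for PAD.
    """
    packed, image_ids = [], []
    cur_tokens, cur_ids, cur_len = [], [], 0

    def flush():
        pad = L_max - cur_len
        packed.append(np.concatenate(cur_tokens + [np.zeros((pad, d))], axis=0))
        image_ids.append(np.concatenate(cur_ids + [np.full(pad, -1)], axis=0))

    for img_idx, toks in enumerate(token_seqs):
        L_i = toks.shape[0]
        assert L_i <= L_max, "assumes each image fits within one packed sequence"
        if cur_len + L_i > L_max and cur_len > 0:   # sequence full: pad and start a new one
            flush()
            cur_tokens, cur_ids, cur_len = [], [], 0
        cur_tokens.append(toks)
        cur_ids.append(np.full(L_i, img_idx))
        cur_len += L_i
    if cur_len > 0:                                 # flush the last partial sequence
        flush()
    return np.stack(packed), np.stack(image_ids)

rng = np.random.default_rng(0)
seqs = [rng.normal(size=(L, 8)) for L in (20, 35, 50, 15)]   # four images, variable token counts
packed, ids = greedy_pack(seqs, L_max=64, d=8)
print(packed.shape, ids.shape)   # (3, 64, 8) (3, 64)
```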
4. Computational Complexity and Efficiency
Patch n’ Pack sequence packing introduces several efficiency and complexity considerations:
- Attention Overhead: Self-attention cost scales as $O(L_{\max}^2 \cdot d)$. Empirically, the compute increase over single-image ViT is less than 10% for large models, since the quadratic attention term accounts for a shrinking share of FLOPs relative to the growing MLP blocks (see the worked estimate after this list).
- Throughput: Standard ViT processes one image per forward pass ($L$ tokens). NaViT typically packs several images (up to 6) per sequence of $L_{\max}$ tokens while maintaining comparable wall-clock time per step. In JFT pretraining, this yields approximately five times greater image throughput per compute budget.
- Memory: The packed tensor avoids dynamic fragmentation. Additional overhead for attention masks is negligible. Padding is minimized (<2% of tokens) with appropriate choices of $L_{\max}$ and resolution sampling.
- Padding: Efficient packing and flexible image sizing result in minimal wasted compute for PAD tokens.
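To make the attention-overhead point concrete, the rough estimate below uses standard per-layer transformer FLOP approximations (the packed length and model widths are assumed values, not figures from the paper) to show that the $O(L_{\max}^2)$ attention terms account for a shrinking share of compute as model width grows:

```python
# Rough per-layer FLOP estimate (standard transformer approximations;
# multiply-accumulates counted as 2 FLOPs). Dimensions below are assumptions.
def attention_share(L: int, d: int, mlp_ratio: int = 4) -> float:
    """Fraction of per-layer FLOPs spent on the O(L^2) attention terms."""
    proj = 2 * 4 * L * d * d                 # Q, K, V and output projections
    mlp = 2 * 2 * mlp_ratio * L * d * d      # two MLP matmuls
    attn = 2 * 2 * L * L * d                 # QK^T logits and attention-weighted values
    return attn / (proj + mlp + attn)

L_max = 1024                                  # packed sequence length (assumed)
for d in (768, 1024, 1664):                   # ViT-B / ViT-L / ViT-g-like widths
    print(f"d={d}: quadratic attention is {attention_share(L_max, d):.1%} of layer FLOPs")
```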
Summary Table: Packing vs Standard ViT
| Aspect | Standard ViT | Patch n’ Pack (NaViT) |
|---|---|---|
| Sequence length | $L$ (fixed resolution) | $L_{\max}$ (packed) |
| Images per pass | $1$ | several (up to 6) |
| Attention cost | $O(L^2 d)$ | $O(L_{\max}^2 d)$ (~10% extra) |
| Padding overhead | $0$ | $<2\%$ of tokens |
5. Pipeline Modifications for Packed Sequences
Integrating Patch n’ Pack requires minor but essential changes to the ViT pipeline:
- Masked Self-Attention: The attention softmax incorporates the per-sequence masks described above to prevent attention between tokens of different images.
- Masked Pooling Head: Pooling heads (CLS or attention-pooling) must recognize sequence boundaries, producing one vector per image.
- Factorized Positional Embeddings: Single-image ViTs employ 1D positional embeddings over a fixed patch grid. For variable aspect ratios and resolutions, 2D factorized positional embeddings are used:

$$e_{\text{pos}}(x, y) = \phi_x(x) + \phi_y(y),$$

where $(x, y)$ are normalized token coordinates and $\phi_x, \phi_y$ are learnable or sinusoidal mappings into $\mathbb{R}^d$. This formulation generalizes to unseen resolutions and aspect ratios, supporting smooth scaling and extrapolation (a minimal sketch follows below).
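A minimal sketch of the learned factorized variant (NumPy; the bucketed lookup of normalized coordinates, table size, and names are simplifying assumptions rather than the NaViT parameterization):

```python
import numpy as np

class FactorizedPosEmbed:
    """Learned factorized 2D positional embeddings: e_pos(x, y) = phi_x(x) + phi_y(y).

    Normalized coordinates in [0, 1) are mapped to one of `n_buckets` learned
    embeddings per axis (a simplification; interpolation or sinusoidal phi are
    equally valid choices).
    """
    def __init__(self, d: int, n_buckets: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.phi_x = rng.normal(scale=0.02, size=(n_buckets, d))  # learnable in practice
        self.phi_y = rng.normal(scale=0.02, size=(n_buckets, d))
        self.n_buckets = n_buckets

    def __call__(self, h_patches: int, w_patches: int) -> np.ndarray:
        """Return (h_patches * w_patches, d) positional embeddings for one image."""
        ys, xs = np.meshgrid(np.arange(h_patches), np.arange(w_patches), indexing="ij")
        # Normalize coordinates to [0, 1) so any resolution maps into the same tables.
        x_idx = (xs.ravel() / w_patches * self.n_buckets).astype(int)
        y_idx = (ys.ravel() / h_patches * self.n_buckets).astype(int)
        return self.phi_x[x_idx] + self.phi_y[y_idx]

pos = FactorizedPosEmbed(d=32)
print(pos(4, 3).shape)   # a 4x3 patch grid -> (12, 32); works for any aspect ratio
```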
At inference, the framework supports arbitrary choice of image resolution and packing structure, enabling performance-cost trade-offs without architectural changes.
6. Empirical Evaluation and Trade-Offs
Experiments reveal several notable trade-offs and performance characteristics:
- Pre-training Efficiency: NaViT-B/16 reaches the peak JFT-pretraining performance of ViT-B/16 in substantially fewer TPU-hours than the ViT-B/16 baseline requires, mainly by seeing more images per unit compute via packing and token dropping.
- Variable Resolution Training: Sampling image resolutions while preserving aspect ratio outperforms fixed-resolution baselines across compute budgets. At inference, NaViT allows for smooth cost-accuracy calibration based on input resolution.
- Token-Dropping Strategies: Drawing per-image drop rates from a Beta distribution and decaying the drop rate over training yields accuracy gains of roughly $0.5$ points or more at constant compute.
- Downstream and OOD Robustness:
  - On zero-shot and linear-probe ImageNet, NaViT surpasses compute-matched ViT by roughly $2$ points or more.
  - On robustness datasets (ImageNet-A, ObjectNet), substantial gains are observed.
  - Calibration error remains stable as the number of tokens per image at inference varies from 128 to 1024.
  - For semantic segmentation (ADE20k), gains of up to $2$ mIoU are obtained at matched finetuning FLOPs.
  - For detection (LVIS rare classes), NaViT improves AP over a comparable ViT-L/14 backbone.
  - On FairFace and CelebA attribute prediction, high label accuracy is achieved with a frozen NaViT encoder.
  - In video classification (Kinetics400), NaViT-L matches ViViT-L (80.4%) with fewer training epochs.
- Inference Cascades: Using low-token (low-resolution) models to filter easy examples and cascading only the harder fraction to higher-token models yields Pareto-optimal latency/accuracy trade-offs (a sketch of such a cascade follows this list).
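As an illustration of such a cascade (a hypothetical sketch: the confidence-threshold rule, the stand-in models, and the threshold value are assumptions, not the paper's exact procedure):

```python
import numpy as np

def cascade_predict(images, cheap_model, expensive_model, threshold: float = 0.8):
    """Two-stage inference cascade.

    cheap_model / expensive_model: callables returning (n, n_classes) class probabilities,
    e.g. the same NaViT weights run at a low vs. high token count per image.
    Easy examples (max probability >= threshold) keep the cheap prediction; the rest
    are re-run through the expensive model.
    """
    probs = cheap_model(images)
    preds = probs.argmax(axis=1)
    hard = probs.max(axis=1) < threshold
    if hard.any():
        preds[hard] = expensive_model(images[hard]).argmax(axis=1)
    return preds, hard.mean()   # predictions and the fraction escalated

# Toy usage with stand-in "models" that return random probabilities.
rng = np.random.default_rng(0)
fake = lambda x: rng.dirichlet(np.ones(10), size=len(x))
images = np.zeros((32, 224, 224, 3))
preds, escalated = cascade_predict(images, fake, fake)
print(preds.shape, f"{escalated:.0%} escalated")
```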
The approach integrates with minimal effort into existing ViT pipelines, requiring only data-loader changes (for sequence packing, resolution sampling, and drop-rate sampling), attention masking, and factorized positional embeddings. The resulting framework delivers roughly $3\times$ or more images per compute hour, enhanced transfer effectiveness, improved robustness, and flexible test-time configuration (Dehghani et al., 2023).
7. Applications and Implications
Patch n’ Pack is compatible with supervised and contrastive image-text pretraining and applies to tasks including image and video classification, object detection, semantic segmentation, and fairness or robustness benchmarking. Its ability to handle native image resolutions and manage cost-accuracy trade-offs at inference provides a flexible tool for practical deployment. The architectural changes are limited in scope, enabling broad applicability for ViT-style models. This approach marks a shift from the fixed-resolution, CNN-inspired pipelines towards a more adaptive, sequence-based vision modeling paradigm. A plausible implication is the increasing relevance of sequence-packing and patch-level model design for scalable, heterogeneous vision workloads (Dehghani et al., 2023).