TokenPacker: Efficient Token Packing
- TokenPacker is a family of methods for efficiently compressing token sequences, minimizing redundancy and computational overhead in neural architectures.
- It utilizes strategies like coarse-to-fine visual packing, bin-packing in NLP, and gated selection in vision transformers to retain critical information.
- Empirical results show up to 89% token reduction and significant FLOP savings while maintaining performance across multimodal and planning tasks.
TokenPacker refers to a family of methods and algorithms for efficient token sequence packing, compaction, and projection, designed to maximize computational efficiency for modern neural architectures. Usage spans multimodal LLMs (MLLMs), sequence modeling in LLM pretraining, context-aware packing for vision transformers, semantics-preserving packetization for token communication, and extreme compression for decision-time planning in world models. TokenPacker, as a term, describes approaches that compress or reorganize token representations while aiming to minimize information loss and preserve functional equivalence or utility.
1. Motivation: Efficiency Challenges and Redundancy in Token Representations
Transformer architectures incur substantial memory and compute costs because self-attention scales quadratically with token count, $O(N^2)$, for both text and visual inputs. In MLLMs, visual encoders (e.g., CLIP-ViT) generate a large number of patch embeddings, which, when projected one-to-one to LLM tokens via a simple MLP, result in prohibitively high resource utilization as image resolution, and thus $N$, grows. For example, a $1024 \times 1024$ pixel image split into $p \times p$ patches yields $(1024/p)^2$ tokens, resulting in roughly 100x more tokens than typical LLM text context and thus causing severe inefficiency. Similarly, sequence models in NLP frequently pad variable-length sequences to a uniform batch length, where 50–89% of tokens can be padding with no semantic information, imposing unnecessary computational overhead (Krell et al., 2021). Vision transformers suffer analogous inefficiencies, processing both informative and background tokens with equal attention. These challenges motivate token packing, selection, and compaction strategies (Li et al., 2024, Zhang et al., 2024, Kim et al., 5 Mar 2026).
2. Architectures and Methodologies for Token Packing
2.1 Multimodal LLMs: Coarse-to-Fine Visual Token Packing
TokenPacker, as introduced in the context of MLLMs, applies a coarse-to-fine scheme for visual token projection (Li et al., 2024). The pipeline comprises:
- Coarse Query Initialization: Bilinear interpolation downsamples the high-resolution CLIP feature map of $N = H \times W$ tokens, yielding $N/s^2$ low-resolution "point queries" $q_i$, where $s$ is the downsampling factor.
- Region-to-Point Injection: The original high-resolution feature map is partitioned into $N/s^2$ local $s \times s$ regions, forming region keys $K_i$ and values $V_i$ (potentially multi-level, using features from multiple CLIP layers). Each coarse query $q_i$ is updated via cross-attention restricted to its corresponding region:
$$q_i \leftarrow \mathrm{softmax}\!\left(\frac{q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
where $d_k$ is the key dimension. Spatial encoding and masking enforce strict locality.
- Final Projection: The enriched queries are mapped through an MLP to yield the condensed set of $N/s^2$ visual tokens for consumption by the LLM.
- Compression Ratio: $1/s^2$.
This design achieves 75–89% token reduction with minimal or no loss in accuracy—sometimes even improving performance via structured retention of fine-grained detail.
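The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: average pooling stands in for bilinear interpolation, the learned query/key/value projections, multi-level features, and final MLP are omitted, and the function name `coarse_to_fine_pack` is hypothetical.

```python
import numpy as np

def coarse_to_fine_pack(feat, s):
    """Pack an (H, W, C) feature map into (H/s * W/s, C) tokens.

    Assumptions: s divides H and W; average pooling replaces bilinear
    interpolation; no learned projections or MLP are applied.
    """
    H, W, C = feat.shape
    if H % s or W % s:
        raise ValueError("s must divide both H and W")
    # Partition the map into (H/s * W/s) regions of s*s tokens each.
    regions = feat.reshape(H // s, s, W // s, s, C).transpose(0, 2, 1, 3, 4)
    regions = regions.reshape(-1, s * s, C)            # (M, s*s, C)
    # Coarse point queries: one per region.
    queries = regions.mean(axis=1)                     # (M, C)
    # Region-restricted cross-attention: query m only sees region m.
    scores = np.einsum("mc,mkc->mk", queries, regions) / np.sqrt(C)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("mk,mkc->mc", weights, regions)   # (M, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((24, 24, 16))   # 576 tokens on a 24x24 grid
packed = coarse_to_fine_pack(feat, s=2)    # 144 tokens, i.e. 1/s^2 of the input
```

Because each query attends only to its own $s \times s$ region, the attention cost of this step grows linearly with the number of input tokens rather than quadratically.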
2.2 Sequence Packing in NLP: Bin-Packing Formulation
The classic sequence packing problem is formalized as a variant of the bin packing integer program:
- Variables:
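- $n$ input sequences of lengths $l_1, \dots, l_n$.
- Bin capacity $C$ (pack length).
- Assignment variables $x_{ij} \in \{0, 1\}$ (sequence $i$ placed in bin $j$), indicator $y_j \in \{0, 1\}$ for bin $j$ being used.
- (Optional) limit $d$ on sequences per pack.
- Objective and constraints:
$$\min \sum_{j} y_j \quad \text{s.t.} \quad \sum_{j} x_{ij} = 1 \;\; \forall i, \qquad \sum_{i} l_i\, x_{ij} \le C\, y_j \;\; \forall j, \qquad \sum_{i} x_{ij} \le d \;\; \forall j$$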
- Algorithms:
- Greedy ("T5 style") streaming.
- Shortest-Pack-First Histogram Packing (SPFHP), a greedy heuristic that uses a min-heap to pack the length histogram.
- Non-negative Least Squares Histogram Packing (NNLSHP), leveraging a combinatorial enumeration of all admissible packings for extremely high packing efficiency (99.7% token utilization) (Krell et al., 2021).
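To make the greedy idea concrete, here is a simplified shortest-pack-first heuristic in Python. It is a sketch only: it works on individual sequences rather than on a length histogram, it ignores the optional per-pack limit, and it is not the exact SPFHP or NNLSHP procedure from Krell et al.; the function name `pack_sequences` is hypothetical.

```python
import heapq

def pack_sequences(lengths, capacity):
    """Greedily assign sequences to packs of at most `capacity` tokens.

    Sequences are visited longest first. Each one goes into the pack that
    is currently shortest if it fits there; otherwise a new pack is opened.
    Returns a list of packs, each a list of sequence indices.
    """
    packs = []   # packs[p] holds the indices of the sequences in pack p
    heap = []    # min-heap of (tokens_used, pack_id)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"sequence {i} is longer than the pack capacity")
        if heap and heap[0][0] + n <= capacity:
            used, pid = heapq.heappop(heap)
            packs[pid].append(i)
            heapq.heappush(heap, (used + n, pid))
        else:
            packs.append([i])
            heapq.heappush(heap, (n, len(packs) - 1))
    return packs

lengths = [5, 3, 8, 2, 7, 4, 1]
packs = pack_sequences(lengths, capacity=10)
padded_utilization = sum(lengths) / (len(lengths) * 10)   # one sequence per row
packed_utilization = sum(lengths) / (len(packs) * 10)
```

On this toy input the seven sequences fit into three packs of ten tokens each, so real-token utilization rises from about 43% with padding to 100% with packing.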
2.3 Context-Aware Visual Token Packing
Vision transformers with Select and Pack Attention (SPA) employ a supervised gating layer to identify and select informative tokens within each image, based on learned selection scores. The selected token subset is packed into fixed-length packages via an efficient gather/scatter operation. Batch-parallelism and self-attention are preserved by applying a block mask in multi-head attention, limiting each token's attention to tokens originating from the same image (or class) (Zhang et al., 2024).
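The select-then-pack step can be sketched with NumPy as follows. This is an illustrative sketch, not the SPA implementation: the gating scores are taken as given inputs rather than produced by a trained gating layer, a fixed threshold is assumed as the selection rule, and the names `select_and_pack`, `threshold`, and `package_len` are hypothetical.

```python
import numpy as np

def select_and_pack(tokens_per_image, scores_per_image, threshold, package_len):
    """Keep tokens whose score exceeds `threshold` and pack them into
    fixed-length packages, with a block mask that keeps images separate."""
    kept, owners = [], []
    for img_id, (tok, sc) in enumerate(zip(tokens_per_image, scores_per_image)):
        keep = sc > threshold
        kept.append(tok[keep])                          # gather selected tokens
        owners.append(np.full(int(keep.sum()), img_id))
    packed = np.concatenate(kept)
    owner = np.concatenate(owners)
    # Pad up to a multiple of package_len; padding slots get owner id -1.
    pad = (-len(packed)) % package_len
    packed = np.pad(packed, ((0, pad), (0, 0)))
    owner = np.pad(owner, (0, pad), constant_values=-1)
    packages = packed.reshape(-1, package_len, packed.shape[1])
    owner = owner.reshape(-1, package_len)
    # Token i may attend to token j only if both come from the same image.
    mask = (owner[:, :, None] == owner[:, None, :]) & (owner[:, :, None] >= 0)
    return packages, owner, mask

rng = np.random.default_rng(0)
tokens = [rng.standard_normal((6, 4)), rng.standard_normal((6, 4))]
scores = [np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3]),
          np.array([0.6, 0.4, 0.1, 0.9, 0.95, 0.2])]
packages, owner, mask = select_and_pack(tokens, scores, threshold=0.5, package_len=4)
```

Here three tokens survive from each image, so the six selected tokens fill two packages of length four, with the second image straddling the package boundary and two padding slots masked out.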
2.4 Planning with CompACT: Extreme Token Compression
The CompACT tokenizer projects high-dimensional observations to as few as 8 discrete tokens per frame by leveraging a frozen vision backbone, learnable latent resamplers, and Finite Scalar Quantization (FSQ). This procedure forces retention of only task-essential semantics, achieving an order-of-magnitude computational reduction for world-model-based planning tasks with minimal loss in planning utility (Kim et al., 5 Mar 2026).
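The quantization step can be illustrated with a minimal Finite Scalar Quantization sketch. It covers inference-time rounding only and makes simplifying assumptions: every dimension uses an odd number of levels (even level counts need an extra offset), the straight-through gradient used during training is omitted, and the level configuration `[7, 5, 5]` is an arbitrary example rather than the CompACT setting.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Bound each latent dimension with tanh and round to integer codes.

    With an odd number of levels L, codes lie in [-(L-1)/2, (L-1)/2].
    """
    half = (np.asarray(levels) - 1) / 2
    return np.round(np.tanh(z) * half).astype(np.int64)

def codes_to_indices(codes, levels):
    """Map per-dimension codes to a single token id (mixed-radix encoding)."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2                    # shift to 0..L-1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

levels = [7, 5, 5]                          # implicit codebook of 7*5*5 = 175 entries
rng = np.random.default_rng(0)
latents = 2.0 * rng.standard_normal((8, 3)) # 8 latent slots per frame
codes = fsq_quantize(latents, levels)
token_ids = codes_to_indices(codes, levels) # 8 discrete tokens for the frame
```

The codebook is implicit in the level grid, so no embedding table has to be learned for the quantizer itself.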
3. Information Preservation and Attention Masking Strategies
One common challenge in all TokenPacker variants lies in avoiding "cross-contamination"—where information is mixed between originally separate samples or semantic units—during the packed sequence's self-attention operations.
- Block-Diagonal Masking: For NLP or image token packing, each token tracks its source sequence or image-of-origin, enforcing a block-diagonal self-attention mask so that no token attends outside its segment (Krell et al., 2021, Zhang et al., 2024).
- Spatially-Localized Attention: In TokenPacker for MLLMs, region-to-point injection restricts attention to within-local regions, allowing high-frequency detail transfer without collapsing spatial structure (Li et al., 2024).
- Supervised Selection: SPA leverages explicit supervision (e.g., from bounding box or mask annotations) to train the gating layer, thereby aligning selected tokens with information-dense regions (Zhang et al., 2024).
- Preservation of Positional Information: Packed models often require careful treatment of positional encoding. Rather than simple bias-add schemes that assume fixed-length input, packed implementations use lookup tables or modular arithmetic to maintain positional integrity per segment, guaranteeing equivalence with the unpacked model.
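Block-diagonal masking and per-segment positions can both be derived from the lengths of the packed sequences. The sketch below assumes sequences are concatenated back to back within one pack; the function name `packed_mask_and_positions` is hypothetical.

```python
import numpy as np

def packed_mask_and_positions(seq_lengths):
    """Build segment ids, a block-diagonal attention mask, and position ids
    that restart at 0 for every sequence in the pack."""
    seq_lengths = list(seq_lengths)
    segment_ids = np.repeat(np.arange(len(seq_lengths)), seq_lengths)
    # True where query and key tokens belong to the same original sequence.
    mask = segment_ids[:, None] == segment_ids[None, :]
    # Offset of each sequence inside the pack, repeated per token.
    starts = np.cumsum([0] + seq_lengths[:-1])
    position_ids = np.arange(segment_ids.size) - np.repeat(starts, seq_lengths)
    return segment_ids, mask, position_ids

segment_ids, mask, position_ids = packed_mask_and_positions([3, 2])
```

For a pack holding sequences of lengths 3 and 2, the mask has a 3x3 block and a 2x2 block on the diagonal, and the position ids are `[0, 1, 2, 0, 1]`, which is what each sequence would see if processed on its own.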
4. Computational Benefits and Theoretical Analysis
Comprehensive complexity and efficiency analysis reveals substantial FLOP and memory savings:
| Method | Visual Tokens | FLOPs Scaling | Packing Efficiency |
|---|---|---|---|
| MLP Baseline | $N$ | $O(N^2)$ | ~50% w/padding |
| TokenPacker (MLLM) | $N/s^2$ | $O(M^2)$, $M = N/s^2$ | 75–89% reduction |
| NNLSHP (NLP) | -- | -- | 99.7% utilization |
| SPA (ViT) | 23–30% of original | up to 16.4% fewer FLOPs | 10–16% cost saving |
| CompACT (planner) | 8–16 vs 784 | order-of-magnitude reduction | lower end-to-end latency |
In MLLMs, TokenPacker with downsampling factor $s$ compresses $N$ visual tokens to $N/s^2$, reducing attention complexity over those tokens by a factor of $s^4$, which exceeds two orders of magnitude at $s = 4$ (Li et al., 2024). Model equivalence proofs show that correct application of packing and masking yields results indistinguishable from unpacked models (Krell et al., 2021). SPA reduces FLOPs by 16.4% (Swin-B backbone, BDD100K), with a reported mAP gain (Zhang et al., 2024). In world model rollouts, CompACT compresses state representations to 8 tokens, reducing end-to-end latency for navigation and manipulation planning (Kim et al., 5 Mar 2026).
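The arithmetic behind these savings is easy to check. The sketch below assumes a compression ratio of $1/s^2$ for downsampling factor $s$ and counts only pairwise attention among visual tokens; text tokens and projector cost are ignored, and `packing_savings` is a hypothetical helper name.

```python
def packing_savings(s):
    """Return (fraction of visual tokens removed, reduction factor in
    pairwise attention scores) for downsampling factor s."""
    token_fraction = 1 / s**2            # N/s^2 tokens remain out of N
    attention_factor = s**4              # N^2 / (N/s^2)^2
    return 1 - token_fraction, attention_factor

savings = {s: packing_savings(s) for s in (2, 3, 4)}
```

For $s = 2$ this gives a 75% token reduction and 16x fewer attention pairs; for $s = 3$, about 89% and 81x; for $s = 4$, about 94% and 256x.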
5. Empirical Results and Task-Specific Performance
Experiments across vision-language understanding, OCR, general vision benchmarks, NLP pretraining, and planning demonstrate consistently strong or improved performance at drastically reduced token and compute load:
- MLLMs (TokenPacker): With downsampling factor $s = 2$, Vicuna-7B achieved 62.8% average accuracy over 12 benchmarks using 75% fewer tokens relative to baseline (baseline: 62.0%). At $s = 3$ (89% reduction), accuracy dropped by only 1.4% (Li et al., 2024).
- NLP Pretraining (Packed BERT): NNLSHP with depth 3 yielded 99.7% real token utilization and nearly $2\times$ throughput, with negligible downstream degradation on SQuAD 1.1 (Krell et al., 2021).
- Vision Transformers (SPA/TokenPacker): Swin-T with SPA achieves higher mAP or Top-1 accuracy with only 23–30% of the original tokens and 10–16% lower FLOPs (Zhang et al., 2024).
- World Models (CompACT): Navigation and manipulation tasks see lower latency with only minor increases in trajectory error compared to models using hundreds of tokens (Kim et al., 5 Mar 2026).
6. Implementation Details, Practical Guidelines, and Limitations
Implementation requires minimal architectural changes beyond standard data pipelines:
- Model Modifications: Addition of block-masks for attention layers, positional index tracking, and optional gating modules for token selection (Krell et al., 2021, Zhang et al., 2024).
- Data Preprocessing: Generation of length histograms (NLP), gating labels (vision), or learnable queries (CompACT) (Li et al., 2024, Kim et al., 5 Mar 2026).
- Hyperparameters: Downsampling factors ($s$ for MLLMs), pack sizes (for NLP/Vision), population/beam width for genetic methods, and gating thresholds.
- Training Protocols: Packing generally improves efficiency, but extreme compaction (e.g., the most aggressive token budgets in MLLMs) may result in a performance drop, exposing a trade-off between information bottleneck and efficiency (Li et al., 2024).
- Limitations: Scaling genetic/beam search methods remains challenging for large problem sizes in token communication (Lee et al., 28 Apr 2025), and achieving optimal region granularity or adaptive selection in vision is still an open research problem (Zhang et al., 2024).
7. Open Questions and Future Directions
Research in token packing continues to explore:
- Adaptive and Learnable Packing: Dynamic region proposals, adaptive region sizes, or controller networks for variable package length (Li et al., 2024, Zhang et al., 2024).
- End-to-End Optimization: Joint training of packing modules (e.g., TokenPacker) with the downstream model, allowing learned adaptation to token utility.
- Advanced Gating and Masking Strategies: Dynamic layer selection, hierarchical token merging, and cross-modal region-to-point injection for richer semantics.
- Extension to Video and 3D Data: Temporal or spatiotemporal packing, integrating tracking information or 3D occupancy for more efficient multimodal learning (Li et al., 2024, Zhang et al., 2024).
- Robustness and Semantics in Noisy Channels: Advanced semantics-aware packetization (e.g., SemPA-GBeam) to hedge against erasures and reconstruct maximal task-relevant information (Lee et al., 28 Apr 2025).
TokenPacker approaches exploit locality, context-aware selection, and masking to support scalable, efficient deployment of large-scale models across diverse domains.