Object-Centric Token Merging
- Object-centric token merging is a paradigm that aggregates tokens by grouping semantically meaningful entities using segmentation masks, slot attention, or saliency cues.
- It employs diverse strategies such as segmentation-based pooling, slot-based merging, and saliency-driven methods to compress token sequences while preserving spatial and semantic integrity.
- Empirical results demonstrate significant reductions in computation and memory—with up to 16× FLOP reduction—while maintaining high performance across vision, video, and robotic applications.
Object-centric token merging is a paradigm in transformer-based models—both in vision and multimodal architectures—in which tokens representing spatial regions, features, or entities are aggregated specifically according to object-level or semantically meaningful groupings, rather than by generic feature similarity or patch location. This operation reduces computational and memory demands, mitigates representational redundancy, and produces more semantically aligned token sequences by mapping each perceptual object (or agent entity) to a single, representative merged token. Distinguished from classical linear patch pruning, object-centric token merging is characterized by its use of segmentation masks, attention-based object saliency, or learned slot overlap to guide merge decisions, often with adaptivity to content and end-to-end differentiability.
1. Principles and Theoretical Basis
The core principle behind object-centric token merging is to exploit the natural grouping of visual input into discrete entities, a process that mirrors human visual perception, which first segments a scene into objects and then focuses attention for further reasoning (Lei et al., 18 Nov 2025). Conventional transformers process images or videos as uniform grids of patch tokens, so the token count grows quadratically with image side length (e.g., $64 \times 64 = 4096$ tokens for a 1024×1024 image with 16×16 patches). Operations such as self-attention, with $O(N^2)$ time and memory in the number of tokens $N$, become a severe bottleneck. Existing patch-centric merging (e.g., ToMe) leverages token similarity, but is prone to losing high-level semantics or spatial structure.
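To make the scaling argument concrete, the short sketch below counts patch tokens and the entries of a single attention score matrix at a few resolutions. The function names are illustrative, and the count assumes non-overlapping square patches with no class or register tokens.

```python
def patch_token_count(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens (no class token assumed)."""
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Entries in one self-attention score matrix (one head, one layer)."""
    return num_tokens * num_tokens


for side in (256, 512, 1024):
    n = patch_token_count(side, side, 16)
    print(f"{side}x{side}: {n} tokens, {attention_pairs(n)} attention entries")
```

Doubling the side length multiplies the token count by 4 and the attention matrix size by 16, which is why reducing the number of tokens pays off so strongly.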
Object-centric token merging remedies these failures by explicitly mapping each object, agent, or salient region into a single token via:
- Semantic mask aggregation: using segmentation masks (from models like Mask2Former or SAM) to identify and pool all patch embeddings for each object (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
- Slot-based overlap fusion: merging tokens with highly overlapping slot-attention regions (high Soft-IoU), consolidating fragmented object representation (Chatzisavvas et al., 11 Mar 2026).
- Saliency-driven or agent-centric stratification: guiding merge operations by local feature saliency, entropy sharpness, or robot gripper localization (Choi et al., 2024, Bendikas et al., 28 Sep 2025).
The general hypothesis is that this yields more efficient, semantically coherent tokenization aligned with downstream object-centric reasoning.
2. Algorithmic Strategies and Implementations
Object-centric token merging encompasses a spectrum of operational strategies, differing in their source of semantic partitioning, merge algebra, and scheduling:
Slot-based Merging (Slot Attention)
“Slot merging” addresses over-segmentation in slot-based object-centric learning (OCL), especially where multiple slots compete for the same object due to overlapping attention (Chatzisavvas et al., 11 Mar 2026). The protocol is as follows:
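- For each pair of slots, compute the soft-IoU of their attention masks.
- Iteratively merge the pair of slots with maximal soft-IoU above a threshold via barycentric averaging (a convex combination of the two slot vectors), updating the mask of the merged slot from the masks of the two source slots.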
- The process is end-to-end differentiable due to gradient flow through the merge weights. The threshold is set in a data-driven way using the histogram triangle method.
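The merge loop above can be sketched in NumPy as follows. This is a forward-pass illustration only, not the cited implementation: the soft-IoU form (sum of element-wise minima over sum of maxima), the mass-proportional barycentric weights, and the clipped-sum mask update are all assumptions, and the threshold `tau` is passed in by hand rather than derived with the triangle method.

```python
import numpy as np


def soft_iou(masks: np.ndarray) -> np.ndarray:
    """Pairwise soft-IoU for masks of shape (K, P) with values in [0, 1].

    Assumed form: sum(min(m_i, m_j)) / sum(max(m_i, m_j)).
    """
    inter = np.minimum(masks[:, None, :], masks[None, :, :]).sum(-1)
    union = np.maximum(masks[:, None, :], masks[None, :, :]).sum(-1)
    return inter / np.maximum(union, 1e-8)


def merge_slots(slots: np.ndarray, masks: np.ndarray, tau: float):
    """Greedily merge the most-overlapping slot pair while soft-IoU > tau."""
    slots = np.asarray(slots, dtype=float).copy()
    masks = np.asarray(masks, dtype=float).copy()
    while len(slots) > 1:
        iou = soft_iou(masks)
        np.fill_diagonal(iou, -1.0)  # ignore self-overlap
        i, j = np.unravel_index(np.argmax(iou), iou.shape)
        if iou[i, j] <= tau:
            break
        # Assumed barycentric weights: proportional to each mask's mass.
        mass_i, mass_j = masks[i].sum(), masks[j].sum()
        w_i = mass_i / max(mass_i + mass_j, 1e-8)
        merged_slot = w_i * slots[i] + (1.0 - w_i) * slots[j]
        # Assumed mask update: sum of both masks, clipped to [0, 1].
        merged_mask = np.clip(masks[i] + masks[j], 0.0, 1.0)
        keep = np.ones(len(slots), dtype=bool)
        keep[i] = False
        keep[j] = False
        slots = np.vstack([slots[keep], merged_slot[None]])
        masks = np.vstack([masks[keep], merged_mask[None]])
    return slots, masks


# Two slots split the same object (positions 0-1); a third covers another.
example_slots = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
example_masks = np.array([[0.9, 0.8, 0.0, 0.0],
                          [0.8, 0.9, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 1.0]])
new_slots, new_masks = merge_slots(example_slots, example_masks, tau=0.5)
print(new_slots.shape)
```

In a trainable pipeline the same operations would be written in an autodiff framework so that gradients can flow through the merge weights.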
Segmentation-based Merging (Object-centric Pooling)
CORE and AdaTok both employ high-quality segmentation models:
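- For each mask $M_k$, form the merged token as a mask-weighted average of the patch embeddings $x_i$, $t_k = \sum_i w_{k,i}\, x_i / \sum_i w_{k,i}$, with weights $w_{k,i} \in [0, 1]$ for soft aggregation, or hard assignment $w_{k,i} \in \{0, 1\}$ for exclusive tokenization (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).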
- Merged tokens are mapped to the LLM input space; centroids are used for positional coherence (centroid-guided sorting).
- The number of tokens adapts per image to the number of objects, achieving adaptive-rate compression.
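A minimal sketch of mask-based pooling with centroid-guided ordering is shown below. It assumes the masks are already available as an array (no segmenter is called), uses a normalized weighted mean as the pooling rule, and orders tokens in raster order of their centroids; the function name `pool_objects` and these choices are illustrative rather than taken from CORE or AdaTok.

```python
import numpy as np


def pool_objects(patches: np.ndarray, masks: np.ndarray, hard: bool = False):
    """Pool patch embeddings into one token per mask.

    patches: (H, W, D) patch embeddings.
    masks:   (K, H, W) mask scores in [0, 1].
    Returns (tokens, centroids), both sorted by centroid row, then column.
    """
    H, W, D = patches.shape
    weights = np.asarray(masks, dtype=float)
    if hard:
        # Exclusive assignment: each covered patch goes to its top mask.
        ids = np.arange(len(weights))[:, None, None]
        owner = weights.argmax(axis=0)
        covered = weights.max(axis=0) > 0
        weights = ((owner[None] == ids) & covered[None]).astype(float)
    flat_w = weights.reshape(len(weights), -1)            # (K, H*W)
    flat_x = np.asarray(patches, dtype=float).reshape(-1, D)
    mass = np.maximum(flat_w.sum(axis=1, keepdims=True), 1e-8)
    tokens = flat_w @ flat_x / mass                        # weighted mean
    ys, xs = np.mgrid[0:H, 0:W]
    cy = (flat_w * ys.reshape(-1)).sum(axis=1) / mass[:, 0]
    cx = (flat_w * xs.reshape(-1)).sum(axis=1) / mass[:, 0]
    order = np.lexsort((cx, cy))  # primary key: row, secondary: column
    return tokens[order], np.stack([cy, cx], axis=1)[order]


# 2x2 grid with 1-D embeddings; mask 0 is the bottom row, mask 1 the top.
grid = np.array([[[1.0], [2.0]], [[3.0], [4.0]]])
obj_masks = np.array([[[0, 0], [1, 1]], [[1, 1], [0, 0]]], dtype=float)
obj_tokens, obj_centroids = pool_objects(grid, obj_masks)
print(obj_tokens[:, 0], obj_centroids)
```

The output length equals the number of masks, so the token count follows the number of objects in the image rather than the grid resolution.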
Saliency and Video-centric Methods
Saliency-aware token merging methods (vid-TLDR and Object-Centric Diffusion) use early attention maps or predefined masks to compute object/background saliency:
- Entropy/“sharpness” of attention rows is used to score foreground tokens; only high-sharpness tokens are kept/merged (Choi et al., 2024).
- Token merging is preferentially performed in background, or within spatio-temporal blocks in video, often restricting merges to maintain fidelity in edited or important regions (Kahatapitiya et al., 2024).
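The entropy-based scoring described above can be illustrated with a small sketch. It assumes a single row-stochastic attention matrix with at least two tokens and defines sharpness as one minus the normalized row entropy; the exact saliency formula used by vid-TLDR may differ, and `keep_ratio` is a hypothetical parameter.

```python
import numpy as np


def attention_sharpness(attn: np.ndarray) -> np.ndarray:
    """Per-row sharpness in [0, 1]: 1 for one-hot rows, 0 for uniform rows.

    attn: (N, N) attention matrix whose rows sum to 1, with N >= 2.
    """
    n = attn.shape[-1]
    p = np.clip(attn, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=-1)
    return 1.0 - entropy / np.log(n)


def keep_salient(attn: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Indices of the sharpest tokens, returned in their original order."""
    scores = attention_sharpness(attn)
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(-scores)[:k])


attn_example = np.array([[1.0, 0.0, 0.0],      # one-hot row: very sharp
                         [1 / 3, 1 / 3, 1 / 3],  # uniform row: not sharp
                         [0.5, 0.5, 0.0]])      # in between
sharp = attention_sharpness(attn_example)
kept = keep_salient(attn_example, keep_ratio=2 / 3)
print(sharp, kept)
```

Tokens that are not kept would then be candidates for merging, for example among background tokens only.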
Token Merging as Semantic Binding in Text-to-Image
For text-to-image diffusion, composite tokens are constructed by aggregating CLIP embeddings of entity and modifier tokens, yielding representations with shared cross-attention maps. Auxiliary losses (entropy, semantic consistency) refine the merged token for binding integrity (Hu et al., 2024).
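As a rough illustration of composite-token construction, the sketch below adds the embeddings of modifier tokens into the entity token and removes the modifier positions. Element-wise summation and dropping the modifier rows are assumptions made for this example, the embeddings are stand-in arrays rather than real CLIP outputs, and the auxiliary losses are not shown.

```python
import numpy as np


def merge_binding_tokens(embeddings, entity_idx, modifier_idxs):
    """Fold modifier embeddings into the entity token (assumed: by summing).

    embeddings: (T, D) prompt token embeddings.
    Returns an array of shape (T - len(modifier_idxs), D).
    """
    emb = np.asarray(embeddings, dtype=float)
    composite = emb[entity_idx].copy()
    keep = np.ones(len(emb), dtype=bool)
    for m in modifier_idxs:
        composite += emb[m]
        keep[m] = False  # the modifier is now carried by the composite
    out = emb.copy()
    out[entity_idx] = composite
    return out[keep]


# Stand-in embeddings for the prompt tokens ["a", "red", "car", "."].
prompt_emb = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.0]])
bound = merge_binding_tokens(prompt_emb, entity_idx=2, modifier_idxs=[1])
print(bound)
```

Because "red" and "car" now share one token, they also share one cross-attention map, which is the binding effect the paragraph describes.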
Action-centric and Agent-aware Tokenization
Oat-VLA introduces explicit partitioning of visual tokens into object-centric groups and a high-resolution patch window centered on the robot gripper; final sequence comprises object tokens (via pooled masks on ViT features) and agent tokens (local neighborhood) (Bendikas et al., 28 Sep 2025).
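The two-part sequence can be sketched as follows, assuming binary object masks and a known gripper location in patch coordinates. The grid size, the number of masks (7), and the 3×3 window are hypothetical values chosen so that the example produces 16 tokens; they are not stated in the text above.

```python
import numpy as np


def object_tokens(features: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Mean-pool (H, W, D) features under each binary mask of shape (K, H, W)."""
    K = len(masks)
    D = features.shape[-1]
    flat_m = np.asarray(masks, dtype=float).reshape(K, -1)
    mass = np.maximum(flat_m.sum(axis=1, keepdims=True), 1e-8)
    return flat_m @ features.reshape(-1, D) / mass


def gripper_window(features: np.ndarray, center, size: int) -> np.ndarray:
    """Unpooled patch tokens from a size x size window clamped to the grid."""
    H, W, D = features.shape
    r0 = int(np.clip(center[0] - size // 2, 0, H - size))
    c0 = int(np.clip(center[1] - size // 2, 0, W - size))
    return features[r0:r0 + size, c0:c0 + size].reshape(-1, D)


def build_sequence(features, masks, gripper_rc, window: int = 3):
    """Concatenate object tokens with agent (gripper-window) tokens."""
    return np.concatenate(
        [object_tokens(features, masks),
         gripper_window(features, gripper_rc, window)], axis=0)


rng = np.random.default_rng(0)
vit_features = rng.normal(size=(16, 16, 8))   # 256 patch features
band_masks = np.zeros((7, 16, 16))
for k in range(7):
    band_masks[k, 2 * k:2 * k + 2, :] = 1.0   # toy "objects": row bands
seq = build_sequence(vit_features, band_masks, gripper_rc=(7, 7))
print(seq.shape)
```

The object tokens summarize the scene coarsely, while the window keeps full patch resolution where fine manipulation detail matters.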
3. Integration and Architectural Variants
Object-centric token merging is architecturally modular. In vision-language and multimodal systems:
- Object Pooling Modules: Merged or pooled object tokens are generated upstream of the transformer/LLM and concatenated with textual tokens (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025, Bendikas et al., 28 Sep 2025).
- Slot Merging in OCL Pipelines: Inserted after Slot Attention convergence but before feature reconstruction or the decoder (Chatzisavvas et al., 11 Mar 2026).
- Inference-time Compression: Most methods—especially for video or text-to-image—apply merging at inference, with no retraining required (Choi et al., 2024, Kahatapitiya et al., 2024, Hu et al., 2024).
- Positional Restoration: Centroid- or mask-guided token ordering maintains spatial/layout cues after compression (Lei et al., 18 Nov 2025).
- Adaptivity and Control: Algorithmic schedules support fixed-rate (target tokens/image) or adaptive-rate (one token/object) modes; hyperparameters (merge ratio, saliency threshold, barycentric weights) provide fine-grained control.
4. Empirical Performance and Computational Efficiency
Empirical metrics consistently demonstrate that object-centric token merging:
- Retains performance under extreme compression: AdaTok and CORE retain >95–97% of full performance using 2–10% as many tokens across benchmarks such as POPE, MME, SQA-I, VQA, and others (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
- Reduces FLOPs and memory: E.g., CORE reduces both FLOPs and KV cache memory in LLM inference by shrinking the visual sequence on POPE from 2880 to 63 tokens (Lei et al., 18 Nov 2025).
- Accelerates training and inference: Oat-VLA's reduction from 256 to 16 tokens allows a 2× speedup and larger batch sizes, converging to the same or higher action accuracy in half the wall-clock time on LIBERO robotic manipulation (Bendikas et al., 28 Sep 2025). Object-centric 3D ToMe yields runtime and memory reductions in video editing without loss of temporal consistency (Kahatapitiya et al., 2024).
- Improves object factorization: Slot merging in OCL improves mask IoU and object boundary consistency, surpassing prior adaptive methods and patch-centric merging (DINOSAUR pipeline, VOC/COCO benchmarks) (Chatzisavvas et al., 11 Mar 2026).
- Semantic and spatial preservation: Hard object masks outperform soft aggregation in highly entangled settings; centroid-guided sorting restores layout cues essential for multimodal reasoning (Lei et al., 18 Nov 2025).
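For a sense of scale, the sketch below estimates how the KV cache and the attention cost for visual tokens change when the sequence shrinks from 2880 to 63 tokens. The layer count and hidden size are hypothetical (roughly a 7B-parameter decoder with full multi-head attention and fp16 values), and the estimate covers visual tokens only, so it is an upper bound on the end-to-end saving rather than a reproduction of any reported figure.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes to cache keys and values for `num_tokens` tokens.

    Two tensors (K and V) per layer, each of shape (num_tokens, hidden_size).
    """
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value


def attention_flops(num_tokens: int, hidden_size: int) -> int:
    """Approximate FLOPs of QK^T plus the weighted sum over V in one layer."""
    return 4 * num_tokens * num_tokens * hidden_size


LAYERS, HIDDEN = 32, 4096          # hypothetical model dimensions
FULL, COMPRESSED = 2880, 63        # visual token counts quoted above

kv_ratio = kv_cache_bytes(FULL, LAYERS, HIDDEN) / kv_cache_bytes(
    COMPRESSED, LAYERS, HIDDEN)
attn_ratio = attention_flops(FULL, HIDDEN) / attention_flops(COMPRESSED, HIDDEN)
print(f"KV cache shrinks by about {kv_ratio:.1f}x for visual tokens")
print(f"Attention among visual tokens shrinks by about {attn_ratio:.0f}x")
```

The KV cache scales linearly with sequence length while attention among the tokens scales quadratically; text tokens, the vision encoder, and the segmenter are not compressed, so measured whole-pipeline reductions are typically smaller than these ratios.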
A comparison of representative methods and their compression-performance tradeoffs (as reported):
| Method | # Tokens | Task/Benchmark | Performance Retention (%) | Ref |
|---|---|---|---|---|
| AdaTok | ∼53 | MME | 97.6 | (Zhang et al., 18 Nov 2025) |
| CORE | 63 | POPE | 97.4 | (Lei et al., 18 Nov 2025) |
| Oat-VLA | 16 | LIBERO (actions) | ≈100 (converges 2× faster) | (Bendikas et al., 28 Sep 2025) |
| Slot merging | varies | VOC mIoU | +3.6 over baseline (gain, not retention) | (Chatzisavvas et al., 11 Mar 2026) |
5. Limitations and Open Challenges
Current object-centric token merging methods are subject to several constraints:
- Dependency on segmentation: Quality depends on the accuracy and class coverage of the segmenter (e.g., Mask2Former supports 133 classes; out-of-distribution objects can fragment into multiple tokens, raising token count rather than causing merge errors) (Lei et al., 18 Nov 2025).
- Memory and bandwidth bottlenecks: The bottleneck shifts to the vision segmentation module; operator fusion and CUDA optimization are suggested to further exploit hardware savings.
- No end-to-end mask learning: Most pipelines utilize frozen upstream segmenters or object detectors. Integrating the mask generation step with downstream gradient flow and updating jointly remains an active research direction (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
- Granularity control: Adjusting merge aggressiveness (e.g., via mask confidence or the slot-merging threshold) entails trade-offs between fidelity and efficiency.
6. Applications and Research Directions
Object-centric token merging is being generalized across domains:
- Vision-LLMs: Token compression enables scaling LVLMs and MLLMs to higher resolutions and complex queries with strict resource envelopes (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
- Video Transformers: Spatiotemporal merging, with object/background selectivity, provides significant acceleration in video understanding and editing, including for temporally coherent diffusion models (Choi et al., 2024, Kahatapitiya et al., 2024).
- Robotics and Action Planning: By segregating agent-centric and object-centric tokens, models such as Oat-VLA encode control-critical features precisely, boosting learning speed and real-world performance (Bendikas et al., 28 Sep 2025).
- Text-to-Image Synthesis: Semantic binding of attributes and objects via composite token merging significantly increases integrity for complex multi-object compositional generation (Hu et al., 2024).
Ongoing directions cover end-to-end training with mask learning, dynamic or hierarchical merging schedules, integration with foundation segmentation models (e.g., SAM), and expansion to cross-modal object tokenization for unified perception-action architectures.
7. Relation to Prior Patch-based Token Merging
Object-centric token merging represents a conceptual advance over patch-similarity-based schemes such as ToMe (Bolya et al., 2022), which merge purely by local key-space similarity. While ToMe often results in object part aggregation and achieves substantial acceleration, its lack of explicit semantic priors can hurt in complex scenes or under aggressive merging, where tokens with distinct semantics get mixed. Object-centric approaches address these shortcomings by:
- Using segmentation or attention-based saliency to prevent cross-object merging,
- Enabling “one object, one token” guarantees,
- Preserving agent or foreground information critical for reasoning or control.
Empirically, object-centric merging methods such as AdaTok, CORE, and slot merging outperform or match the speedup and compression achieved by ToMe, maintaining higher performance in semantic-rich and compositional tasks (Zhang et al., 18 Nov 2025, Lei et al., 18 Nov 2025, Chatzisavvas et al., 11 Mar 2026).
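To illustrate the difference between similarity-only merging and merging with an object prior, the sketch below runs one simplified ToMe-style bipartite step in which a pair may merge only if both tokens carry the same object label. It is an illustration of the idea, not the algorithm of ToMe or of any cited object-centric method: the alternating A/B split, plain averaging, and the `labels` input are assumptions, and at least two tokens are expected.

```python
import numpy as np


def constrained_merge(tokens, labels, r: int):
    """Merge up to r tokens from set A into set B, within one object only.

    tokens: (N, D) with N >= 2; labels: (N,) integer object ids.
    """
    tokens = np.asarray(tokens, dtype=float)
    labels = np.asarray(labels)
    a_idx = np.arange(0, len(tokens), 2)   # set A: even positions
    b_idx = np.arange(1, len(tokens), 2)   # set B: odd positions
    a, b = tokens[a_idx], tokens[b_idx]

    def unit(x):
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-8)

    sim = unit(a) @ unit(b).T              # cosine similarity, (|A|, |B|)
    same = labels[a_idx][:, None] == labels[b_idx][None, :]
    sim = np.where(same, sim, -np.inf)     # forbid cross-object pairs
    best_b = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)

    sums, counts = b.copy(), np.ones(len(b))
    keep_a = np.ones(len(a), dtype=bool)
    for i in np.argsort(-best_sim)[:r]:
        if np.isfinite(best_sim[i]):       # skip tokens with no legal partner
            sums[best_b[i]] += a[i]
            counts[best_b[i]] += 1
            keep_a[i] = False
    out_tokens = np.vstack([a[keep_a], sums / counts[:, None]])
    out_labels = np.concatenate([labels[a_idx][keep_a], labels[b_idx]])
    return out_tokens, out_labels


# Token 2 looks like token 1 but belongs to a different object.
toy_tokens = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 0.0], [0.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])
merged_tokens, merged_labels = constrained_merge(toy_tokens, toy_labels, r=1)
print(merged_tokens, merged_labels)
```

In this toy input a similarity-only rule would be free to merge token 2 into token 1 even though they belong to different objects; the label mask rules that pairing out.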
Object-centric token merging stands as a foundational primitive for efficient, semantically coherent, and scalable token processing in object-aware and multimodal machine learning systems.