Object-Centric Token Merging

Updated 14 March 2026
  • Object-centric token merging is a paradigm that aggregates tokens by grouping semantically meaningful entities using segmentation masks, slot attention, or saliency cues.
  • It employs diverse strategies such as segmentation-based pooling, slot-based merging, and saliency-driven methods to compress token sequences while preserving spatial and semantic integrity.
  • Reported results show reductions in computation and memory, including up to a 16× FLOP reduction, while retaining most of the full-token performance across vision, video, and robotic applications.

Object-centric token merging is a paradigm in transformer-based models—both in vision and multimodal architectures—in which tokens representing spatial regions, features, or entities are aggregated specifically according to object-level or semantically meaningful groupings, rather than by generic feature similarity or patch location. This operation reduces computational and memory demands, mitigates representational redundancy, and produces more semantically aligned token sequences by mapping each perceptual object (or agent entity) to a single, representative merged token. Distinguished from classical linear patch pruning, object-centric token merging is characterized by its use of segmentation masks, attention-based object saliency, or learned slot overlap to guide merge decisions, often with adaptivity to content and end-to-end differentiability.

1. Principles and Theoretical Basis

The core principle behind object-centric token merging is to exploit the natural grouping of visual input into discrete entities, a process that mirrors human visual perception, which first segments a scene into objects and then focuses attention for further reasoning (Lei et al., 18 Nov 2025). Conventional transformers process images or videos as uniform grids of patch tokens, so the token count grows quadratically with image side length (e.g., $N = 4096$ for 1024×1024 images at 16×16 patching). Operations such as self-attention, with $O(N^2)$ time and memory, become a severe bottleneck. Existing patch-centric merging (e.g., ToMe) leverages token similarity, but is prone to losing high-level semantics or spatial structure.
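The token-count arithmetic in the example above can be checked with a few lines of Python. This is only a worked illustration; the helper name `patch_token_count` is hypothetical and not taken from any cited work.

```python
def patch_token_count(height: int, width: int, patch: int) -> int:
    """Number of tokens for a grid of non-overlapping square patches."""
    return (height // patch) * (width // patch)


# 1024x1024 image with 16x16 patches: a 64 x 64 grid of tokens.
n = patch_token_count(1024, 1024, 16)

# Self-attention scores every token against every other token,
# so one attention matrix has n * n entries.
pairs = n * n

# Halving the side length quarters the token count ...
n_low = patch_token_count(512, 512, 16)

# ... and cuts the attention matrix by a factor of 16.
pair_ratio = pairs // (n_low * n_low)

print(n, n_low, pairs, pair_ratio)
```

The point of the comparison is that doubling the side length multiplies the token count by 4 and the attention matrix by 16, which is why reducing $N$ before the attention layers pays off so strongly.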

Object-centric token merging addresses these failures by explicitly mapping each object, agent, or salient region to a single token, guided by segmentation masks, attention-based object saliency, or learned slot overlap.

The general hypothesis is that this yields more efficient, semantically coherent tokenization aligned with downstream object-centric reasoning.

2. Algorithmic Strategies and Implementations

Object-centric token merging encompasses a spectrum of operational strategies, differing in their source of semantic partitioning, merge algebra, and scheduling:

Slot-based Merging (Slot Attention)

“Slot merging” addresses over-segmentation in slot-based object-centric learning (OCL), especially where multiple slots compete for the same object due to overlapping attention (Chatzisavvas et al., 11 Mar 2026). The protocol is as follows:

  • For slots $S_i$ with attention masks $A_{:,i}$, compute the soft IoU $I_{ij} = \operatorname{IoU}(A_{:,i}, A_{:,j})$ for each pair of masks, where for two masks $p$ and $q$:

$$\operatorname{IoU}(p, q) = \frac{\sum_{n=1}^N p_n q_n}{\sum_{n=1}^N (p_n + q_n - p_n q_n)}$$

  • Iteratively merge the pair $(i^*, j^*)$ of slots with maximal $I_{ij}$ above a threshold $T$ via barycentric averaging:

$$S_i \leftarrow w_i S_i + w_j S_j \quad\text{where}\quad w_i = \frac{\alpha_i}{\alpha_i+\alpha_j},\ w_j = 1-w_i$$

updating the mask via $A_{:,i} \leftarrow A_{:,i} + A_{:,j}$.

  • The process is end-to-end differentiable because gradients flow through the merge weights. The threshold $T$ is chosen in a data-driven way using the histogram triangle method.
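The merge loop described above can be sketched in NumPy as follows. This is a forward-pass illustration only, not the authors' implementation: the function names are hypothetical, the threshold is passed in as a fixed non-negative number rather than derived with the triangle method, and the weights $\alpha_i$ are assumed to be the total mass of each slot's mask, which the text above does not specify.

```python
import numpy as np


def soft_iou(p, q):
    """Soft IoU between two masks with values in [0, 1]."""
    inter = np.sum(p * q)
    union = np.sum(p + q - p * q)
    return inter / union if union > 0 else 0.0


def merge_slots(slots, masks, threshold):
    """Greedily merge slots whose masks overlap more than `threshold`.

    slots: (K, D) slot vectors; masks: (N, K) soft attention masks.
    `threshold` is assumed to be >= 0.
    """
    slots = np.asarray(slots, dtype=float).copy()
    masks = np.asarray(masks, dtype=float).copy()
    while slots.shape[0] > 1:
        k = slots.shape[0]
        best, pair = -1.0, None
        for i in range(k):
            for j in range(i + 1, k):
                iou = soft_iou(masks[:, i], masks[:, j])
                if iou > best:
                    best, pair = iou, (i, j)
        if best <= threshold:
            break  # no remaining pair overlaps enough
        i, j = pair
        # ASSUMPTION: alpha is the mask mass; the source leaves it undefined.
        alpha_i, alpha_j = masks[:, i].sum(), masks[:, j].sum()
        w_i = alpha_i / (alpha_i + alpha_j)
        slots[i] = w_i * slots[i] + (1.0 - w_i) * slots[j]
        masks[:, i] = masks[:, i] + masks[:, j]
        slots = np.delete(slots, j, axis=0)
        masks = np.delete(masks, j, axis=1)
    return slots, masks


# Two slots split the same region (positions 0-1); a third owns positions 2-3.
slots = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
masks = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
merged_slots, merged_masks = merge_slots(slots, masks, threshold=0.2)
print(merged_slots)
```

In this toy input the first two slots have a soft IoU of 1/3 and are merged, while the third slot has zero overlap with the result and is left alone. A differentiable version would express the same update in an autodiff framework so that gradients reach the merge weights.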

Segmentation-based Merging (Object-centric Pooling)

CORE and AdaTok both employ high-quality segmentation models:

  • For each mask $P_n \in [0,1]^{H\times W}$, form the merged token

$$t_n = \frac{\sum_{i} \omega_{n,i} f_i}{\sum_{i} \omega_{n,i}}$$

with weights $\omega_{n,i} = P_n[i]$ for soft aggregation, or hard assignment for exclusive tokenization (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).

  • Merged tokens are mapped to the LLM input space; centroids $c_n$ are used for positional coherence (centroid-guided sorting).
  • The number of tokens $K$ adapts per image to the number of objects, achieving adaptive-rate compression.
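A minimal NumPy sketch of the mask-weighted pooling formula and centroid sorting is given below. It is an illustration under stated assumptions, not the CORE or AdaTok code: the function name `pool_by_masks` is hypothetical, hard assignment is implemented as per-pixel argmax over masks, tokens are ordered in raster order of their centroids (row first, then column), and the projection into the LLM input space is omitted.

```python
import numpy as np


def pool_by_masks(features, masks, hard=False):
    """Pool a feature map into one token per mask.

    features: (H, W, D) feature map; masks: (K, H, W) with values in [0, 1].
    Returns (tokens, centroids), both sorted by centroid position.
    """
    K, H, W = masks.shape
    f = features.reshape(H * W, -1).astype(float)
    w = masks.reshape(K, H * W).astype(float)
    if hard:
        # ASSUMPTION: exclusive assignment = each covered pixel goes to
        # the mask with the largest value at that pixel.
        owner = w.argmax(axis=0)
        covered = w.max(axis=0) > 0
        w = ((owner[None, :] == np.arange(K)[:, None]) & covered[None, :]).astype(float)
    mass = np.maximum(w.sum(axis=1, keepdims=True), 1e-8)
    tokens = (w @ f) / mass                      # t_n = sum_i w_ni f_i / sum_i w_ni
    ys, xs = np.divmod(np.arange(H * W), W)      # pixel row and column indices
    centroids = np.stack([w @ ys, w @ xs], axis=1) / mass
    # ASSUMPTION: "centroid-guided sorting" is taken to mean raster order.
    order = np.lexsort((centroids[:, 1], centroids[:, 0]))
    return tokens[order], centroids[order]


# 2x4 feature map with scalar features 0..7; mask 0 covers the right half,
# mask 1 covers the left half.
features = np.arange(8, dtype=float).reshape(2, 4, 1)
masks = np.zeros((2, 2, 4))
masks[0, :, 2:] = 1.0
masks[1, :, :2] = 1.0
tokens, centroids = pool_by_masks(features, masks)
print(tokens.ravel(), centroids)
```

The output contains one token per mask, so the sequence length equals the number of objects found in the image. Sorting by centroid puts the left-hand object first even though its mask was listed second.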

Saliency and Video-centric Methods

Saliency-aware token merging methods (vid-TLDR and Object-Centric Diffusion) use early attention maps or predefined masks to compute object/background saliency:

  • Entropy/“sharpness” of attention rows is used to score foreground tokens; only high-sharpness tokens are kept/merged (Choi et al., 2024).
  • Token merging is preferentially performed in background, or within spatio-temporal blocks in video, often restricting merges to maintain fidelity in edited or important regions (Kahatapitiya et al., 2024).
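The entropy-based scoring idea can be illustrated with a short NumPy sketch. It assumes a row-stochastic attention matrix and treats low row entropy as high sharpness; the function names are hypothetical, and the exact saliency score used by vid-TLDR may differ from this simplification.

```python
import numpy as np


def row_entropy(attn, eps=1e-12):
    """Shannon entropy of each row of an (N, N) row-stochastic attention matrix."""
    attn = np.asarray(attn, dtype=float)
    return -np.sum(attn * np.log(attn + eps), axis=1)


def select_salient(attn, keep):
    """Indices of the `keep` tokens whose attention rows are sharpest.

    ASSUMPTION: sharpness is taken to be negative entropy, so the
    lowest-entropy rows are treated as foreground.
    """
    h = row_entropy(attn)
    return np.sort(np.argsort(h, kind="stable")[:keep])


attn = np.array([
    [1.00, 0.00, 0.00],     # fully peaked row: entropy near 0
    [1 / 3, 1 / 3, 1 / 3],  # uniform row: maximal entropy, log(3)
    [0.90, 0.05, 0.05],     # mostly peaked row
])
kept = select_salient(attn, keep=2)
print(kept)
```

Tokens that are not selected would then be candidates for merging, for example within the background or within a spatio-temporal block.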

Token Merging as Semantic Binding in Text-to-Image

For text-to-image diffusion, composite tokens are constructed by aggregating CLIP embeddings of entity and modifier tokens, yielding representations with shared cross-attention maps. Auxiliary losses (entropy, semantic consistency) refine the merged token for binding integrity (Hu et al., 2024).

Action-centric and Agent-aware Tokenization

Oat-VLA explicitly partitions visual tokens into object-centric groups plus a high-resolution patch window centered on the robot gripper. The final sequence comprises object tokens (obtained by pooling ViT features under object masks) and agent tokens (the local neighborhood around the gripper) (Bendikas et al., 28 Sep 2025).
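The assembly of object tokens and agent tokens can be sketched as below. This is a hypothetical illustration rather than the Oat-VLA implementation: the function name, the 16×16 patch grid, the split into 7 object tokens plus a 3×3 gripper window (chosen only so the total is 16), and the clamping of the window at the grid border are all assumptions.

```python
import numpy as np


def object_agent_tokens(patches, obj_masks, gripper_rc, window=3):
    """Build a short sequence: one pooled token per object, then gripper patches.

    patches: (G, G, D) patch features; obj_masks: (K, G, G) binary masks on the
    patch grid; gripper_rc: (row, col) patch index of the gripper.
    """
    G, _, D = patches.shape
    flat = patches.reshape(G * G, D)
    m = obj_masks.reshape(len(obj_masks), G * G).astype(float)
    # Object tokens: mean of the patch features under each mask.
    obj_tokens = (m @ flat) / np.maximum(m.sum(axis=1, keepdims=True), 1e-8)
    # Agent tokens: unpooled patches in a window around the gripper,
    # shifted inward if the window would leave the grid (ASSUMPTION).
    half = window // 2
    r0 = int(np.clip(gripper_rc[0] - half, 0, G - window))
    c0 = int(np.clip(gripper_rc[1] - half, 0, G - window))
    agent_tokens = patches[r0:r0 + window, c0:c0 + window].reshape(-1, D)
    return np.concatenate([obj_tokens, agent_tokens], axis=0)


rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 16, 8))       # 256 patch tokens
obj_masks = np.zeros((7, 16, 16))
for k in range(7):
    obj_masks[k, 2 * k:2 * k + 2, :] = 1.0   # toy masks: horizontal bands
seq = object_agent_tokens(patches, obj_masks, gripper_rc=(0, 15))
print(seq.shape)
```

The design keeps full-resolution features only where fine detail matters for control, near the gripper, and summarizes everything else at the object level.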

3. Integration and Architectural Variants

Object-centric token merging is architecturally modular. In vision-language and multimodal systems:

4. Empirical Performance and Computational Efficiency

Empirical metrics consistently demonstrate that object-centric token merging:

  • Retains performance under extreme compression: AdaTok and CORE retain >95–97% of full performance using 2–10% as many tokens across benchmarks such as POPE, MME, SQA-I, VQA, and others (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
  • Reduces FLOPs and memory by 5× to 16×: e.g., CORE achieves a 16× reduction in FLOPs and a 182× reduction in KV cache memory in LLM inference (POPE, 63 vs 2880 tokens) (Lei et al., 18 Nov 2025).
  • Accelerates training and inference: Oat-VLA's reduction from 256 to 16 tokens allows a 2× speedup and larger batch sizes, converging to the same or higher action accuracy in half the wall-clock time on LIBERO robotic manipulation (Bendikas et al., 28 Sep 2025). Object-centric 3D ToMe yields 8× to 10× runtime and 17× memory reductions in video editing without loss of temporal consistency (Kahatapitiya et al., 2024).
  • Improves object factorization: Slot merging in OCL improves mask IoU and object boundary consistency, surpassing prior adaptive methods and patch-centric merging (DINOSAUR pipeline, VOC/COCO benchmarks) (Chatzisavvas et al., 11 Mar 2026).
  • Preserves semantic and spatial structure: Hard object masks outperform soft aggregation in highly entangled settings; centroid-guided sorting restores layout cues essential for multimodal reasoning (Lei et al., 18 Nov 2025).

A comparison of representative methods and their compression-performance tradeoffs (as reported):

| Method | # Tokens | Task/Benchmark | Performance retention (%, unless noted) | Ref |
|---|---|---|---|---|
| AdaTok | ∼53 | MME | 97.6 | (Zhang et al., 18 Nov 2025) |
| CORE | 63 | POPE | 97.4 | (Lei et al., 18 Nov 2025) |
| Oat-VLA | 16 | LIBERO (actions) | ≈100 (converges 2× faster) | (Bendikas et al., 28 Sep 2025) |
| Slot merging | varies | VOC (mIoU) | +3.6 over baseline | (Chatzisavvas et al., 11 Mar 2026) |
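For orientation, the token counts quoted for CORE on POPE (63 vs 2880) translate into sequence-length ratios as follows. This is plain arithmetic on the reported counts; the variable names are illustrative, and the end-to-end FLOP and memory figures reported above depend on the whole pipeline, so they need not equal these ratios.

```python
full_tokens = 2880      # visual tokens without merging (as reported)
merged_tokens = 63      # visual tokens after object-centric merging (as reported)

token_ratio = full_tokens / merged_tokens      # how many times shorter
token_fraction = merged_tokens / full_tokens   # share of tokens kept
pair_ratio = token_ratio ** 2                  # shrinkage of token-pair count

print(f"{token_ratio:.1f}x fewer tokens ({100 * token_fraction:.2f}% kept)")
print(f"{pair_ratio:.0f}x fewer token pairs among the visual tokens")
```

The kept fraction of roughly 2% sits at the low end of the 2–10% range quoted for AdaTok and CORE.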

5. Limitations and Open Challenges

Current object-centric token merging methods are subject to several constraints:

  • Dependency on segmentation: Quality depends on the accuracy and class coverage of the segmenter (e.g., Mask2Former supports 133 classes; out-of-distribution objects can fragment into multiple tokens, raising token count rather than causing merge errors) (Lei et al., 18 Nov 2025).
  • Memory and bandwidth bottlenecks: Bottleneck shifts to the vision segmentation module; operator fusion and CUDA optimization are suggested to further exploit hardware savings.
  • No end-to-end mask learning: Most pipelines utilize frozen upstream segmenters or object detectors. Integrating the mask generation step with downstream gradient flow and updating jointly remains an active research direction (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
  • Granularity control: Adjusting merge aggressiveness (e.g., via mask confidence $\sigma$ or slot merging threshold $T$) entails trade-offs between fidelity and efficiency.

6. Applications and Research Directions

Object-centric token merging is being generalized across domains:

  • Vision-LLMs: Token compression enables scaling LVLMs and MLLMs to higher resolutions and complex queries with strict resource envelopes (Lei et al., 18 Nov 2025, Zhang et al., 18 Nov 2025).
  • Video Transformers: Spatiotemporal merging, with object/background selectivity, provides significant acceleration in video understanding and editing, including for temporally coherent diffusion models (Choi et al., 2024, Kahatapitiya et al., 2024).
  • Robotics and Action Planning: By segregating agent-centric and object-centric tokens, models such as Oat-VLA encode control-critical features precisely, boosting learning speed and real-world performance (Bendikas et al., 28 Sep 2025).
  • Text-to-Image Synthesis: Semantic binding of attributes and objects via composite token merging significantly increases integrity for complex multi-object compositional generation (Hu et al., 2024).

Ongoing directions cover end-to-end training with mask learning, dynamic or hierarchical merging schedules, integration with foundation segmentation models (e.g., SAM), and expansion to cross-modal object tokenization for unified perception-action architectures.

7. Relation to Prior Patch-based Token Merging

Object-centric token merging represents a conceptual advance over patch-similarity-based schemes such as ToMe (Bolya et al., 2022), which merge purely by local key-space similarity. ToMe achieves substantial acceleration and often ends up aggregating object parts, but because it lacks explicit semantic priors it can degrade in complex scenes or under aggressive merging, where tokens with distinct semantics get mixed. Object-centric approaches address these shortcomings by:

  • Using segmentation or attention-based saliency to prevent cross-object merging,
  • Enabling “one object, one token” guarantees,
  • Preserving agent or foreground information critical for reasoning or control.
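The difference between similarity-only merging and object-constrained merging can be made concrete with a simplified bipartite matching step in the spirit of ToMe. This sketch is not the ToMe implementation: it matches on the token features themselves rather than attention keys, averages without tracking token sizes, and does not preserve token order. The function name and the optional `labels` argument (per-token object ids that block cross-object merges) are hypothetical additions for illustration.

```python
import numpy as np


def bipartite_merge(tokens, r, labels=None):
    """Merge up to r tokens from set A (even indices) into set B (odd indices).

    tokens: (N, D). labels: optional (N,) object ids; if given, a token may
    only merge with a token of the same object.
    """
    tokens = np.asarray(tokens, dtype=float)
    a_idx = np.arange(0, len(tokens), 2)
    b_idx = np.arange(1, len(tokens), 2)
    a, b = tokens[a_idx], tokens[b_idx]
    an = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    bn = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    sim = an @ bn.T                               # cosine similarity, A x B
    if labels is not None:
        labels = np.asarray(labels)
        same = labels[a_idx][:, None] == labels[b_idx][None, :]
        sim = np.where(same, sim, -np.inf)        # forbid cross-object pairs
    best_b = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)
    order = np.argsort(-best_sim, kind="stable")  # most similar pairs first
    sums, counts = b.copy(), np.ones(len(b))
    keep_a = np.ones(len(a), dtype=bool)
    for i in order[:r]:
        if np.isfinite(best_sim[i]):
            sums[best_b[i]] += a[i]
            counts[best_b[i]] += 1
            keep_a[i] = False
    # Unmerged A tokens first, then B tokens averaged with what they absorbed.
    return np.concatenate([a[keep_a], sums / counts[:, None]], axis=0)


tokens = np.array([[1.0, 0.1],   # object 0
                   [1.0, 0.0],   # object 1, but looks like token 0
                   [0.0, 1.0],   # object 1
                   [0.6, 0.8]])  # object 0
labels = np.array([0, 1, 1, 0])
plain = bipartite_merge(tokens, r=1)
constrained = bipartite_merge(tokens, r=1, labels=labels)
print(plain)
print(constrained)
```

Without labels, token 0 is averaged with the look-alike token 1 from a different object. With labels, the same budget of one merge is spent on tokens 0 and 3, which belong to the same object.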

Empirically, object-centric merging methods such as AdaTok, CORE, and slot merging outperform or match the speedup and compression achieved by ToMe, maintaining higher performance in semantic-rich and compositional tasks (Zhang et al., 18 Nov 2025, Lei et al., 18 Nov 2025, Chatzisavvas et al., 11 Mar 2026).


Object-centric token merging stands as a foundational primitive for efficient, semantically coherent, and scalable token processing in object-aware and multimodal machine learning systems.
