Object-Centric Tokenization

Updated 25 December 2025
  • Object-centric tokenization is a method that converts perceptual inputs into discrete tokens representing individual objects or semantic regions.
  • It utilizes attention, segmentation, and quantization techniques to reduce token counts and computational costs while enhancing task performance.
  • Integration with multimodal systems improves vision-language and video analysis, driving advances in applications like robotic manipulation and autonomous driving.

Object-centric tokenization is a paradigm for encoding images, video, or sensor data into a concise set of discrete tokens, each intended to represent a coherent object, entity, or semantic region, rather than spatially uniform patches or frame-level features. This approach formalizes and exploits the compositionality of scenes, focusing on the inductive bias that objects, not pixels, are the principal units of reasoning, inference, and downstream task execution in both human cognition and advanced machine learning models. Object-centric tokenization is motivated by the inefficiency and semantic dispersion of traditional dense-grid or patch-based tokenizers, particularly in vision-language and multimodal systems, and has led to substantial reductions in token count and computational cost, along with improvements in interpretability, alignment, and task performance (Bendikas et al., 28 Sep 2025, Feng et al., 25 Nov 2024, Tian et al., 1 Jul 2024, Chi et al., 23 May 2025, Li et al., 25 Nov 2025).

1. Principles of Object-Centric Tokenization

Object-centric tokenization is built on the assumption that semantic understanding is best achieved by grouping perceptual input by objects or parts rather than by spatial partitions. Within this paradigm:

  • Tokens represent objects or semantically independent regions (SIRs), typically enforced by attention, segmentation, or explicit object proposal mechanisms. In the HOOK framework, SIRs are formally defined as image regions “semantically independent of any pixels outside themselves,” with strict non-overlap and full coverage constraints: each token covers at most one object, and each object is covered by at most one token (Shao et al., 27 Mar 2024).
  • Slot-based attention and permutation-invariant grouping are commonly used to aggregate patch features into tokens, exploiting attention-based routing and competition to encourage exclusivity and object allocation (Chi et al., 23 May 2025, Li et al., 25 Nov 2025, Bao et al., 2023).
  • Pooling and projection modules combine features within predicted object masks or regions (commonly by mean pooling) and project the resulting token embeddings into the space required by downstream modules, e.g., the LLM input space (Bendikas et al., 28 Sep 2025, Feng et al., 25 Nov 2024); a minimal sketch of this step follows this list.
  • Discrete or quantized token spaces are used for compression and to facilitate discrete autoregressive modeling in multimodal LLMs (Chi et al., 23 May 2025, Bao et al., 2023).

This approach contrasts with grid-based or frame-based schemes, where tokens spatially entangle object boundaries and dilute entity-specific semantics through global pooling or coarse resampling, reducing interpretability and computational efficiency (Feng et al., 25 Nov 2024).

2. Core Methodologies and Architectural Instantiations

Leading approaches to object-centric tokenization employ a range of detection, segmentation, attention, and quantization techniques. Representative instantiations include:

  • Oat-VLA (Object-Agent-centric Tokenization for Vision-Language-Action): Replaces standard ViT patch tokenization (e.g., a 224×224 image → 256 tokens) with two streams: (1) an object-centric branch extracts N binary object masks (e.g., with FT-Dinosaur), averages patch features within each mask, and produces one token per object (typically N=7); (2) an agent-centric branch detects the gripper keypoint with a Faster R-CNN and pools patch embeddings over a 3×3 grid centered on the agent (P=3, giving 9 tokens); the total visual token count per image is 16 after MLP projection (Bendikas et al., 28 Sep 2025).
  • VideoOrion: Employs a "detect–segment–track" pipeline using expert vision models (GroundingDINO, SAM, XMem) to first detect and segment objects in key frames, track objects across the video, and spatially-temporally pool mask-aggregated patch features across object tracks. Each object yields a single token encoding the appearance and dynamics across its trajectory, yielding explicit disentanglement of entities (Feng et al., 25 Nov 2024).
  • HOOK (Homogeneous visual tOKenizer): Introduces an Object Perception Module to split images into fine 4×4-pixel seeds, aggregates these via stacked local and global self-attention, then uses cross-attention with learnable queries to merge seeds into a variable number N of homogeneous, object-aligned tokens. These are designed to correspond one-to-one with SIRs (Shao et al., 27 Mar 2024).
  • Slot-MLLM: Builds on slot attention (see the sketch after this list) to group vision encoder patches into N slot embeddings, then discretizes these using residual vector quantization (RVQ) for autoregressive modeling within a multimodal LLM. The slot tokens encode both local object details and high-level semantics (Chi et al., 23 May 2025).
  • OC-VTP (Object-Centric Vision Token Pruning): Applies slot attention over mid-level vision encoder outputs to select the S most representative tokens ("object slots") by minimizing the reconstruction error (AW-MSE) of the original patch features, providing a principled reduction in token count at competitive accuracy (Li et al., 25 Nov 2025).
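Because several of the entries above (Slot-MLLM, OC-VTP, MoTok) build on slot attention, a minimal single-iteration sketch of the grouping step is given below. It follows the standard slot-attention formulation (softmax over slots so that patches are softly assigned to competing slots, followed by a weighted-mean update); the GRU and MLP refinements of the full algorithm are omitted, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def slot_attention_step(slots, inputs, q_proj, k_proj, v_proj, eps=1e-8):
    """One iteration of slot attention: slots compete for input patches.

    slots:  (S, D) current slot embeddings
    inputs: (P, D) patch features serving as keys and values
    """
    q = q_proj(slots)                                    # (S, D)
    k = k_proj(inputs)                                   # (P, D)
    v = v_proj(inputs)                                   # (P, D)
    logits = q @ k.t() / q.shape[-1] ** 0.5              # (S, P)
    # Softmax over *slots*, so each patch is (softly) claimed by one slot.
    attn = F.softmax(logits, dim=0)                      # (S, P)
    attn = attn / (attn.sum(dim=1, keepdim=True) + eps)  # weighted mean per slot
    return attn @ v                                      # updated slots (S, D)

D, S, P = 64, 7, 256
lin = lambda: torch.nn.Linear(D, D, bias=False)
q_proj, k_proj, v_proj = lin(), lin(), lin()
slots = torch.randn(S, D)
patches = torch.randn(P, D)
slots = slot_attention_step(slots, patches, q_proj, k_proj, v_proj)
print(slots.shape)  # torch.Size([7, 64])
```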

These methods utilize various mechanisms for populating object tokens: unsupervised discovery guided by motion segmentation and vector quantization (as in MoTok (Bao et al., 2023)), explicit detection and segmentation, or self-supervised slot attention. Several approaches (e.g., HOOK, Oat-VLA) support adaptive token counts for task-specific sparsity or density.
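The discretization step mentioned above (plain vector quantization in MoTok, residual vector quantization in Slot-MLLM) amounts to repeated nearest-neighbor lookups against learned codebooks. The following sketch shows only the inference-side quantization, with made-up codebook sizes and without the commitment or reconstruction losses used during training.

```python
import torch

def residual_vq(tokens, codebooks):
    """Residual vector quantization of continuous object/slot tokens.

    tokens:    (N, D) continuous embeddings
    codebooks: list of (K, D) tensors, one per quantization level
    Returns the quantized tokens and the discrete code indices per level.
    """
    residual = tokens
    quantized = torch.zeros_like(tokens)
    indices = []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)   # (N, K) distances to codes
        idx = dists.argmin(dim=1)                 # nearest code per token
        chosen = codebook[idx]                    # (N, D)
        quantized = quantized + chosen
        residual = residual - chosen              # quantize what remains next level
        indices.append(idx)
    return quantized, indices

N, D, K, depth = 7, 64, 512, 4
codebooks = [torch.randn(K, D) for _ in range(depth)]
tokens = torch.randn(N, D)
quantized, codes = residual_vq(tokens, codebooks)
print(quantized.shape, [c.shape for c in codes])
```

With a single codebook (depth 1) this reduces to ordinary VQ; stacking levels lets each token be represented by a short sequence of discrete codes, as in the "32 slots × 4 RVQ depth" configuration reported for Slot-MLLM.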

3. Integration in Vision-Language and Multimodal Systems

Object-centric tokens are designed for compatibility and efficiency within transformer-based vision-language models (VLMs), video-LLMs, and multimodal LLMs (MLLMs).

Integration requires careful alignment—not only at the embedding level but also in supporting reasoning chains (e.g., chain-of-thought prompting in planning tasks (Tian et al., 1 Jul 2024)) and trajectory-based or temporal modeling in video or robotic scenarios.
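At the interface level, integration typically amounts to projecting the object tokens into the language model's embedding space and splicing them into the input sequence alongside text embeddings. The sketch below shows this concatenation with a small adapter module; the projection architecture, sequence layout, and dimensions are illustrative assumptions rather than any specific paper's recipe.

```python
import torch
import torch.nn as nn

class ObjectTokenAdapter(nn.Module):
    """Projects pooled object tokens into an LLM's embedding space and
    prepends them to the embedded text prompt."""

    def __init__(self, d_vision, d_model):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, object_tokens, text_embeds):
        # object_tokens: (N_obj, d_vision), text_embeds: (T, d_model)
        visual = self.proj(object_tokens)               # (N_obj, d_model)
        return torch.cat([visual, text_embeds], dim=0)  # (N_obj + T, d_model)

adapter = ObjectTokenAdapter(d_vision=768, d_model=4096)
object_tokens = torch.randn(16, 768)   # e.g. 7 object + 9 agent tokens
text_embeds = torch.randn(32, 4096)    # embedded prompt tokens
sequence = adapter(object_tokens, text_embeds)
print(sequence.shape)  # torch.Size([48, 4096])
# The combined sequence would then be fed to the LLM, e.g. through an
# inputs_embeds-style argument in common transformer implementations.
```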

4. Quantitative and Practical Impact

Object-centric tokenization enables dramatic reductions in token count, computational savings, and improvements in downstream performance:

  • Token reduction: Oat-VLA reduces the visual token count from 256 (grid-based ViT) to just 16, a 93.75% reduction, doubling batch sizes and sample throughput on 8×H100 clusters for robotic manipulation (Bendikas et al., 28 Sep 2025); a back-of-envelope attention-cost calculation follows this list. HOOK achieves 6–8 tokens per image for classification/segmentation, with speedups of 1.5–2.8× over PatchEmbed (Shao et al., 27 Mar 2024).
  • Task performance: On the LIBERO suite, Oat-VLA converges at least twice as fast as OpenVLA and attains higher or equivalent success rates (e.g., average 78.6% vs 76.5% with LoRA fine-tuning) (Bendikas et al., 28 Sep 2025). In real pick-and-place experiments, Oat-VLA outperforms OpenVLA both in-distribution (72% vs 52%) and out-of-distribution (46% vs 29%) (Bendikas et al., 28 Sep 2025).
  • Video QA and referring: VideoOrion compresses videos from thousands of patch/frame tokens to ≤72 tokens (object + context) with an order-of-magnitude reduction in cross-attention computation. MVBench accuracy increases to 58.3% (VideoLLaMA2 baseline: 53.4%), and referring tasks outperform Artemis and prior baselines (Feng et al., 25 Nov 2024).
  • Planning in autonomous driving: TOKEN reduces trajectory L2 error by 27% and collision rates by 39% in long-tail driving scenarios, outperforming PARA-Drive and Video-LLaMA variants (Tian et al., 1 Jul 2024).
  • Computational savings: OC-VTP reduces LLaVA-1.5 FLOPs by 6.5× (from 6.30T to 0.97T) at 11.1% token retention; on LLaVA-NeXT, a 17× reduction in prefill FLOPs is achieved (Li et al., 25 Nov 2025).
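The calculation referenced in the token-reduction bullet above shows why these savings compound: per-layer self-attention cost grows quadratically in sequence length, so shrinking the visual prefix shrinks both the visual-only and the mixed visual-text attention terms. The text-token budget below is an illustrative assumption, not a figure from the cited papers.

```python
# Quadratic self-attention cost per layer, up to constant factors:
# cost ~ (n_visual + n_text) ** 2
def attn_cost(n_visual, n_text):
    return (n_visual + n_text) ** 2

n_text = 64                      # illustrative text-token budget
grid = attn_cost(256, n_text)    # grid-based ViT tokenization
obj = attn_cost(16, n_text)      # object/agent-centric tokenization

print(f"visual token reduction:   {1 - 16 / 256:.2%}")    # 93.75%
print(f"attention-cost reduction: {1 - obj / grid:.2%}")  # 93.75% with these numbers
```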

Results on image reconstruction (Slot-MLLM vs. SEED vs. LaViT) and on multimodal benchmarks (GQA, CIDEr, POPE, NaturalBench) show consistent gains for object-centric schemes (Chi et al., 23 May 2025). Further, cluster purity in unsupervised MoTok is substantially higher than that of a vanilla VQ-VAE, indicating emergent semantic alignment (Bao et al., 2023).

5. Interpretability, Emergence, and Token Semantics

  • Disentanglement: By design, object-centric tokens correspond to distinct semantic entities or parts, as verified by analyses of attention maps, qualitative visualizations, and increased cluster purity/ARI scores versus patch/VQ baselines (Bao et al., 2023, Shao et al., 27 Mar 2024, Feng et al., 25 Nov 2024).
  • Interpretability in pruning: OC-VTP shows that each retained slot token aligns with object centers and salient scene elements, with large or composite objects split across several slots and redundant background regions elided (Li et al., 25 Nov 2025).
  • Motion-guided discovery: In MoTok, combining slot attention with motion segmentation yields object-specific token clusters without supervision, outperforming prior art on synthetic and real benchmarks for unsupervised object discovery (Bao et al., 2023).
  • Semantic independence: HOOK’s SIR coverage ensures each token is maximally homogeneous, mapping exactly and exclusively to one object (Shao et al., 27 Mar 2024).

Emergence of object-aligned tokens is thus a direct consequence of pooling, masking, attention, and loss-based routing mechanisms, reinforced by auxiliary semantic objectives where used.
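The cluster-purity and ARI comparisons cited above are standard clustering metrics for checking whether token/slot assignments recover ground-truth objects. A minimal sketch of how they can be computed is shown below, using scikit-learn for the ARI and a confusion-matrix maximum for purity; the toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def cluster_purity(true_labels, pred_clusters):
    """Fraction of samples falling in the majority true class of their cluster."""
    cm = confusion_matrix(true_labels, pred_clusters)
    return cm.max(axis=0).sum() / cm.sum()

# Toy example: ground-truth object IDs per patch vs. predicted slot/token IDs.
true_objects = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred_slots = np.array([1, 1, 1, 0, 0, 2, 2, 0])

print("purity:", cluster_purity(true_objects, pred_slots))
print("ARI:   ", adjusted_rand_score(true_objects, pred_slots))
```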

6. Limitations, Extensions, and Prospective Directions

  • Pipeline dependencies: Current object-centric tokenization efficiency often depends on external expert models for object detection, segmentation, and tracking (e.g., GroundingDINO, SAM, XMem in VideoOrion (Feng et al., 25 Nov 2024)), which introduce overhead and may degrade in low-resolution, occluded, or out-of-domain inputs.
  • Heuristic ordering and lack of explicit positions: Some pipelines (e.g., VideoOrion) sort tokens by first appearance rather than embedding explicit temporal or spatial order (Feng et al., 25 Nov 2024).
  • Training overhead: End-to-end pipelines with slot attention, residual vector quantization, and diffusion models (Slot-MLLM (Chi et al., 23 May 2025)) may add complexity to pretraining and require tuning of loss weight trade-offs for optimal performance.
  • Potential extensions: Directions include video slot tokenization with temporal attention, 3D and AR streams with "object slot" embeddings for shape/dynamics, and text token pruning analogous to visual token pruning (Feng et al., 25 Nov 2024, Li et al., 25 Nov 2025).
  • Emergent reasoning capabilities: By aligning object-level tokens and reasoning architectures (TOKEN), structured chain-of-thought and planning can be unlocked for safety-critical domains (Tian et al., 1 Jul 2024).

A plausible implication is that continued research into adaptive, task-aware, and feedback-driven object-centric tokenization will further close the gap between semantic efficiency and model capacity across perception, language, planning, and generation tasks.

7. Summary Table: Representative Architectures

| Approach | Paradigm / Module | Token Count (per sample) |
|---|---|---|
| Oat-VLA (Bendikas et al., 28 Sep 2025) | Masked pooling (object/agent) | 16 (7 object + 9 agent) |
| VideoOrion (Feng et al., 25 Nov 2024) | Detect–Segment–Track, pooling | ≤64 (objects), ≤72 (total) |
| HOOK (Shao et al., 27 Mar 2024) | Seed splitting + cross-attention | 6–8 (adjustable) |
| Slot-MLLM (Chi et al., 23 May 2025) | Slot attention + RVQ | 32 slots × 4 RVQ depth |
| OC-VTP (Li et al., 25 Nov 2025) | Slot attention pruning | 5–33% of original tokens |
| MoTok (Bao et al., 2023) | Motion-guided VQ, slot attention | K slots / discrete codes |

This landscape underscores the rapid coalescence of object-centric tokenization as a central technology for efficient, interpretable, and semantically rich multimodal representation learning.
