
Localized Visual Tokenization in Multimodal Models

Updated 21 February 2026
  • Localized visual tokenization is a method that converts image regions into discrete, semantically coherent tokens for precise multimodal representation.
  • It employs adaptive techniques like slot attention, region proposals, and superpixel grouping to enhance interpretability and reduce token redundancy.
  • The approach supports tasks such as image captioning, region grounding, and generative image modeling by seamlessly integrating visual tokens with language models.

Localized visual tokenization refers to the class of methods that convert visual data—particularly images—into discrete, spatially grounded token sequences that correspond to meaningful, often semantically or structurally coherent, regions or objects in the image. These tokens are designed to be directly consumable by LLMs and multimodal transformers, thereby bridging the gap between visual and linguistic representations for tasks such as multimodal understanding, grounded reasoning, and generative image modeling. Localized visual tokenization stands in contrast to global or uniform grid-based approaches, enabling fine-grained, adaptable modeling of image content and facilitating tasks such as image captioning, referring-expression comprehension, semantic parsing, and image generation.

1. Principles and Motivations of Localized Visual Tokenization

The motivation for localized visual tokenization arises from limitations of traditional visual data representations, particularly the grid-based patch tokenization used in Vision Transformers (ViTs). Uniform patching can indiscriminately straddle semantic boundaries, producing tokens that encode mixed content and limit the fidelity of downstream reasoning or generation. Key drivers for localization in visual tokenization include:

  • Semantic purity: Tokens are encouraged to correspond to semantically coherent regions (objects, parts, or concepts) (Lew et al., 2024), avoiding the mixing of unrelated semantics.
  • Flexible length and region adaptation: Instead of a fixed-length, uniformly spaced sequence, localized tokenizers yield a variable-length sequence adapting to image content and complexity (Jin et al., 2023, Zhang et al., 1 Sep 2025).
  • Improved interpretability and compositionality: Each token can be traced to a specific region or object, facilitating region-level operations such as grounding and editing (Ma et al., 2024, Chi et al., 23 May 2025).
  • Enabling next-token autoregression in MLLMs: Localized visual tokens integrate seamlessly with language tokens in autoregressive transformers, enabling unified text–vision generation and reasoning (Zhao et al., 7 Feb 2025, Chi et al., 23 May 2025).
  • Robustness and linguistic inclusivity: By bypassing language-dependent tokenization (e.g., BPE), visually rendered or region-based approaches improve support across scripts, typographic variants, and low-resource languages (Susanto et al., 12 Jan 2026, Xing et al., 21 Oct 2025).
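For contrast, the uniform grid baseline these methods improve upon can be sketched in a few lines. This is a minimal, illustrative NumPy patchifier (function name and sizes are arbitrary, not any particular model's implementation); note that the token count is fixed by the grid, regardless of how many objects the image contains:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Uniform grid tokenization as in a vanilla ViT: the image is cut
    into fixed-size, non-overlapping patches regardless of content, so
    a single patch may straddle an object boundary and mix semantics."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # (num_patches, patch*patch*c): one flat token per grid cell
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

img = np.zeros((64, 64, 3), dtype=np.float32)
tokens = patchify(img)   # always 16 tokens for a 64x64 image
print(tokens.shape)      # (16, 768)
```

Localized tokenizers replace this fixed-length grid with a content-adaptive, variable-length sequence aligned to regions or objects.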

2. Core Architectures and Methodologies

Localized visual tokenization encompasses a range of algorithmic paradigms, from patch- and region-based methods to object-centric slot attention and adaptive superpixel/groupwise approaches. Notable frameworks and their methodologies include:

  • Patch-wise autoencoders with quantization: QLIP tokenizes an image by splitting into non-overlapping patches, encoding via a ViT backbone, projecting to a low-dimensional latent, quantizing via binary-spherical quantization (BSQ), and reconstructing via an up-projection and decoder. Discrete tokens correspond to patch locations with learned positional embeddings, ensuring spatial localization (Zhao et al., 7 Feb 2025).
  • Region-proposal-based frameworks: Groma applies a region proposer (class-agnostic DETR over a ViT feature pyramid) to extract boxes. Each box is encoded via multi-scale ROIAlign, projected, and injected into the LLM token stream as a region token. Proxy tokens allow referential region output (Ma et al., 2024).
  • Slot attention and object-centric tokenization: Slot-MLLM leverages a ViT–Q-Former backbone followed by iterative slot attention, wherein learnable slot queries compete to explain spatial features, yielding object-centric slot embeddings. These slots are discretized via multi-round residual vector quantization, integrated into a unified next-token LLM framework (Chi et al., 23 May 2025).
  • Superpixel-based tokenization: SuiT employs a superpixel algorithm (FastSLIC) to oversegment images into semantically coherent, irregularly-shaped regions. Feature extraction and pooling (average and max) over each superpixel yield tokens that replace patch tokens in ViTs (Lew et al., 2024).
  • Spatially-adaptive tokenization via region partitioning: GPSToken segments the image via entropy-driven region partitioning, parameterizes each region as a 2D Gaussian (shape, position) with texture features, and feeds these tokens to a transformer backbone, supporting non-uniform, texture-homogeneous tokenization (Zhang et al., 1 Sep 2025).
  • Unsupervised visual concept tokenization: VCT uses cross-attention between learnable slots (prototypes) and image tokens, enforced by a disentangling loss, to yield tokens each encoding an independent visual factor or object (Yang et al., 2022).
  • Hierarchical, pixel-level differentiable approaches: dHT merges pixelwise features via differentiable clustering and model selection (information criteria), forming a variable-length sequence of adaptive region tokens for transformer input with full retro-compatibility (Aasan et al., 4 Nov 2025).
  • Dynamic, content-dependent tokenization: LaVIT's dynamic discrete tokenizer employs a learned selector and merger to adaptively reduce the patch grid to a semantic token sequence, with quantization ensuring compactness (Jin et al., 2023).
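As a concrete illustration of the object-centric paradigm above, here is a stripped-down sketch of the slot-attention grouping step used by tokenizers such as Slot-MLLM. The learned key/query/value projections, GRU slot update, and the residual vector quantization stage are all omitted for brevity, and all names are illustrative:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(feats, num_slots=4, iters=3, seed=0):
    """Minimal slot-attention loop: the softmax runs ACROSS slots, so
    slot queries compete to explain each spatial feature, yielding
    object-centric region tokens rather than grid-aligned patches."""
    n, d = feats.shape
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        logits = feats @ slots.T / np.sqrt(d)        # (n, num_slots)
        attn = softmax(logits, axis=1)               # competition across slots
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = attn.T @ feats                       # weighted mean per slot
    return slots                                     # (num_slots, d) tokens

feats = np.random.default_rng(1).normal(size=(196, 64))  # e.g. a 14x14 ViT grid
tokens = slot_attention(feats)
print(tokens.shape)  # (4, 64)
```

The key design choice is the normalization axis: softmaxing over slots (rather than over inputs, as in standard attention) is what forces each feature to be assigned to roughly one slot.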

3. Integration with Multimodal LLMs

Localized visual tokenization is engineered to interface with modern autoregressive LLMs/MLLMs, supporting both unified representation and unified generation. In practice, region, slot, or patch tokens are quantized to discrete codes and injected into the language model's token stream alongside text tokens, so that a single next-token transformer can both consume localized visual evidence for grounded understanding (e.g., Groma's region tokens) and emit visual tokens during generation (e.g., QLIP, LaVIT, Slot-MLLM).
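One common integration pattern can be sketched as follows, under assumed vocabulary sizes and sentinel-token conventions (none of these constants come from a specific model): quantized visual codes are offset into an extended LLM vocabulary so that text and image tokens share a single autoregressive stream.

```python
# Hypothetical sketch: visual codebook indices are offset into an
# extended vocabulary and bracketed by sentinel ids, so a standard
# next-token LLM can attend over (and generate) both modalities.
TEXT_VOCAB = 32_000                       # assumed text vocabulary size
BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1     # begin/end-of-image sentinels
VISUAL_OFFSET = TEXT_VOCAB + 2            # visual codes live above here

def interleave(text_ids, visual_codes):
    """Splice quantized visual tokens into the text-token stream,
    inserting the image span right after the first text token."""
    img_span = [BOI] + [VISUAL_OFFSET + c for c in visual_codes] + [EOI]
    return text_ids[:1] + img_span + text_ids[1:]

seq = interleave([101, 7, 42], [3, 0, 5])
print(seq)  # [101, 32000, 32005, 32002, 32007, 32001, 7, 42]
```

Because the visual codes are just extra vocabulary entries, the same cross-entropy next-token objective covers both modalities with no architectural change to the LLM.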

4. Empirical Advances and Comparative Outcomes

Empirical studies demonstrate substantial advantages of localized visual tokenization over traditional patch or global tokenizers:

  • Semantic fidelity and efficiency: Superpixel and adaptively merged tokens result in higher semantic integrity, fewer tokens (yielding lower FLOPs and latency), and improved classification, transfer, and segmentation performance (Lew et al., 2024, Aasan et al., 4 Nov 2025, Jin et al., 2023, Xing et al., 21 Oct 2025).
  • Object- and region-level comprehension: Slot attention and region-proposal-based methods yield object- or region-specific tokens, leading to superior performance on region captioning, referring expression comprehension (e.g., Groma achieves 86.52% mAcc on referring expression benchmarks), and grounded VQA (Ma et al., 2024, Chi et al., 23 May 2025).
  • Robustness to noise and multilinguality: Vision-centric and visually rendered tokenization pipelines like SeeTok and DualGPT improve robustness to typographic noise and generalization to low-resource scripts, addressing over-segmentation and misalignment issues of subword tokenization (Susanto et al., 12 Jan 2026, Xing et al., 21 Oct 2025).
  • Image reconstruction and generation: Approaches like QLIP and GPSToken achieve state-of-the-art rFID and FID scores for image generation using a small number of spatially-adaptive tokens (e.g., GPSToken-M128 achieves FID=1.50 on ImageNet256), and QLIP enables strong zero-shot and generative performance with a unified tokenizer (Zhao et al., 7 Feb 2025, Zhang et al., 1 Sep 2025).
  • Efficiency gains: Dynamic tokenization and visual rendering can compress token counts by 4–8×, with corresponding reductions in computational overhead (e.g., SeeTok lowers FLOPs by 70.5% relative to pure-text baselines) (Xing et al., 21 Oct 2025).

5. Applications and Task-Specific Outcomes

Localized visual tokenization underpins a comprehensive suite of multimodal tasks:

  • Referring expression comprehension and region grounding: Embedding localization into tokenization (e.g., Groma) enables fine-grained mapping between text and regions, supporting explicit region annotation, description, and referential dialogue (Ma et al., 2024).
  • Image captioning and region captioning: Region-level token representations support targeted captioning and improve both global and region-specific metrics (CIDEr, METEOR) (Chi et al., 23 May 2025, Ma et al., 2024).
  • Image editing and compositional generation: Object-centric tokens (Slot-MLLM) enable localized image editing by manipulating specific tokens or slots associated with objects, supporting applications such as color swaps or attribute modification (Chi et al., 23 May 2025).
  • Semantic segmentation and scene decomposition: Approaches relying on superpixels, hierarchical clustering, or concept tokens enhance segmentation by aligning tokens to coherent object or concept regions, often yielding improved ARI and mean IoU (Lew et al., 2024, Aasan et al., 4 Nov 2025, Yang et al., 2022).
  • Multilingual and script-inclusive text modeling: Visual rendering of text and region grounding in non-Latin scripts enable fairer handling of morphologically diverse or under-resourced languages (Susanto et al., 12 Jan 2026, Xing et al., 21 Oct 2025).
  • Raster-to-vector and vector-based reconstructions: Differentiable hierarchical tokenization supports out-of-the-box raster-to-vector conversion by extracting region-bounded patches suitable for vectorization (Aasan et al., 4 Nov 2025).

6. Analysis, Challenges, and Future Prospects

While localized visual tokenization has delivered major advances, several challenges and open questions remain:

  • Token–object correspondence and redundancy: Not all “localized” tokens correspond exactly to classical objects, particularly when parts overlap or hierarchical part–whole relations exist; redundancy control is essential (Chi et al., 23 May 2025, Jin et al., 2023).
  • Granularity–efficiency tradeoff: Increased token localization may boost interpretability and downstream grounding, but can escalate sequence length if not properly merged or filtered; dynamic methods address this via content-adaptive schemes (Jin et al., 2023, Zhang et al., 1 Sep 2025).
  • Compatibility and architectural integration: Localized tokens must interface seamlessly with transformer-based backbones, either through embedding and positional schemes (e.g., mean-injection, mask pooling, 2D learned positions) or by projecting to LLM-compatible dimensions (Aasan et al., 4 Nov 2025, Lew et al., 2024).
  • Tokenization for non-visual data: For script-aware and multilingual settings, simply rendering text visually is insufficient unless the underlying tokenization and alignment reflect linguistic structure of rare or complex scripts (Susanto et al., 12 Jan 2026, Xing et al., 21 Oct 2025).
  • Adaptivity and supervision: Unsupervised and weakly-supervised approaches (e.g., cross-attention prototypes, entropy-based partitioning) hold promise for automatic factor discovery but require further work to guarantee semantic exclusivity in challenging real-world settings (Yang et al., 2022, Zhang et al., 1 Sep 2025).

Localized visual tokenization constitutes a critical advance in multimodal machine learning, enabling interpretable, high-fidelity, and flexible translation between the structure of images and the symbolic domain of LLMs. Ongoing research addresses challenges in granularity selection, region–semantic alignment, and robust integration with large-scale pre-trained models for both discriminative and generative tasks.
