Disentangled Visual Tokenization (DiVT)

Updated 31 May 2026

DiVT is a visual tokenization approach that disentangles perceptual, structural, and semantic factors for clear, discrete image representations.
It uses dual codebooks and hierarchical factorization to improve image reconstruction quality and multimodal reasoning; reported results include 81.6% zero-shot ImageNet accuracy alongside a PSNR of 23.56 for DualToken.
Semantic clustering and pretrained alignment enhance token interpretability while offering flexible granularity to optimize compute efficiency and latency in multimodal models.

Disentangled Visual Tokenization (DiVT) encompasses a family of approaches that structure the tokenization of visual data so that discrete codes represent semantically or structurally independent aspects of an input image. The key principle is to separate ("disentangle") perceptual, structural, and semantic factors at the tokenization stage. Downstream transformer-based models, especially multimodal LLMs (MLLMs), can then process images as discrete, interpretable, and compositional visual units that more closely resemble natural language tokens. This paradigm addresses core challenges in visual understanding, image generation, and multimodal reasoning by providing greater expressiveness, more human-aligned representations, and improved computational efficiency.

1. Motivations and Foundational Principles

Traditional visual tokenizers, such as VQ-VAE and VQGAN, employ a single codebook to quantize latent representations, resulting in entangled visual tokens where each token can capture a mix of local details, global semantics, and textures. This entanglement can hinder multimodal modeling in two critical scenarios:

In unified autoregressive MLLMs, a tension arises between maximizing token informativeness for high-level understanding and ensuring the token stream supports high-fidelity reconstruction and generation. Forcing a single codebook or embedding space to serve both goals leads to failure modes such as degraded reconstruction (blurred images) or impaired semantic alignment (zero-shot classification drops).
For adapting images to word-like units compatible with LLMs, densely entangled embeddings yield token streams that are highly redundant, spatially uniform, and poorly aligned with discrete, concept-level natural language tokens.

DiVT aims to ameliorate these conflicts by separating (factorizing or clustering) the information space, resulting in vocabularies where each token or codebook is constrained to specialize in a distinct aspect of the visual signal, e.g., low-level structure, mid-level texture, or high-level semantics (Song et al., 18 Mar 2025, Bai et al., 2024, Lee et al., 18 May 2026, Yang et al., 2022).

2. Major DiVT Methodologies

2.1 Dual Codebooks and Hierarchical Factorization

One DiVT theme is the use of multiple independent codebooks, each attached to a different level of a vision backbone's hierarchy. For example, DualToken employs two codebooks: one trained on shallow ViT layers (perceptual codebook) to capture local textures and structure, and one on the deepest layer (semantic codebook) to encode object-level, conceptual content (Song et al., 18 Mar 2025). Each codebook’s tokens are learned via residual vector quantization (RQ-VAE); only shallow tokens are used for image reconstruction, while deep tokens drive semantic reasoning.

A more general form, Factorized Quantization (FQ), decomposes a large codebook into $M$ independent sub-codebooks, each quantizing a branch of the encoder’s feature projection. Orthogonality regularization is imposed across sub-codebooks to enforce decomposability such that, for example, one sub-code encodes low-level structure, another mid-level texture (DINO-v2 aligned), another global semantics (CLIP aligned) (Bai et al., 2024). Learning objectives include reconstruction, GAN, perceptual (LPIPS), disentanglement regularization, and representation regression (CLIP/DINO).
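To make the benefit of factorization concrete, the following sketch compares how many codebook entries must be stored with how many distinct code combinations a single patch can take. The helper name `codebook_stats` is hypothetical; the codebook size of 16,384 is the value reported for FQGAN later in this article.

```python
def codebook_stats(codebook_size: int, num_codebooks: int) -> dict:
    """Compare storage cost with expressive capacity for M sub-codebooks."""
    return {
        # Entries that must actually be stored and searched.
        "stored_entries": codebook_size * num_codebooks,
        # Distinct code combinations available to a single patch.
        "joint_combinations": codebook_size ** num_codebooks,
    }


for m in (1, 2, 3):
    print(m, codebook_stats(16_384, m))
```

Storage grows linearly in the number of sub-codebooks while the number of joint combinations grows exponentially, which is also why joint autoregressive prediction becomes harder as sub-codebooks are added.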

2.2 Semantic Clustering and Concept Tokenization

Another branch of DiVT is based on clustering or concept induction to form visual tokens that correspond to object parts, objects, or distinct semantic regions, rather than rigid spatial patches (Lee et al., 18 May 2026, Yang et al., 2022). This proceeds in stages:

Patch embeddings from a pre-trained ViT or VQ-VAE are clustered based on cosine similarity.
Adaptive centroid selection and refinement yield $K$ clusters per image, with $K$ varying adaptively with image complexity.
Cluster aggregation (via cross-attention or pooling) produces $K$ "word-like" visual tokens per image, each ideally mapping to a coherent concept or salient region.
Disentangling losses (e.g., mutual-exclusion or cross-entropy over slot swaps) drive each token to represent one independent factor or concept, following the principle that swaps should selectively alter image reconstructions or concept assignments.

2.3 Integration with Pretrained Vision Models

To anchor sub-codebooks or clusters to specific semantic levels, some DiVT variants align one or more codes to the outputs of frozen, large-scale pretrained vision encoders, such as CLIP (for high-level semantics) or DINO-v2 (mid-level). This is performed by regressing the output of specific codebooks/sub-codes onto these features using an $\ell_2$ objective (Bai et al., 2024). This forced semantic alignment fosters disentanglement and enables compositionality at the codebook level.

3. Mathematical Formulations

The DiVT family employs modular, mathematically grounded mechanisms for codebook learning, clustering, and disentanglement:

Codebook Factorization: Given image feature $h_{\text{base},i}$, $M$ adapters produce $h^{(m)}_i = F_m(h_{\text{base},i}) \in \mathbb{R}^D$. Each branch is quantized independently:

$q^{(m)}_i = \arg\min_{c \in C^{(m)}} \|h^{(m)}_i - c\|_2$

The concatenated or summed embedding forms $z_i$ (Bai et al., 2024).
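The per-branch quantization above can be sketched in a few lines. This is a minimal illustration in plain Python lists, not the implementation of Bai et al.: the adapters, the toy codebooks, and the function names are all assumptions chosen for readability, and concatenation is used to form $z_i$.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec in L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(vec, codebook[j]))


def factorized_quantize(h_base, adapters, codebooks):
    """Quantize one feature with M independent branches.

    adapters[m] plays the role of F_m and codebooks[m] the role of C^(m).
    Returns the per-branch indices and the concatenated embedding z.
    """
    indices, z = [], []
    for adapter, codebook in zip(adapters, codebooks):
        h_m = adapter(h_base)              # branch-specific projection
        idx = nearest_code(h_m, codebook)  # independent nearest-neighbor lookup
        indices.append(idx)
        z.extend(codebook[idx])            # concatenate the selected codes
    return indices, z


# Toy setup: two hypothetical adapters that each keep one half of a 4-d feature.
adapters = [lambda h: h[:2], lambda h: h[2:]]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
indices, z = factorized_quantize([0.9, 1.2, 0.1, 0.8], adapters, codebooks)
print(indices, z)  # [1, 0] [1.0, 1.0, 0.0, 1.0]
```

Minimizing the squared distance selects the same entry as minimizing the L2 norm, so the square root is omitted.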

Orthogonality Regularization: To enforce disentanglement between sub-codebooks:

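$\mathcal{L}_{\text{orth}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{m < m'} \left( {q^{(m)}_i}^{\top} q^{(m')}_i \right)^2$

where $N$ is the number of quantized positions.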
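One common way to penalize overlap between sub-codebooks is the squared inner product between the codes that different branches select at the same position. The sketch below assumes that form; it is an illustration of the idea and is not confirmed to be the exact regularizer used by Bai et al. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonality_penalty(branch_codes):
    """Squared cross-branch inner products, summed over branch pairs
    and averaged over positions.

    branch_codes[m][i] is the quantized vector of branch m at position i.
    The penalty is zero when every cross-branch pair is orthogonal.
    """
    num_branches = len(branch_codes)
    num_positions = len(branch_codes[0])
    total = 0.0
    for m in range(num_branches):
        for n in range(m + 1, num_branches):
            for i in range(num_positions):
                total += dot(branch_codes[m][i], branch_codes[n][i]) ** 2
    return total / num_positions


orthogonal = [[[1.0, 0.0]], [[0.0, 1.0]]]  # branches select orthogonal codes
aligned = [[[1.0, 0.0]], [[1.0, 0.0]]]     # branches select the same code
print(orthogonality_penalty(orthogonal), orthogonality_penalty(aligned))  # 0.0 1.0
```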

Pretrained Representation Alignment:

$\mathcal{L}_{\text{align}} = \frac{1}{N} \sum_{i=1}^{N} \left\| q^{(m)}_i - f_i \right\|_2^2$

with $f_i$ representing frozen backbone features.
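The $\ell_2$ regression onto frozen features can be illustrated as a mean squared distance. In this sketch the optional `project` argument is a hypothetical stand-in for whatever mapping brings a code into the target feature space, and the target vectors are toy placeholders rather than real CLIP or DINO-v2 outputs.

```python
def alignment_loss(codes, targets, project=lambda v: v):
    """Mean squared L2 distance between projected codes and frozen features."""
    total = 0.0
    for code, target in zip(codes, targets):
        pred = project(code)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(codes)


codes = [[1.0, 0.0], [0.0, 1.0]]
frozen = [[1.0, 0.0], [0.0, 0.0]]  # toy stand-ins for frozen backbone features
print(alignment_loss(codes, frozen))  # 0.5
```

Because the targets are frozen, gradients from this term flow only into the tokenizer side, which is what anchors a given sub-codebook to a particular semantic level.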

Cluster-based Tokenization and Aggregation: For clustering-based DiVT, a selection rule over pairwise cosine similarities between patch embeddings, controlled by a clustering threshold, yields centroids through greedy neighbor removal. Aggregated tokens are derived by masked cross-attention over cluster members (Lee et al., 18 May 2026).
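Greedy neighbor removal with a cosine-similarity threshold can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from Lee et al.: the centroid is simply the first unassigned patch, the threshold value is arbitrary, and mean pooling is swapped in for masked cross-attention. Patch vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_cluster(patches, threshold):
    """Pick a centroid, claim its neighbors, remove them, and repeat.

    Returns a list of clusters (lists of patch indices). The number of
    clusters adapts to the data and to the threshold.
    """
    remaining = list(range(len(patches)))
    clusters = []
    while remaining:
        centroid = remaining[0]  # hypothetical rule: first unassigned patch
        members = [centroid] + [
            i for i in remaining[1:]
            if cosine(patches[centroid], patches[i]) >= threshold
        ]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters


def pool_tokens(patches, clusters):
    """Mean-pool each cluster into one token (stand-in for cross-attention)."""
    dim = len(patches[0])
    return [
        [sum(patches[i][d] for i in members) / len(members) for d in range(dim)]
        for members in clusters
    ]


patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = greedy_cluster(patches, threshold=0.8)
tokens = pool_tokens(patches, clusters)
print(clusters)     # [[0, 1], [2, 3]]
print(len(tokens))  # 2
```

Raising the threshold toward 1 splits the image into more, smaller clusters, which is the knob behind the accuracy and compute trade-off.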

Disentangling Loss for Concept Slots:

$\mathcal{L}_{\text{dis}} = \mathrm{CE}\big(\mathrm{softmax}(\Delta^{(k)}),\, k\big)$

where $\Delta^{(k)}$ measures the effect of swapping the $k$-th slot on the concept token set (Yang et al., 2022).
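The slot-swap idea can be illustrated with a small cross-entropy computation. This sketch assumes that per-slot change magnitudes (after swapping one slot and re-tokenizing) act as logits and that the swapped index is the target class; that reading follows the description above but is not confirmed as the exact formulation of Yang et al. All names and toy values are hypothetical.

```python
import math


def slot_changes(tokens_before, tokens_after):
    """L2 change of each concept token between two tokenizations."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(t0, t1)))
        for t0, t1 in zip(tokens_before, tokens_after)
    ]


def swap_cross_entropy(changes, swapped_index):
    """Cross-entropy with per-slot changes as logits and the swapped slot as target."""
    peak = max(changes)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(c - peak) for c in changes))
    return log_norm - changes[swapped_index]


before = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
after_clean = [[0.0, 0.0], [4.0, 5.0], [2.0, 2.0]]  # only slot 1 changed
after_leaky = [[3.0, 4.0], [4.0, 5.0], [2.0, 2.0]]  # slot 0 changed as well
clean = swap_cross_entropy(slot_changes(before, after_clean), swapped_index=1)
leaky = swap_cross_entropy(slot_changes(before, after_leaky), swapped_index=1)
print(clean < leaky)  # True
```

The loss is small when only the swapped slot changes and grows when the swap leaks into other slots, which is the pressure that pushes each token toward a single independent factor.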

4. Empirical Results and Comparative Analyses

Reported quantitative evaluations favor disentangled tokenization over monolithic or entangled alternatives:

DualToken (DiVT): Simultaneously achieves high zero-shot accuracy (81.6% on ImageNet, rivaling semantic models) and strong reconstruction quality (PSNR=23.56, SSIM=0.742; best among unified tokenizers). Downstream multimodal benchmarks (e.g., MMBench: 64.5, VQAv2: 78.3, POPE: 86.1) demonstrate robust multimodal reasoning and generation (Song et al., 18 Mar 2025).
Factorized Quantization (FQGAN DiVT): On ImageNet, dual/triple codebooks (K=16,384 each) lower rFID from 2.19 (VQGAN) to 0.94/0.76 and raise PSNR from 20.8 dB to 22.7 dB. In generative contexts, factorized AR heads outperform comparable tokenizers in FID (Bai et al., 2024).
MLLM Token Budgeting (Cluster DiVT): Adaptive clustering enables major token reductions while retaining accuracy; for instance, at a setting averaging 35.7 tokens per image, accuracy on VQAv2/GQA/MMBench is within 2 points of the full 576-token baseline. Computational gains include bandwidth, memory, and latency reductions (e.g., an average of 13.5 tokens yields 71 ms prefill latency vs. 138 ms for 576 tokens) (Lee et al., 18 May 2026).
Disentanglement and Scene Decomposition: On Shapes3D, Cars3D, CLEVR, and MPI3D, unsupervised concept-token approaches achieve SOTA on FactorVAE, DCI, ARI, and MSC metrics; for example, FactorVAE score 0.957 on Shapes3D and ARI=0.923 on CLEVR (Yang et al., 2022).

5. Interpretability, Efficiency, and Token Granularity

Disentangled Visual Tokenization directly impacts interpretability, compressibility, and grounding:

Semantic Alignment: DiVT clusters or sub-codebooks naturally align with objects, parts, or scene factors; visualizations show tight correspondence between cluster tokens and salient regions (object instances, textures, semantic boundaries) (Lee et al., 18 May 2026).
Accuracy–Compute Trade-off: DiVT enables explicit modulation of token granularity via clustering thresholds or codebook splits. Token counts scale from a dozen (global structure) to hundreds (fine semantic details), letting practitioners tune for memory/latency/accuracy needs.
Attention Grounding: In MLLMs, attention maps between text tokens and DiVT visual tokens yield sharp, interpretable heatmaps. Each word token can focus on a single, semantically matched visual token, improving multi-modal explainability.
Efficiency: Token count reductions shrink the KV-cache and memory footprint and lower inference latency. For example, at a setting averaging 22 tokens, DiVT reduces the KV-cache to 3.8% of the full-projector baseline (Lee et al., 18 May 2026).
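The efficiency figures quoted in this article can be checked with simple arithmetic. The sketch assumes the visual part of the KV-cache scales linearly with the number of visual tokens; the token counts and latencies are the ones reported above, and the helper name is hypothetical.

```python
FULL_TOKENS = 576  # full-resolution baseline quoted in the text


def relative_kv_cache(visual_tokens: float, full_tokens: int = FULL_TOKENS) -> float:
    """Fraction of the visual KV-cache kept, assuming linear scaling in tokens."""
    return visual_tokens / full_tokens


for tokens in (13.5, 22, 35.7):
    print(f"{tokens:>5} tokens -> {100 * relative_kv_cache(tokens):.1f}% of baseline")

# Prefill latency figures quoted in the text, in milliseconds.
speedup = 138 / 71
print(f"prefill speedup at 13.5 tokens: {speedup:.2f}x")
```

Under the linear assumption, 22 of 576 tokens corresponds to about 3.8% of the baseline cache, which matches the reported figure. The latency gain (roughly 1.9x) is much smaller than the token reduction, so prefill time evidently includes costs that do not shrink with the visual token count.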

6. Limitations and Future Prospects

Despite substantial progress, several challenges remain for DiVT approaches:

Complexity of AR Prediction: In factorized codebook settings, autoregressive modeling of multiple tokens per patch increases generative complexity (Bai et al., 2024). As the number of sub-codebooks grows, joint modeling becomes harder.
Optimal Decomposition: Determining the ideal number and specialization of sub-codebooks or concept tokens remains task- and data-dependent. Over-fragmentation may degrade accuracy due to excessive redundancy or loss of global context.
Expressiveness vs. Simplicity: While DiVT can match or surpass baselines on many tasks, certain continuous or fine-grained visual details may be lost when tokens are excessively coarse or abstract.
Dependency on Backbone Quality: The granularity and semantic richness of DiVT tokens depend on the underlying vision encoder or pretrained backbone, and may not generalize to domains with mismatched visual statistics.

A plausible implication is that next-generation DiVT systems will increasingly combine semantic clustering, hierarchical codebook splits, and pretrained-aligned regularization to produce visual token streams that are interpretable, efficient, and closely aligned with natural language structures, setting the stage for further advances in unified multimodal modeling.

References (4)

Song et al. DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies (2025)

Bai et al. Factorized Visual Tokenization and Generation (2024)

Lee et al. A More Word-like Image Tokenization for MLLMs (2026)

Yang et al. Visual Concepts Tokenization (2022)
