Image Tokenization

Updated 31 May 2026

Image tokenization is the process of converting images into compact tokens that encapsulate semantic and structural details for efficient representation.
Advances such as CAT and GPSToken use adaptive, content-aware methods to allocate tokens based on image complexity, optimizing both compression and model performance.
Integration with transformer and diffusion models demonstrates improved reconstruction fidelity and multimodal reasoning capabilities while reducing computational overhead.

Image tokenization transforms an image into a sequence or set of tokens (compact, information-rich representations) that can be used for compression, generation, and understanding, or as an interface to transformer-based models. Advances in image tokenization have driven substantial improvements in tasks ranging from image synthesis to multimodal vision-language reasoning. This article surveys the technical underpinnings, architectural innovations, and empirical best practices of contemporary image tokenization research.

1. Foundations and Rationale

Image tokenization encodes images into lower-dimensional discrete or continuous latent representations. This transformation drastically reduces the dimensionality of image data, enabling efficient modeling and synthesis by transformers and other autoregressive or diffusion architectures. Traditional approaches typically rely on fixed-length, raster-ordered patch codes (e.g., VQ-VAE, VQ-GAN), but this paradigm is increasingly being challenged by adaptive, semantically or content-aligned methods.
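The discrete, raster-ordered patch codes mentioned above can be illustrated with a nearest-neighbor codebook lookup, which is the quantization step used by VQ-VAE-style tokenizers. This is a minimal sketch, not any cited system's implementation: the function names, the 4-entry codebook, and the toy 2D latents are all assumptions chosen for readability, and a real tokenizer would obtain the latents from a learned encoder.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        distances = [squared_distance(z, e) for e in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical codebook and a 2x2 grid of patch latents in raster order.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.7, 0.9]]

tokens = quantize(patch_latents, codebook)
print(tokens)  # [0, 1, 2, 3]
```

The resulting list of integers is the token sequence; its length is fixed by the latent grid size regardless of image content, which is the limitation that adaptive methods target.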

Content-adaptive and region-aware tokenizers—such as CAT (Shen et al., 6 Jan 2025) and GPSToken (Zhang et al., 1 Sep 2025)—allocate variable capacity to different images or image regions, recognizing that complexity and semantic salience vary widely in natural data. Typical motivations for these advances include:

Efficient utilization of compute and memory: Representing simpler images or regions with fewer tokens reduces inference and training cost.
Semantic alignment: Mapping tokens to coherent visual elements (objects, subobjects, or textures) aligns the tokenization process with downstream reasoning.
Improved fidelity/compression trade-offs: Adaptive and structured tokenization enables higher perceptual and quantitative image quality at fixed or reduced bitrate.
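To make the compute argument concrete, the sketch below counts latent positions for a 256×256 image at the 8×, 16×, and 32× compression levels reported for CAT, assuming the factor denotes per-side spatial downsampling. The routing rule, its thresholds, and the assumption that the complexity score lies in [0, 1] are hypothetical and are not taken from the CAT paper.

```python
def latent_token_count(height: int, width: int, factor: int) -> int:
    """Number of latent positions when each side is downsampled by `factor`."""
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the compression factor")
    return (height // factor) * (width // factor)


def choose_factor(complexity: float) -> int:
    """Hypothetical routing rule: simpler images get coarser latents.

    The thresholds 0.33 and 0.66 are illustrative assumptions only.
    """
    if complexity < 0.33:
        return 32
    if complexity < 0.66:
        return 16
    return 8


counts = {f: latent_token_count(256, 256, f) for f in (8, 16, 32)}
print(counts)  # {8: 1024, 16: 256, 32: 64}
```

Because self-attention cost grows faster than linearly in sequence length, moving a simple image from 1024 to 64 latent positions saves considerably more than the 16× reduction in token count alone would suggest.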

2. Tokenization Architectures and Mechanisms

Modern tokenization systems employ diverse architectural strategies, each designed to exploit distinct structural priors. Key classes include:

Content-adaptive latent assignment: CAT (Shen et al., 6 Jan 2025) couples a nested VAE—capable of emitting latents at multiple spatial resolutions—with LLM-driven complexity scoring. Images are routed to encoder paths yielding 8×, 16×, or 32× compression, so computational resources scale with perceived complexity.
Spatially adaptive, region-based tokens: GPSToken (Zhang et al., 1 Sep 2025) initializes tokens as texture-homogeneous regions via an entropy-driven partition and parameterizes each as a 2D Gaussian (with mean, covariance, and texture code), refined by a transformer to capture local geometry and content. GaussianToken (Dong et al., 26 Jan 2025) pursues a similar approach with grid-based Gaussian splatting, demonstrating substantial improvements over standard VQ-GAN baselines.
Language- or semantics-conditioned schemes: TexTok (Zha et al., 2024) incorporates caption-derived linguistic tokens into the tokenization process, allowing semantic content to be captured explicitly and learned by self-attention with image patches and learnable tokens.
Permutation-invariant and set-based methods: TokenSet (Geng et al., 20 Mar 2025) represents the image as an unordered set of codebook tokens, mapping this set to a count vector. Fixed-Sum Discrete Diffusion enforces both discreteness and fixed-sum constraints for generative modeling.
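The set-to-count-vector mapping described for TokenSet can be sketched as a histogram over codebook indices. The function name and the toy token values are assumptions; the sketch only illustrates why the representation is order-invariant and why its entries sum to the number of tokens, which is the fixed-sum property the generative model has to respect.

```python
from collections import Counter
from typing import List, Sequence


def tokens_to_counts(tokens: Sequence[int], codebook_size: int) -> List[int]:
    """Convert a bag of token indices into a per-code count vector."""
    histogram = Counter(tokens)
    return [histogram.get(code, 0) for code in range(codebook_size)]


tokens = [3, 0, 3, 1, 3, 0]
shuffled = [0, 3, 1, 3, 0, 3]  # same multiset, different order

counts = tokens_to_counts(tokens, codebook_size=4)
print(counts)                                      # [2, 1, 0, 3]
print(counts == tokens_to_counts(shuffled, 4))     # True: order does not matter
print(sum(counts) == len(tokens))                  # True: fixed-sum constraint
```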

Other significant strategies include 1D sequence tokenizers (TiTok (Yu et al., 2024), Instella-T2I (Wang et al., 26 Jun 2025)), binary spherical quantization (BSQ (Zhao et al., 2024)), subobject-level tokenization via semantic segmentation and embedding (Chen et al., 2024), hierarchical multi-scale tokens for super-resolution (Hadji et al., 14 May 2026), and variable-length encoding through nested truncation or recurrent allocation (Duggal et al., 2024, Miwa et al., 17 Jan 2025, Fu et al., 4 Jan 2026).

3. Training Protocols and Objectives

Tokenization models are trained using objectives tailored for both fidelity and robustness:

Reconstruction losses: $\ell_1$, $\ell_2$, or perceptual feature losses (VGG, LPIPS, MoCo-v2) to ensure accurate pixel (or feature) recovery.
Compression and regularization: KL-divergence terms for variational autoencoders (KL-VAEs, MacTok (Zeng et al., 31 Mar 2026)), commitment losses for vector quantizers (VQ, RQ, MSVQ), codebook utilization or entropy regularizers, and adversarial losses (Patch-GAN, DINOv2-based discriminators).
Semantic or feature knowledge distillation: Tokenizers may be trained to reconstruct features from powerful, pretrained vision encoders (e.g., CLIP, DINO) rather than pixels directly—see VQ-KD (Wang et al., 2024).
Masking and alignment regularizers: MacTok applies random and DINO-guided masking to force compact encoders to preserve essential semantics; ReToK (Fu et al., 4 Jan 2026) employs hierarchical regularization to align decoding features with pretrained semantic representations.
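For orientation, the reconstruction and commitment terms listed above combine in the standard VQ-VAE objective as written below. The notation is the conventional one and is an assumption relative to this article: $x$ is the input image, $\hat{x}$ its reconstruction, $z_e(x)$ the encoder output, $e$ the selected codebook vector, $\mathrm{sg}[\cdot]$ the stop-gradient operator, and $\beta$ the commitment weight.

$$
\mathcal{L}_{\mathrm{VQ}} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta \, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2
$$

The second term moves codebook vectors toward encoder outputs, and the third keeps the encoder committed to its chosen code. The tokenizers surveyed here typically add perceptual, adversarial, or distillation terms on top, with weights that vary by paper.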

For variable-length or adaptive methods, recurrent/iterative training schedules (ALIT (Duggal et al., 2024)), tail-drop regularization (One-D-Piece (Miwa et al., 17 Jan 2025)), or LLM-guided complexity scores (CAT) are employed to ensure critical information is concentrated in the most useful or “earliest” token positions.
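The idea behind tail-drop regularization can be sketched as random prefix truncation during training: the decoder sees only the first k tokens, so the model is pushed to place the most important information early. This is an illustrative sketch under stated assumptions, not the One-D-Piece implementation; the uniform sampling of the kept length and the `min_keep` parameter are hypothetical choices.

```python
import random
from typing import List, Sequence


def tail_drop(tokens: Sequence[int], rng: random.Random, min_keep: int = 1) -> List[int]:
    """Keep a random-length prefix of the token sequence, dropping the tail."""
    if not 1 <= min_keep <= len(tokens):
        raise ValueError("min_keep must be between 1 and len(tokens)")
    keep = rng.randint(min_keep, len(tokens))  # inclusive on both ends
    return list(tokens[:keep])


rng = random.Random(0)          # seeded for reproducibility
sequence = list(range(16))      # stand-in for a 16-token 1D latent sequence
truncated = tail_drop(sequence, rng, min_keep=4)

# The kept tokens are always a prefix of the original sequence.
print(len(truncated), truncated == sequence[:len(truncated)])
```

At inference time the same property allows a user to cut the sequence at any length and still decode a usable image, trading quality for token count.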

4. Integration with Generative and Multimodal Models

Tokenized latents serve as the interface for transformer-based or diffusion generative models. The connection is direct and architecture-dependent:

Diffusion transformers: CAT (Shen et al., 6 Jan 2025) and GPSToken (Zhang et al., 1 Sep 2025) convert image content to variable-length or spatially-adaptive latents, which are consumed by class-conditional diffusion transformers (DiT, SiT-XL/2). These systems achieve state-of-the-art FID scores with reduced compute (CAT: FID = 4.56, throughput +18.5%).
Autoregressive (AR) and masked models: SFTok (Rao et al., 18 Dec 2025) demonstrates that multi-step, self-forcing discrete tokenization narrows the gap to continuous latents even at compact code lengths (rFID = 1.21 at 64 tokens). TokenSet (Geng et al., 20 Mar 2025) introduces a diffusion model for discrete sets.
Multimodal integration: Disentangled Visual Tokenization (DiVT (Lee et al., 18 May 2026)) adapts token count and granularity for compatibility with LLMs, clustering patch embeddings into discrete, semantic “visual words.” TexTok (Zha et al., 2024) enables direct text-image joint tokenization, and subobject-level tokenization (Chen et al., 2024) substantially accelerates and improves multimodal learning.

Across these systems, adaptive or semantically structured tokens are reported to reduce compute, memory, and latency substantially, while preserving and often improving downstream generation performance.
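Clustering patch embeddings into a variable number of visual tokens, as described for the multimodal setting above, can be illustrated with a simple greedy threshold rule. This is a hypothetical sketch and not the DiVT algorithm: the cosine-similarity criterion, the seed-based assignment, and the toy 2D embeddings are all assumptions. It shows only how the token count becomes content-dependent and how strongly it depends on the chosen threshold.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_patches(embeddings: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Assign each patch to the first cluster whose seed is similar enough,
    otherwise start a new cluster seeded by that patch."""
    seeds: List[Sequence[float]] = []
    labels: List[int] = []
    for vec in embeddings:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:  # no existing cluster matched
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
labels = cluster_patches(patches, threshold=0.9)
print(labels)            # [0, 0, 1, 1, 0]
print(len(set(labels)))  # 2 tokens for 5 patches
```

Raising the threshold to 0.999 on the same input yields five clusters, one per patch, which shows how such a scheme trades token count against granularity.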

5. Quantitative Performance and Evaluation

Performance of tokenization systems is consistently benchmarked using Fréchet Inception Distance (FID, rFID for reconstruction, gFID for generation), PSNR, SSIM, LPIPS, and inference throughput. Typical findings:

Method	Token Count	rFID	gFID	Throughput/Speedup	Notes
CAT (Shen et al., 6 Jan 2025)	Variable	< fixed-16×	4.56	+18.5%	LLM-complexity, adaptive VAE
GPSToken (Zhang et al., 1 Sep 2025)	128	0.65	1.50	3–5× faster convergence	2D Gaussian, spatially-adaptive
SFTok (Rao et al., 18 Dec 2025)	64	1.21	2.29	Compact AR, MaskGIT	Multi-step, self-forcing discrete
TiTok (Yu et al., 2024)	32–128	1.97–2.21	2.13 (512²)	74–410× vs. diffusion	1D sequence, highly efficient
One-D-Piece (Miwa et al., 17 Jan 2025)	8–256	1.08 (256)	–	Variable rate, user control	Tail-drop, head info concentration
TexTok (Zha et al., 2024)	32–256	1.46 (256)	1.46	93.5× (512²)	Language-guided, ViT/T5 backbone
MacTok (Zeng et al., 31 Mar 2026)	64/128	0.75/0.43	1.44/1.52	Up to 64× token reduction	Masked, DINO-aligned continuous
Subobject-level (Chen et al., 2024)	20–50 (avg)	–	–	2× faster convergence	Object/region semantic segments

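These results suggest that adaptive and structured tokenization can improve both reconstruction and generation quality at fixed or reduced computational cost. The figures are drawn from different papers with differing resolutions, backbones, and training budgets, so they are not directly comparable across rows.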
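Of the metrics listed above, PSNR is the simplest to compute directly: it is the ratio of the squared peak pixel value to the mean squared error, expressed in decibels. The sketch below assumes 8-bit images flattened to lists of pixel values; the function name and sample values are illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [0.0, 64.0, 128.0, 255.0]
decoded = [1.0, 63.0, 129.0, 254.0]   # every pixel off by one, so MSE = 1

value = psnr(original, decoded)
print(round(value, 2))  # 48.13
```

Unlike PSNR, FID and its rFID and gFID variants compare feature statistics over whole image sets, so they require a pretrained Inception network and cannot be computed from a single image pair.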

6. Advances in Adaptivity, Semantics, and Multiscale Structure

Increasingly, image tokenization is moving away from uniform, grid-based patch codes toward methods that:

Adapt token count and content to image semantics (CAT (Shen et al., 6 Jan 2025), DiVT (Lee et al., 18 May 2026), ALIT (Duggal et al., 2024)).
Disentangle spatial layout from texture or semantic content (GPSToken (Zhang et al., 1 Sep 2025), COMiT (Davtyan et al., 24 Feb 2026)).
Cluster patches or regions into higher-level entities (Subobject-level (Chen et al., 2024), DiVT).
Represent images at multiple scales within a single coding scheme (Hierarchical Image Tokenization (Hadji et al., 14 May 2026), Spectral Image Tokenizer (Esteves et al., 2024)).

These developments enable a range of capabilities, including on-the-fly semantic/compute trade-offs, resolution-agnostic coding, partial and progressive decoding, and enhanced interpretability and compositional generalization.

7. Practical Impacts, Limitations, and Open Directions

The emergence of content-adaptive, semantic, and set-based image tokenization unlocks advances in compression, generation, and multimodal representation. Variable-length tokenizations allow quality- or bandwidth-adaptive image transmission. Object/part-level tokens improve sample efficiency in vision-language learning and accelerate convergence on reasoning tasks, as seen in subobject tokenization (Chen et al., 2024).
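Bandwidth-adaptive transmission with a variable-length discrete tokenizer reduces to simple arithmetic when tokens are sent as fixed-length codebook indices: each token costs ceil(log2(K)) bits for a codebook of size K. The sketch below picks the longest token prefix that fits a bit budget. The codebook size, budget, and function names are hypothetical, and entropy coding, which would lower the cost per token, is ignored.

```python
def bits_per_token(codebook_size: int) -> int:
    """Bits needed to index one codebook entry with fixed-length codes."""
    if codebook_size < 1:
        raise ValueError("codebook_size must be positive")
    # (K - 1).bit_length() equals ceil(log2(K)) for K >= 2; use at least 1 bit.
    return max(1, (codebook_size - 1).bit_length())


def tokens_within_budget(max_tokens: int, codebook_size: int, budget_bits: int) -> int:
    """Largest token-prefix length whose encoding fits within the bit budget."""
    return min(max_tokens, budget_bits // bits_per_token(codebook_size))


# Hypothetical setting: up to 256 tokens, 4096-entry codebook, 1024-bit budget.
prefix_len = tokens_within_budget(max_tokens=256, codebook_size=4096, budget_bits=1024)
print(bits_per_token(4096), prefix_len)  # 12 85
```

A sender could therefore transmit the first 85 tokens under this budget and let the receiver decode a coarser image, provided the tokenizer was trained so that prefixes remain decodable.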

Limitations include dependence on pretrained backbones (DINO, CLIP), sensitivity to clustering thresholds or complexity estimators, and a trade-off between interpretability and pure reconstruction quality. The field is moving toward joint optimization with downstream tasks, as well as more principled integration of perceptual, semantic, compression, and reasoning objectives.

References:

CAT: Content-Adaptive Image Tokenization (Shen et al., 6 Jan 2025)
GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization (Zhang et al., 1 Sep 2025)
Language-Guided Image Tokenization (TexTok) (Zha et al., 2024)
Image and Video Tokenization with Binary Spherical Quantization (Zhao et al., 2024)
MacTok: Robust Continuous Tokenization (Zeng et al., 31 Mar 2026)
SFTok: Bridging the Performance Gap in Discrete Tokenizers (Rao et al., 18 Dec 2025)
Tokenize Image as a Set (Geng et al., 20 Mar 2025)
Instella-T2I: 1D Binary Tokenization (Wang et al., 26 Jun 2025)
GaussianToken: 2D Gaussian Splatting (Dong et al., 26 Jan 2025)
A More Word-like Image Tokenization for MLLMs (DiVT) (Lee et al., 18 May 2026)
Image Understanding Makes for A Good Tokenizer (VQ-KD) (Wang et al., 2024)
XQ-GAN: Open-source Image Tokenization (Li et al., 2024)
One-D-Piece: Quality-Controllable Compression (Miwa et al., 17 Jan 2025)
An Image is Worth 32 Tokens (TiTok) (Yu et al., 2024)
ReToK: Improving Flexible Tokenizers (Fu et al., 4 Jan 2026)
COMiT: Communication-Inspired Tokenization (Davtyan et al., 24 Feb 2026)
Spectral Image Tokenizer (Esteves et al., 2024)
Subobject-level Image Tokenization (Chen et al., 2024)
Adaptive Length Image Tokenization via Recurrent Allocation (ALIT) (Duggal et al., 2024)
Hierarchical Image Tokenization for Multi-Scale Super Resolution (Hadji et al., 14 May 2026)