
Visual Tokenization Phase in Multimodal LLMs

Updated 17 June 2026
  • Visual Tokenization Phase is the process of converting images into adaptable, discrete tokens that enable unified processing in language-vision models.
  • It employs a multi-stage pipeline including patch extraction, dynamic token selection, token merging, and vector quantization to generate semantically meaningful representations.
  • Adaptive token retention and a learnable codebook ensure content-dependent sequence lengths and robust multimodal integration for efficient inference.

Visual Tokenization Phase

Visual tokenization converts an image into a sequence of discrete tokens, of fixed or variable length, that an LLM can ingest in a unified language–vision pipeline. In “Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization,” visual tokenization generates compact, semantically rich, discrete representations from image data so that images and text can be handled uniformly within a single generative, autoregressive LLM framework. The approach removes the usual separation between modalities, adjusts sequence length dynamically to image content, and enables scalable multimodal understanding and generation within the foundation model LaVIT (Jin et al., 2023).

1. Architecture and Processing Pipeline

The visual tokenization phase is founded on a multi-stage pipeline tightly coupled with a backbone vision encoder (frozen ViT, e.g., EVA-CLIP ViT-G/14):

  • Patch Extraction and Encoding: The input image $x \in \mathbb{R}^{H \times W \times 3}$ is split into non-overlapping $P \times P$ patches, forming $N = HW/P^2$ tokens. The frozen ViT transforms these into patch embeddings $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^D$.
  • Dynamic Token Selection: A lightweight MLP head computes per-patch retention probabilities $\pi_i$, from which binary selection masks $M_i \in \{0,1\}$ are drawn via Gumbel-Softmax sampling. Only patches with $M_i = 1$ are retained, effectively sparsifying the token sequence based on learned visual saliency.
  • Token Merging: Retained tokens $X_r$ undergo $L$ stacked Transformer blocks, each comprising causal self-attention and cross-attention with the dropped tokens $X_d$. The merger integrates information from dropped to retained entries, producing high-level image part features $\hat{X}_r = \{\hat{x}_1, \dots, \hat{x}_T\}$, where $T$ is the number of retained tokens, that abstract regional semantics rather than raw patches.
  • Vector Quantization: Each $\hat{x}_i$ is assigned to its nearest codebook prototype in a learnable codebook $\mathcal{C} = \{c_k\}_{k=1}^{K}$ with $K$ prototypes. Discrete visual token IDs $z_1, \dots, z_T$ are emitted. These IDs are treated analogously to word-piece tokens by the downstream LLM.
  • Interface to LLM: The visual token ID sequence is bracketed with [IMG]...[/IMG] markers and concatenated with text token IDs. This unified sequence feeds into an autoregressive LLM (e.g., LLaMA), training the model on a single cross-entropy objective spanning both visual and textual tokens.

This architecture ensures both images and text can be processed in a unified, generative, transformer-based learning and inference paradigm (Jin et al., 2023).
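The patch count $N = HW/P^2$ from the first pipeline step can be checked with a short sketch. The 224×224 input size is an assumption chosen for illustration (the text only fixes the patch size through the ViT-G/14 example), and the helper names `patch_grid` and `patch_boxes` are hypothetical.

```python
def patch_grid(height, width, patch):
    """Return (rows, cols, N) for non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


def patch_boxes(height, width, patch):
    """Yield (top, left, bottom, right) pixel boxes in row-major order."""
    rows, cols, _ = patch_grid(height, width, patch)
    for r in range(rows):
        for c in range(cols):
            yield (r * patch, c * patch, (r + 1) * patch, (c + 1) * patch)


# Assumed example: a 224x224 image with 14x14 patches.
rows, cols, n = patch_grid(224, 224, 14)
boxes = list(patch_boxes(224, 224, 14))
print(rows, cols, n)  # 16 16 256
```

Under these assumptions the encoder sees 256 patch embeddings before any token selection, which is the upper bound on the visual sequence length for that image.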

2. Dynamic Sequence Length: Adaptive Retention

A distinctive feature is length-adaptive tokenization. The MLP selector produces per-patch selection probabilities $\pi_i$, with the retention mask $M_i \in \{0,1\}$ sampled differentiably via Gumbel-Softmax. The expected retention rate $\bar{r} = \frac{1}{N}\sum_{i=1}^{N} M_i$ is regularized toward a fixed target $\rho$ via a penalty in the loss, but the precise number of retained tokens $T = \sum_{i=1}^{N} M_i$ is data-dependent. This results in dynamic visual-token sequence lengths tightly matched to the image’s content and complexity.

No hard threshold is imposed on logits; all stages are fully differentiable. Content that is more semantically complex or information-rich yields longer sequences, whereas simpler images are aggressively compressed, yielding robust length–fidelity tradeoffs.
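A minimal sketch of how a keep/drop mask might be sampled with a two-class Gumbel-Softmax, written in plain Python so that it runs without dependencies. The function names, the temperature `tau`, and the example probabilities are illustrative assumptions; in an autograd framework the hard decision would typically be paired with a straight-through gradient through the soft value, which this sketch does not model.

```python
import math
import random


def gumbel_noise(rng):
    """Sample from Gumbel(0, 1) by inverse transform."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def sample_keep_mask(keep_probs, tau=1.0, seed=0):
    """Return (soft, hard) keep decisions, one per patch.

    soft[i] is the relaxed Gumbel-Softmax weight of the "keep" class;
    hard[i] is the corresponding binary decision (1 = keep, 0 = drop).
    """
    rng = random.Random(seed)
    soft, hard = [], []
    for p in keep_probs:
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        keep = (math.log(p) + gumbel_noise(rng)) / tau
        drop = (math.log(1.0 - p) + gumbel_noise(rng)) / tau
        peak = max(keep, drop)  # subtract the max for numerical stability
        e_keep, e_drop = math.exp(keep - peak), math.exp(drop - peak)
        s = e_keep / (e_keep + e_drop)
        soft.append(s)
        hard.append(1 if s >= 0.5 else 0)
    return soft, hard


# Hypothetical retention probabilities for six patches.
probs = [0.95, 0.05, 0.60, 0.30, 0.99, 0.01]
soft, hard = sample_keep_mask(probs, tau=0.5, seed=0)
num_kept = sum(hard)  # data-dependent sequence length
```

Because the mask is sampled per patch, `num_kept` varies from image to image even when the average retention rate is held near a target.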

3. Codebook, Token Semantics, and Visual Language

The quantizer maintains a learnable codebook $\mathcal{C} = \{c_k\}_{k=1}^{K}$ of $d$-dimensional prototypes. After Stage-1 training, image patches co-clustered under a given codebook index correspond to regions of high-level semantic similarity (e.g., all “wheel-like” regions). Because token merging occurs before quantization, each discrete token typically encapsulates a merged, semantically meaningful “visual part” (e.g., “head,” “dog face,” “sky”), offering much more abstraction than raw patch VQ. This codebook thus underpins a discrete “visual language” interpretable by the LLM, with tokens analogous to words or subwords (Jin et al., 2023).

4. Training Objectives and Optimization

The tokenizer is trained via a combination of reconstruction and regularization objectives:

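  • Visual Reconstruction Loss: The quantized embeddings for retained tokens are decoded through a lightweight decoder to reconstruct all original patch embeddings $X = \{x_1, \dots, x_N\}$. The loss is the mean of one minus the cosine similarity between each original embedding $x_i$ and its reconstruction $x_i^{\text{rec}}$:

$$\mathcal{L}_{\text{rec}} = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \cos\left(x_i, x_i^{\text{rec}}\right)\right)$$

  • Retention Rate Penalty: To drive expected sequence length towards the target $\rho$:

$$\mathcal{L}_{\text{rate}} = \left(\rho - \bar{r}\right)^2$$

with $\bar{r} = \frac{1}{N} \sum_{i=1}^{N} M_i$.

  • Total Tokenizer Loss:

$$\mathcal{L}_{\text{tokenizer}} = \mathcal{L}_{\text{rec}} + \lambda \, \mathcal{L}_{\text{rate}}$$

where $\lambda$ weights the retention penalty.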

No explicit VQ-VAE-style commitment or embedding loss is required; the reconstruction objective naturally shapes the codebook. Hard assignment within the codebook (nearest neighbor) is used, minimizing quantization entropy collapse.
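The two objectives described in this section can be sketched as follows: a mean of one minus cosine similarity for reconstruction, plus a penalty on the gap between the realized retention rate and a target. The squared form of the penalty, the weight `lam`, and the numeric values of `rho` and `lam` are assumptions made for illustration, and `tokenizer_loss` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def tokenizer_loss(x, x_rec, mask, rho, lam):
    """Return (total, rec, rate) for one image.

    rec  : mean of (1 - cosine similarity) over all N patch embeddings
    rate : squared gap between the target rho and the kept fraction
    """
    n = len(x)
    rec = sum(1.0 - cosine(a, b) for a, b in zip(x, x_rec)) / n
    rate = (rho - sum(mask) / n) ** 2
    return rec + lam * rate, rec, rate


# Toy example: three patches reconstructed exactly, one orthogonally.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
x_rec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
mask = [1, 0, 1, 0]  # half of the patches kept
total, rec, rate = tokenizer_loss(x, x_rec, mask, rho=0.25, lam=2.0)
```

In this toy case the reconstruction term is about 0.25 (one of four patches is fully mismatched) and the rate term is 0.0625, since half of the patches are kept against an assumed target of one quarter.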

5. Merging, Quantization and LLM Unification

Each Transformer merger block involves two attention mechanisms:

  • Causal Self-Attention on Retained Tokens: Builds an autoregressive ordering and interaction among retained tokens only.
  • Cross-Attention to Dropped Tokens: Each retained token aggregates context from similar dropped tokens, with attention computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{D}}\right) V$$

where $Q = X_r W_Q$, $K = X_d W_K$, $V = X_d W_V$.
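As a rough illustration of the cross-attention step, the sketch below lets each retained token attend over the dropped tokens with scaled dot-product attention. It assumes queries come from retained tokens and keys and values from dropped tokens, takes all projection matrices as identity for brevity, and uses a single head; `cross_attend` is a hypothetical helper, not the model's actual layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(retained, dropped):
    """Return (contexts, weights): one context vector per retained token.

    Assumption: the query, key, and value projections are identity, so
    queries are the retained vectors and keys/values the dropped ones.
    """
    if not dropped:
        raise ValueError("cross-attention needs at least one dropped token")
    dim = len(retained[0])
    contexts, weights = [], []
    for q in retained:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in dropped]
        w = softmax(scores)
        ctx = [sum(w[j] * dropped[j][c] for j in range(len(dropped)))
               for c in range(dim)]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


retained = [[1.0, 0.0], [0.0, 1.0]]
dropped = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
contexts, weights = cross_attend(retained, dropped)
```

Each retained token places the most weight on the dropped token it is most similar to, which is the sense in which information from discarded patches is folded back into the kept ones.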

Quantization then assigns each merged feature $\hat{x}_i$ to its nearest element $c_k$ of the codebook, emitting the index as the visual token ID $z_i$:

$$z_i = \arg\min_{k} \left\lVert \hat{x}_i - c_k \right\rVert_2$$

This process results in a stream of discrete visual token IDs suitable for direct ingestion by LLMs. During LLM pretraining and inference, these visual tokens are concatenated with text, enabling the LLM to generate, comprehend, and reason over visual and linguistic content in a shared token space (Jin et al., 2023).
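Nearest-neighbor assignment to a codebook can be sketched in a few lines. The four-entry codebook and the feature values are toy data invented for illustration, and plain squared Euclidean distance is assumed as the notion of "nearest".

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    ids = []
    for f in features:
        # Squared Euclidean distance to every prototype.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        ids.append(min(range(len(codebook)), key=dists.__getitem__))
    return ids


# Toy codebook with four 2-D prototypes and three merged features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]
token_ids = quantize(merged, codebook)
print(token_ids)  # [1, 2, 0]
```

The resulting integers are what the language model consumes; the continuous features themselves are not passed downstream.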

6. Pseudocode and Implementation Details

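The core algorithm, summarized from the pipeline stages described in the preceding sections, is as follows:

    Input: image x, frozen ViT, selector MLP, merger (L blocks), codebook C
    1. X   <- ViT(patchify(x))                  # N patch embeddings
    2. pi  <- MLP(X)                            # per-patch retention probabilities
    3. M   <- GumbelSoftmax(pi)                 # binary keep/drop mask
    4. X_r <- {x_i : M_i = 1};  X_d <- {x_i : M_i = 0}
    5. X^_r <- Merger(X_r, X_d)                 # causal self-attention + cross-attention
    6. z_i <- argmin_k || x^_i - c_k ||         # discrete visual token IDs
    7. (training only) decode the quantized embeddings, reconstruct X,
       and minimize reconstruction loss + retention-rate penalty
    Output: z_1, ..., z_T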

After training, the tokenizer is frozen. In Stage 2, all images are converted to visual token sequences, concatenated with text tokens, and used for unified autoregressive LLM training.
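One way the Stage-2 input sequence could be assembled is sketched below. Everything numeric here is an assumption for illustration: the text vocabulary size, the choice to offset visual IDs past the text vocabulary so the two ranges do not collide, the toy codebook size, and the integer IDs given to the [IMG] and [/IMG] markers. `build_sequence` is a hypothetical helper.

```python
def build_sequence(text_ids, visual_ids, text_vocab_size, img_open, img_close):
    """Concatenate text IDs with a bracketed, offset span of visual IDs."""
    # Shift visual IDs so they occupy a range disjoint from text IDs.
    shifted = [text_vocab_size + v for v in visual_ids]
    return list(text_ids) + [img_open] + shifted + [img_close]


TEXT_VOCAB = 32000                 # assumed text vocabulary size
CODEBOOK_SIZE = 4                  # toy codebook size
IMG_OPEN = TEXT_VOCAB + CODEBOOK_SIZE        # hypothetical ID for [IMG]
IMG_CLOSE = TEXT_VOCAB + CODEBOOK_SIZE + 1   # hypothetical ID for [/IMG]

text_ids = [1, 523, 88]            # made-up text token IDs
visual_ids = [1, 2, 0]             # tokenizer output for one image
sequence = build_sequence(text_ids, visual_ids, TEXT_VOCAB, IMG_OPEN, IMG_CLOSE)
print(sequence)  # [1, 523, 88, 32004, 32001, 32002, 32000, 32005]
```

With a single ID space, the same next-token cross-entropy objective applies at every position, whether the target is a text token or a visual token.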

7. Implications and Advances

This visual tokenization phase enables robust, scalable multimodal modeling by:

  • Producing compact visual token sequences at adaptive lengths, suitable for direct language-model integration.
  • Abstracting local image content into higher-level, semantically meaningful tokens acting as discrete “words” in visual language.
  • Enabling end-to-end, autoregressive training and inference across both modalities with a single cross-entropy loss.
  • Supporting use cases—including understanding, generation, and vision–language reasoning—without the need for separate vision and text processing streams.

The LaVIT paper reports that this approach outperforms prior models by large margins on diverse multimodal benchmarks (Jin et al., 2023).

References (1)
