- The paper introduces Selftok, a discrete visual tokenizer that generates autoregressive tokens through a novel diffusion-based approach for unified vision-language reasoning.
- Its methodology leverages an encoder, quantizer, and diffusion decoder with a custom token schedule to enforce causal dependencies and optimize image reconstruction.
- Empirical results demonstrate state-of-the-art ImageNet reconstruction and improved performance in visual reasoning, image editing, and reinforcement learning applications.
This paper, "Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning" (2505.07538), introduces Selftok, a novel discrete visual tokenizer designed to generate autoregressive (AR) visual tokens. Unlike traditional methods that rely on spatial priors (where tokens directly map to image patches), Selftok completely discards this prior. The core motivation is to create visual tokens that are fundamentally compatible with the discrete autoregressive architecture of LLMs, thereby enabling the creation of unified Vision-LLMs (VLMs) that can leverage the well-established training paradigms and emergent capabilities (like reasoning via Reinforcement Learning) of LLMs.
The paper argues against using continuous representations (cAR) or discrete spatial tokens for building unified AR VLMs. Continuous tokens complicate training (MSE regression is less stable than cross-entropy), hinder reinforcement learning (an infinite state-action space), and are less conducive to disentanglement. Spatial tokens, despite being discrete, have non-AR causal dependencies (the collider effect), which violates the policy-improvement optimality required for effective Reinforcement Learning (RL) and makes them incompatible with the causal structure of AR models.
How Selftok Works
Selftok is designed to encode an image I into a sequence of K discrete tokens V_K = [v_1, …, v_K] such that they conform to an AR prior, allowing reconstruction of the image. The key innovation is composing the AR constraint into the reconstruction objective by leveraging the recursive nature of the reverse diffusion process.
The reverse diffusion process describes the transformation from noise (x_0) to a clean image (x_1) over time t ∈ [0, 1]. This process can be viewed recursively: the path from a noisy intermediate state x_t to the clean image x_1 is a sub-problem of the full path from x_0 to x_1. Selftok establishes a correspondence between the AR recursion P(V_K) = P(V_{<i}) · P(V_i ∣ V_{<i}) (where V_i denotes the token suffix starting at index i) and the diffusion recursion x_1 = x_t + ∫_t^1 v_s(x_s) ds.
The Selftok training objective (Eq. 9) minimizes the expected squared error between the image x_1 and the reconstruction produced by a decoder, conditioned on a noisy intermediate state x_t (sampled from q(x_t ∣ x_1)) and a suffix of tokens V_{k(t)} = [v_{k(t)}, …, v_K]. The index k(t) is determined by a token schedule mapping continuous time t to a token index. This recursive formulation forces the tokens V_{k(t)} to encode the information needed to complete the diffusion path from x_t to x_1, thereby implicitly learning an AR structure in which later tokens depend on earlier ones (represented by x_t). The paper shows this induces a causal graph suitable for AR learning, using x_0 as an instrumental variable (Eq. 7).
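Spelled out in the notation above, the objective amounts to roughly the following (a reconstruction from the description of Eq. 9; any per-timestep weighting used in the paper is omitted):

```latex
\min_{\mathrm{Enc},\,\mathrm{Dec}}\;
\mathbb{E}_{t \sim \mathcal{U}[0,1]}\,
\mathbb{E}_{x_t \sim q(x_t \mid x_1)}
\big\| \, x_1 - \mathrm{Dec}\big(x_t,\, V_{k(t)}\big) \,\big\|^2,
\qquad V_{k(t)} = [\,v_{k(t)}, \ldots, v_K\,].
```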
Empirical evidence for Selftok's AR property comes from analyzing the entropy of next-token predictions under different token orderings (Figure 1), which shows the segmented, decreasing trends characteristic of AR sequences.
Implementation Details
The Selftok tokenizer consists of an Encoder, a Quantizer, and a Decoder, leveraging a dual-stream transformer architecture similar to MMDiT.
- Encoder: Takes a VAE latent of the input image and learnable continuous token embeddings. A dual-stream transformer processes image patches and token embeddings, interacting via co-attention. Token-aware adaptive layer normalization (AdaLN) differentiates token embeddings. The output is K continuous token embeddings.
- Quantizer: Compresses the continuous embeddings through a bottleneck layer and quantizes them by mapping each to the closest word in a learnable codebook (≈32,000 words, dimension D′=16). A straight-through estimator handles backpropagation. Quantization loss (commitment + entropy) and EMA updates with dead-code reactivation are used for training the codebook.
- Decoder: A diffusion model initialized from SD3 weights. It's a dual-stream transformer conditioned on the noisy image latent xt and the quantized token embeddings Vk(t). The token stream weights are trained from scratch. Timestep conditioning is applied via AdaLN in the image stream.
- Training Objective: Jointly optimizes encoder and decoder parameters using the reconstruction loss ∥x_1 − Dec(x_t, V_{k(t)})∥² averaged over uniform time sampling, plus the quantization loss; a minimal sketch of one training step appears after this list. A re-weighting mechanism for token embeddings compensates for imbalanced gradient updates caused by the token schedule.
- Token Schedule k(t): Maps diffusion time t to the starting index of the token suffix Vk(t). While a uniform schedule k(t)=t×K+1 is theoretically ideal for AR, an empirically better custom schedule allocates fewer tokens to small t, aligning with the observation that early diffusion steps have less impact on reconstruction quality.
- One-step Renderer: After training the tokenizer, a separate renderer is trained to reconstruct the image from the full token sequence VK in a single forward pass. It's initialized from the decoder and trained with MSE, LPIPS, and GAN losses (Eq. 12) to improve speed and perceptual quality.
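The following is a minimal sketch of one tokenizer training step, tying the pieces above together. The encoder and decoder are treated as black boxes; the interpolant used for q(x_t | x_1), the loss weight, and the uniform schedule are illustrative placeholders, and the custom schedule, entropy loss, EMA codebook updates, and re-weighting are omitted:

```python
import torch
import torch.nn.functional as F

def uniform_schedule(t: float, K: int) -> int:
    """Map diffusion time t in [0, 1] to the starting token index k(t).
    Only the uniform variant k(t) = t*K + 1 is shown; the paper's preferred
    custom schedule allocates fewer tokens to small t."""
    return min(int(t * K) + 1, K)

def quantize(z_cont, codebook):
    """Nearest-codeword lookup with a straight-through estimator.
    z_cont: (K, D') continuous token embeddings, codebook: (V, D')."""
    dists = torch.cdist(z_cont, codebook)           # (K, V) pairwise distances
    idx = dists.argmin(dim=-1)                      # discrete token ids v_1..v_K
    z_q = codebook[idx]                             # quantized embeddings
    # Straight-through: forward pass uses z_q, gradients flow back to z_cont.
    z_st = z_cont + (z_q - z_cont).detach()
    commit_loss = F.mse_loss(z_cont, z_q.detach())  # commitment term (entropy term omitted)
    return idx, z_st, commit_loss

def training_step(encoder, decoder, codebook, x1, K, beta=0.25):
    """One Selftok-style training step on a clean VAE latent x1 (placeholder modules)."""
    z_cont = encoder(x1)                            # (K, D') continuous token embeddings
    idx, z_q, q_loss = quantize(z_cont, codebook)

    t = torch.rand(()).item()                       # uniform time sampling
    noise = torch.randn_like(x1)
    xt = (1.0 - t) * noise + t * x1                 # simple interpolant standing in for q(x_t | x_1)

    k = uniform_schedule(t, K)
    suffix = z_q[k - 1:]                            # condition only on V_{k(t)} = [v_k(t), ..., v_K]

    x1_hat = decoder(xt, suffix, t)                 # diffusion decoder predicts the clean latent
    rec_loss = F.mse_loss(x1_hat, x1)               # || x_1 - Dec(x_t, V_{k(t)}) ||^2
    return rec_loss + beta * q_loss
```

The straight-through trick lets the reconstruction gradient reach the encoder despite the non-differentiable nearest-codeword lookup, which is what allows the tokenizer to be trained end-to-end.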
Tokenizer Validation
Selftok achieves state-of-the-art reconstruction quality on ImageNet compared to existing spatial (2D) and 1D tokenizers (Table 1). Qualitative results demonstrate high-quality reconstruction (Figure 2) and a non-spatial representation, contrasting with patch-based methods (Figure 3). Ablation studies validate the choices for codebook size, token count, time sampler (uniform sampling performs best), token schedule (custom schedule is better empirically), and the benefits of the one-step renderer (improved speed and quality - Table 2, 3, 4).
Selftok-based VLM
The discrete Selftok tokens make it possible to build a unified, purely AR VLM on top of an existing LLM (Llama3-8B in this case) by expanding its vocabulary with visual tokens, as sketched below. The VLM is then trained with a standard language-modeling (next-token prediction) objective on multimodal token sequences.
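A rough illustration of the vocabulary-expansion idea (the checkpoint name and vocabulary sizes are assumptions for the sketch, not the paper's configuration):

```python
from transformers import AutoModelForCausalLM

# Assumed sizes for illustration: a Llama-3-style text vocabulary plus ~32k Selftok codewords.
TEXT_VOCAB = 128_256
VISUAL_VOCAB = 32_768

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model.resize_token_embeddings(TEXT_VOCAB + VISUAL_VOCAB)  # adds embedding rows for visual tokens

def to_multimodal_ids(text_ids: list[int], selftok_ids: list[int]) -> list[int]:
    """Offset visual token ids past the text vocabulary so that text and image
    tokens form a single sequence trained with the ordinary next-token loss."""
    return text_ids + [TEXT_VOCAB + v for v in selftok_ids]
```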
- Training Stages:
- Stage 1: Cross-modality Alignment: Training on diverse data formats (Text-to-Image, Image-to-Text, Image-Only, Text-Only) to align language and visual tokens (Figure 4). Multi-level captioning enhances data richness.
- Stage 2: Cross-task Alignment: Supervised fine-tuning (SFT) on specific tasks like text-to-image generation, image editing, and image understanding.
- Inference: An adaptive logit adjustment strategy is used during inference (Algorithm 1). It adjusts logits based on prediction entropy to compensate for uncertainty arising from insufficient visual pretraining, although, ideally, a well-trained AR VLM would not require it, just as LLMs do not.
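Algorithm 1 is not reproduced in this summary; the sketch below shows one plausible form of entropy-conditioned logit adjustment, with a made-up threshold and temperature:

```python
import torch
import torch.nn.functional as F

def adaptive_logit_adjustment(logits, entropy_threshold=4.0, sharpen_temp=0.7):
    """Sharpen the next-token distribution only when prediction entropy is high.
    This mirrors the idea of Algorithm 1 (adjust logits based on entropy);
    the threshold and temperature values are illustrative, not the paper's."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)
    if entropy.item() > entropy_threshold:
        return logits / sharpen_temp  # lower temperature -> more confident sampling
    return logits
```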
VLM Validation
The Selftok-based VLM performs well on downstream tasks:
- Text-to-Image Generation: Achieves competitive results on GenEval and DPG-Bench after SFT (Selftok-SFT in Table 5, 6). Qualitative examples demonstrate good alignment and aesthetics (Figure 5).
- Image Editing: Shows competitive performance on PIE-Bench, balancing fidelity and editability (Table 7, Figure 6, 13).
- Vision-Language Comprehension: Achieves strong results on MME compared to other unified models, though specialized comprehension models perform better (Table 8).
- Synergy: Experiments show that training on both generation and comprehension tasks yields positive performance gains for Selftok (Δ>0), unlike spatial tokens which can show conflicts (Δ<0) (Figure 7). This synergy is key for using comprehension models as rewards in RL.
Selftok-based Visual RL
The paper argues that visual RL is essential for visual generative models to move beyond mimicking training data and achieve genuine reasoning, especially for handling rare or complex instructions (the hallucination issue; Figure 8 in Appendix B). The AR property of Selftok tokens is crucial for formulating visual RL as a finite Markov Decision Process (MDP) and deriving the Bellman equation, which guarantees effective policy updates. Spatial tokens, lacking AR structure, cannot support this.
- Problem Formulation: Visual RL is framed as an MDP where states are token sequences generated so far, actions are predicting the next token, and the reward evaluates the final generated image's quality relative to the task (e.g., text prompt).
- Implementation:
- Reward Model: Utilizes visual comprehension models. Program-based rewards use detectors for structured tasks (counting, position). QA-based rewards use powerful VLMs (like InternVL, GPT-4o) to evaluate complex prompts via VQA.
- Policy Gradient: A simplified GRPO approach updates the policy network based on reward advantages, with a KL divergence term included for stability (Eq. 15); a sketch of this update follows the list.
- The paper explicitly differentiates Selftok-based RL from diffusion-based DPO, highlighting that Selftok provides actual state-action trajectories necessary for standard RL, unlike diffusion which lacks access to true inversion paths.
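A minimal sketch of a group-relative policy update of the kind described above (the advantage normalization, KL estimate, and coefficient are illustrative choices; this is not Eq. 15 verbatim):

```python
import torch

def grpo_style_loss(logprobs, ref_logprobs, rewards, kl_coef=0.05):
    """Group-relative policy gradient with a KL penalty toward the reference policy.
    logprobs / ref_logprobs: (G, T) per-token log-probs of G sampled image-token
    sequences under the current and frozen SFT policies; rewards: (G,) scores from
    a program-based or QA-based reward model applied to the decoded images."""
    # Advantage = reward standardized within the sampled group (GRPO-style baseline).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # (G,)
    pg_loss = -(adv.unsqueeze(1) * logprobs).mean()              # policy-gradient term
    kl = (logprobs - ref_logprobs).mean()                        # crude KL estimate for stability
    return pg_loss + kl_coef * kl
```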
Visual RL Validation
Applying visual RL (Selftok-Zero) after SFT significantly boosts text-to-image generation performance. Selftok-Zero achieves state-of-the-art results on both GenEval (92 overall, +18 gain over Selftok-SFT) and DPG-Bench (85.57 overall, +3.77 gain over Selftok-SFT) (Table 5, 6). Gains are particularly pronounced in complex attributes like Position and Counting. Critically, Selftok shows much larger gains from visual RL compared to spatial token methods (Janus-Pro-Zero), confirming the hypothesis that AR tokens are more effective for RL (Figure 9). Program-based rewards yield larger gains than QA-based rewards, likely due to their more precise signal. Qualitative results show Selftok-Zero successfully generating images for prompts where Selftok-SFT failed due to training data biases (Figure 10, Figure 8 in Appendix B). Visual RL also improves image editing performance (Figure 11 in Appendix B).
Conclusion, Limitations, and Ongoing Work
The paper concludes that Selftok effectively addresses the challenge of enabling effective RL for visual generation by providing AR discrete tokens. Its key contributions are unifying diffusion and AR in a single LLM framework and demonstrating the practical benefits of AR visual tokens for training and RL.
Limitations include the slower token generation speed of AR models compared to diffusion, posing throughput challenges for high-resolution video. The current model scale also limits the demonstration of multimodal emergent capabilities.
Ongoing work focuses on addressing these limitations:
- Multi-resolution Selftok: Scaling to higher resolutions (e.g., 512×512) by increasing token count and reusing lower-resolution tokens (Figure 10).
- Physics-aware Post-training: Incorporating physical laws into reward models for video generation to improve realism and world modeling capabilities.