- The paper introduces Selftok, a discrete visual tokenizer that generates autoregressive tokens through a novel diffusion-based approach for unified vision-language reasoning.
- Its methodology leverages an encoder, quantizer, and diffusion decoder with a custom token schedule to enforce causal dependencies and optimize image reconstruction.
- Empirical results demonstrate state-of-the-art ImageNet reconstruction and improved performance in visual reasoning, image editing, and reinforcement learning applications.
This paper, "Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning" (2505.07538), introduces Selftok, a novel discrete visual tokenizer designed to generate autoregressive (AR) visual tokens. Unlike traditional methods that rely on spatial priors (where tokens directly map to image patches), Selftok completely discards this prior. The core motivation is to create visual tokens that are fundamentally compatible with the discrete autoregressive architecture of LLMs, thereby enabling the creation of unified Vision-LLMs (VLMs) that can leverage the well-established training paradigms and emergent capabilities (like reasoning via Reinforcement Learning) of LLMs.
The paper argues against using continuous representations (cAR) or discrete spatial tokens for building unified AR VLMs. Continuous tokens complicate training (MSE regression is less stable than cross-entropy), hinder reinforcement learning (an infinite state-action space), and are less conducive to disentanglement. Spatial tokens, despite being discrete, have non-AR causal dependencies (the collider effect), which violates the policy-improvement optimality required for effective Reinforcement Learning (RL) and makes them incompatible with the causal structure of AR models.
How Selftok Works
Selftok is designed to encode an image I into a sequence of K discrete tokens V_K = [v_1, …, v_K] such that they conform to an AR prior, allowing reconstruction of the image. The key innovation is composing the AR constraint into the reconstruction objective by leveraging the recursive nature of the reverse diffusion process.
The reverse diffusion process describes the transformation from noise (x_0) to a clean image (x_1) over time t ∈ [0, 1]. This process can be viewed recursively: the path from a noisy intermediate state x_t to the clean image x_1 is a sub-problem of the full path from x_0 to x_1. Selftok establishes a correspondence between the AR recursion P(V_K) = P(V_{<i}) · P(V_i ∣ V_{<i}) (where V_i denotes the token suffix starting at index i) and the diffusion recursion x_1 = x_t + ∫_t^1 v_s(x_s) ds.
The Selftok training objective (Eq. 9) minimizes the expected squared error between the image x_1 and the reconstruction produced by a decoder, conditioned on a noisy intermediate state x_t (sampled from q(x_t ∣ x_1)) and a suffix of tokens V_{k(t)} = [v_{k(t)}, …, v_K]. The index k(t) is determined by a token schedule mapping continuous time t to a token index. This recursive formulation forces the tokens V_{k(t)} to encode the information needed to complete the diffusion path from x_t to x_1, thereby implicitly learning an AR structure in which later tokens depend on earlier ones (represented by x_t). The paper shows this induces a causal graph suitable for AR learning, using x_0 as an instrumental variable (Eq. 7).
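Spelled out in the notation above, the objective amounts to roughly the following (a reconstruction from the description of Eq. 9; any per-timestep weighting used in the paper is omitted):

```latex
\min_{\mathrm{Enc},\,\mathrm{Dec}}\;
\mathbb{E}_{t \sim \mathcal{U}[0,1]}\,
\mathbb{E}_{x_t \sim q(x_t \mid x_1)}
\big\| \, x_1 - \mathrm{Dec}\big(x_t,\, V_{k(t)}\big) \,\big\|^2,
\qquad V_{k(t)} = [\,v_{k(t)}, \ldots, v_K\,].
```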
Empirical evidence for Selftok's AR property comes from analyzing the entropy of next-token predictions under different token orderings (Figure 1), which shows the segmented, decreasing trends characteristic of AR sequences.
Implementation Details
The Selftok tokenizer consists of an Encoder, a Quantizer, and a Decoder, leveraging a dual-stream transformer architecture similar to MMDiT.
- Encoder: Takes a VAE latent of the input image and learnable continuous token embeddings. A dual-stream transformer processes image patches and token embeddings, interacting via co-attention. Token-aware adaptive layer normalization (AdaLN) differentiates token embeddings. The output is K continuous token embeddings.
- Quantizer: Compresses the continuous embeddings through a bottleneck layer and quantizes them by mapping each to the closest word in a learnable codebook (≈32,000 words, dimension D′=16). A straight-through estimator handles backpropagation. Quantization loss (commitment + entropy) and EMA updates with dead-code reactivation are used for training the codebook.
- Decoder: A diffusion model initialized from SD3 weights. It's a dual-stream transformer conditioned on the noisy image latent xt and the quantized token embeddings Vk(t). The token stream weights are trained from scratch. Timestep conditioning is applied via AdaLN in the image stream.
- Training Objective: Jointly optimizes encoder and decoder parameters using the reconstruction loss ∥x_1 − Dec(x_t, V_{k(t)})∥² averaged over uniform time sampling, plus the quantization loss; a minimal sketch of one training step appears after this list. A re-weighting mechanism for token embeddings compensates for imbalanced gradient updates caused by the token schedule.
- Token Schedule k(t): Maps diffusion time t to the starting index of the token suffix Vk(t). While a uniform schedule k(t)=t×K+1 is theoretically ideal for AR, an empirically better custom schedule allocates fewer tokens to small t, aligning with the observation that early diffusion steps have less impact on reconstruction quality.
- One-step Renderer: After training the tokenizer, a separate renderer is trained to reconstruct the image from the full token sequence VK in a single forward pass. It's initialized from the decoder and trained with MSE, LPIPS, and GAN losses (Eq. 12) to improve speed and perceptual quality.
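The following is a minimal sketch of one tokenizer training step, tying the pieces above together. The encoder and decoder are treated as black boxes; the interpolant used for q(x_t | x_1), the loss weight, and the uniform schedule are illustrative placeholders, and the custom schedule, entropy loss, EMA codebook updates, and re-weighting are omitted:

```python
import torch
import torch.nn.functional as F

def uniform_schedule(t: float, K: int) -> int:
    """Map diffusion time t in [0, 1] to the starting token index k(t).
    Only the uniform variant k(t) = t*K + 1 is shown; the paper's preferred
    custom schedule allocates fewer tokens to small t."""
    return min(int(t * K) + 1, K)

def quantize(z_cont, codebook):
    """Nearest-codeword lookup with a straight-through estimator.
    z_cont: (K, D') continuous token embeddings, codebook: (V, D')."""
    dists = torch.cdist(z_cont, codebook)           # (K, V) pairwise distances
    idx = dists.argmin(dim=-1)                      # discrete token ids v_1..v_K
    z_q = codebook[idx]                             # quantized embeddings
    # Straight-through: forward pass uses z_q, gradients flow back to z_cont.
    z_st = z_cont + (z_q - z_cont).detach()
    commit_loss = F.mse_loss(z_cont, z_q.detach())  # commitment term (entropy term omitted)
    return idx, z_st, commit_loss

def training_step(encoder, decoder, codebook, x1, K, beta=0.25):
    """One Selftok-style training step on a clean VAE latent x1 (placeholder modules)."""
    z_cont = encoder(x1)                            # (K, D') continuous token embeddings
    idx, z_q, q_loss = quantize(z_cont, codebook)

    t = torch.rand(()).item()                       # uniform time sampling
    noise = torch.randn_like(x1)
    xt = (1.0 - t) * noise + t * x1                 # simple interpolant standing in for q(x_t | x_1)

    k = uniform_schedule(t, K)
    suffix = z_q[k - 1:]                            # condition only on V_{k(t)} = [v_k(t), ..., v_K]

    x1_hat = decoder(xt, suffix, t)                 # diffusion decoder predicts the clean latent
    rec_loss = F.mse_loss(x1_hat, x1)               # || x_1 - Dec(x_t, V_{k(t)}) ||^2
    return rec_loss + beta * q_loss
```

The straight-through trick lets the reconstruction gradient reach the encoder despite the non-differentiable nearest-codeword lookup, which is what allows the tokenizer to be trained end-to-end.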
Tokenizer Validation
Selftok achieves state-of-the-art reconstruction quality on ImageNet compared to existing spatial (2D) and 1D tokenizers (Table 1). Qualitative results demonstrate high-quality reconstruction (Figure 2) and a non-spatial representation, contrasting with patch-based methods (Figure 3). Ablation studies validate the choices for codebook size, token count, time sampler (uniform sampling performs best), token schedule (custom schedule is better empirically), and the benefits of the one-step renderer (improved speed and quality - Table 2, 3, 4).
Selftok-based VLM
The discrete Selftok tokens make it possible to build a unified, purely AR VLM on top of an existing LLM (Llama3-8B in this case) by expanding its vocabulary with visual tokens, as sketched below. The VLM is then trained with a standard language-modeling (next-token prediction) objective on multimodal token sequences.
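A rough illustration of the vocabulary-expansion idea (the checkpoint name and vocabulary sizes are assumptions for the sketch, not the paper's configuration):

```python
from transformers import AutoModelForCausalLM

# Assumed sizes for illustration: a Llama-3-style text vocabulary plus ~32k Selftok codewords.
TEXT_VOCAB = 128_256
VISUAL_VOCAB = 32_768

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model.resize_token_embeddings(TEXT_VOCAB + VISUAL_VOCAB)  # adds embedding rows for visual tokens

def to_multimodal_ids(text_ids: list[int], selftok_ids: list[int]) -> list[int]:
    """Offset visual token ids past the text vocabulary so that text and image
    tokens form a single sequence trained with the ordinary next-token loss."""
    return text_ids + [TEXT_VOCAB + v for v in selftok_ids]
```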
- Training Stages:
- Stage 1: Cross-modality Alignment: Training on diverse data formats (Text-to-Image, Image-to-Text, Image-Only, Text-Only) to align language and visual tokens (Figure 4). Multi-level captioning enhances data richness.
- Stage 2: Cross-task Alignment: Supervised fine-tuning (SFT) on specific tasks like text-to-image generation, image editing, and image understanding.
- Inference: An adaptive logit adjustment strategy is used during inference (Algorithm 1). It adjusts logits based on prediction entropy to compensate for uncertainty arising from insufficient visual pretraining, although, ideally, a well-trained AR VLM would not require it, just as LLMs do not.
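Algorithm 1 is not reproduced in this summary; the sketch below shows one plausible form of entropy-conditioned logit adjustment, with a made-up threshold and temperature:

```python
import torch
import torch.nn.functional as F

def adaptive_logit_adjustment(logits, entropy_threshold=4.0, sharpen_temp=0.7):
    """Sharpen the next-token distribution only when prediction entropy is high.
    This mirrors the idea of Algorithm 1 (adjust logits based on entropy);
    the threshold and temperature values are illustrative, not the paper's."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)
    if entropy.item() > entropy_threshold:
        return logits / sharpen_temp  # lower temperature -> more confident sampling
    return logits
```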
VLM Validation
The Selftok-based VLM performs well on downstream tasks:
- Text-to-Image Generation: Achieves competitive results on GenEval and DPG-Bench after SFT (Selftok-SFT in Table 5, 6). Qualitative examples demonstrate good alignment and aesthetics (Figure 5).
- Image Editing: Shows competitive performance on PIE-Bench, balancing fidelity and editability (Table 7, Figure 6, 13).
- Vision-Language Comprehension: Achieves strong results on MME compared to other unified models, though specialized comprehension models perform better (Table 8).
- Synergy: Experiments show that training on both generation and comprehension tasks yields positive performance gains for Selftok (Δ>0), unlike spatial tokens which can show conflicts (Δ<0) (Figure 7). This synergy is key for using comprehension models as rewards in RL.
Selftok-based Visual RL
The paper argues that visual RL is essential for visual generative models to move beyond mimicking training data and achieve genuine reasoning, especially for handling rare or complex instructions (the hallucination issue; Figure 8 in Appendix B). The AR property of Selftok tokens is crucial for formulating visual RL as a finite Markov Decision Process (MDP) and deriving the Bellman equation, which guarantees effective policy updates. Spatial tokens, lacking AR structure, cannot support this.
- Problem Formulation: Visual RL is framed as an MDP where states are token sequences generated so far, actions are predicting the next token, and the reward evaluates the final generated image's quality relative to the task (e.g., text prompt).
- Implementation:
- Reward Model: Utilizes visual comprehension models. Program-based rewards use detectors for structured tasks (counting, position). QA-based rewards use powerful VLMs (like InternVL, GPT-4o) to evaluate complex prompts via VQA.
- Policy Gradient: A simplified GRPO approach updates the policy network based on reward advantages, with a KL divergence term included for stability (Eq. 15); a sketch of this update follows the list.
- The paper explicitly differentiates Selftok-based RL from diffusion-based DPO, highlighting that Selftok provides actual state-action trajectories necessary for standard RL, unlike diffusion which lacks access to true inversion paths.
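A minimal sketch of a group-relative policy update of the kind described above (the advantage normalization, KL estimate, and coefficient are illustrative choices; this is not Eq. 15 verbatim):

```python
import torch

def grpo_style_loss(logprobs, ref_logprobs, rewards, kl_coef=0.05):
    """Group-relative policy gradient with a KL penalty toward the reference policy.
    logprobs / ref_logprobs: (G, T) per-token log-probs of G sampled image-token
    sequences under the current and frozen SFT policies; rewards: (G,) scores from
    a program-based or QA-based reward model applied to the decoded images."""
    # Advantage = reward standardized within the sampled group (GRPO-style baseline).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)    # (G,)
    pg_loss = -(adv.unsqueeze(1) * logprobs).mean()              # policy-gradient term
    kl = (logprobs - ref_logprobs).mean()                        # crude KL estimate for stability
    return pg_loss + kl_coef * kl
```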
Visual RL Validation
Applying visual RL (Selftok-Zero) after SFT significantly boosts text-to-image generation performance. Selftok-Zero achieves state-of-the-art results on both GenEval (92 overall, +18 gain over Selftok-SFT) and DPG-Bench (85.57 overall, +3.77 gain over Selftok-SFT) (Table 5, 6). Gains are particularly pronounced in complex attributes like Position and Counting. Critically, Selftok shows much larger gains from visual RL compared to spatial token methods (Janus-Pro-Zero), confirming the hypothesis that AR tokens are more effective for RL (Figure 9). Program-based rewards yield larger gains than QA-based rewards, likely due to their more precise signal. Qualitative results show Selftok-Zero successfully generating images for prompts where Selftok-SFT failed due to training data biases (Figure 10, Figure 8 in Appendix B). Visual RL also improves image editing performance (Figure 11 in Appendix B).
Conclusion, Limitations, and Ongoing Work
The paper concludes that Selftok effectively addresses the challenge of enabling effective RL for visual generation by providing AR discrete tokens. Its key contributions are unifying diffusion and AR in a single LLM framework and demonstrating the practical benefits of AR visual tokens for training and RL.
Limitations include the slower token generation speed of AR models compared to diffusion, posing throughput challenges for high-resolution video. The current model scale also limits the demonstration of multimodal emergent capabilities.
Ongoing work focuses on addressing these limitations:
- Multi-resolution Selftok: Scaling to higher resolutions (e.g., 512×512) by increasing token count and reusing lower-resolution tokens (Figure 10).
- Physics-aware Post-training: Incorporating physical laws into reward models for video generation to improve realism and world modeling capabilities.