Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis (2503.08354v2)

Published 11 Mar 2025 in cs.CV and cs.AI

Abstract: Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g., rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g., gFID). In this paper, we comprehensively analyze the reason for the discrepancy of reconstruction and generation qualities in a discrete latent space, and, from which, we propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., unexpected tokens sampled from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a $\sim$400M generator. Code: https://github.com/lxa9867/ImageFolder.

Summary

  • The paper introduces RobustTok, a novel tokenizer training scheme using latent perturbation and a new pFID metric to enhance the robustness of discrete latent spaces, addressing error accumulation in autoregressive image generation.
  • RobustTok simulates sampling errors during training via latent perturbation, which replaces tokens with neighbors from the codebook, and evaluates robustness using the proposed perturbed FID (pFID) metric.
  • Experiments on ImageNet demonstrate RobustTok's effectiveness, showing improved generative performance (e.g., gFID of 1.60) and faster convergence compared to prior methods, confirming that latent space robustness boosts image generation quality.

The paper introduces RobustTok, a novel plug-and-play tokenizer training scheme designed to enhance the robustness of discrete latent spaces in autoregressive image generation. The core idea revolves around addressing the error accumulation problem that arises from the discrepancies between tokenizer training and AR inference conditions. Specifically, the paper identifies that autoregressive error propagation primarily stems from the lack of robustness in discrete latent spaces. To tackle this, the authors propose a latent perturbation method to simulate sampling noises during the generative process and a new tokenizer evaluation metric, perturbed FID (pFID), to measure the robustness of the discrete latent space under synthesized sampling error.

Here's a breakdown of the key components and contributions:

  • Problem Definition: The paper begins by highlighting the challenges in autoregressive image generation, particularly the error accumulation issue. It is noted that while autoregressive models are trained under teacher forcing, during inference, predictions rely solely on previously generated tokens. This discrepancy leads to sampling errors, where unexpected tokens are sampled, challenging the robustness of frozen visual decoders.
  • Latent Perturbation: To address the identified issue, the authors introduce a latent perturbation method for tokenizer training. This approach involves adding noise to the latent tokens during training to enhance the tokenizer's robustness.
    • Perturbation Rate ($\alpha$): Defined as the proportion of perturbed tokens within an image.
    • Perturbation Proportion ($\beta$): Represents the proportion of images within a batch to which perturbation is applied.
    • Perturbation Strength ($\delta$): Quantifies the level of perturbation, where a discrete token $\mathbf{z} = \mathbf{e}_k$ is replaced with a randomly selected top-$\delta$ nearest neighbor $\mathbf{e}_\delta$ from the codebook $\mathcal{C}$. The set of top-$\delta$ nearest neighbors $\mathcal{S}_\delta$ is mathematically expressed as:

      $$\mathcal{S}_\delta = \underset{\mathcal{S}_\delta \subset \mathcal{C},\; |\mathcal{S}_\delta| = \delta}{\arg\min} \sum_{\mathbf{e}_n \in \mathcal{S}_\delta} \|\mathbf{e}_n - \mathbf{e}_k\|_2^2$$

      where:

      • $\mathcal{S}_\delta$ is the set of top-$\delta$ nearest neighbors.
      • $\mathcal{C}$ is the codebook.
      • $\mathbf{e}_n$ is a codeword within $\mathcal{S}_\delta$.
      • $\mathbf{e}_k$ is the original token.
      • $|\cdot|$ denotes the counting operation.
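The perturbation step described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation; the function name `perturb_tokens` and its array-based interface are assumptions.

```python
import numpy as np

def perturb_tokens(indices, codebook, alpha=0.1, delta=8, rng=None):
    """Illustrative sketch of latent perturbation.

    indices:  (N,) array of codebook indices for one image's tokens
    codebook: (K, d) array of codeword embeddings
    alpha:    perturbation rate -- fraction of tokens to perturb
    delta:    perturbation strength -- replacement is drawn from the
              top-delta nearest neighbors of the original codeword
    """
    rng = rng if rng is not None else np.random.default_rng()
    indices = indices.copy()
    n_perturb = int(alpha * len(indices))
    for pos in rng.choice(len(indices), size=n_perturb, replace=False):
        k = indices[pos]
        # Squared L2 distance from e_k to every codeword in the codebook.
        d2 = ((codebook - codebook[k]) ** 2).sum(axis=1)
        # Top-delta nearest neighbors, excluding e_k itself (index 0 of argsort).
        neighbors = np.argsort(d2)[1:delta + 1]
        indices[pos] = rng.choice(neighbors)
    return indices
```

During tokenizer training, the decoder then reconstructs from the perturbed indices, forcing it to tolerate the kind of off-distribution tokens an AR sampler produces.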
  • Perturbed FID (pFID): A novel evaluation metric is introduced to measure the robustness of the discrete latent space under synthesized sampling error. The pFID metric is designed to capture the quality of subsequent generation and guide improvements in AR generative modeling. pFID is calculated by applying perturbation across all images ($\beta = 1$) and averaging the FID scores between the input and reconstructed images over a set of perturbation rates $\alpha \in \{0.9, 0.8, 0.7, 0.6, 0.5\}$ and perturbation strengths $\delta \in \{200, 280, 360\}$.
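Structurally, the pFID protocol reduces to averaging FID over a grid of settings. In this sketch, `fid_fn` and the tokenizer's `decode_perturbed` method are assumed interfaces standing in for a real FID implementation and the perturbed reconstruction pass, not the paper's actual API.

```python
import itertools

def pfid(fid_fn, tokenizer, images,
         alphas=(0.9, 0.8, 0.7, 0.6, 0.5),
         deltas=(200, 280, 360)):
    """Average FID over every (alpha, delta) combination, with
    perturbation applied to all images (beta = 1)."""
    scores = []
    for alpha, delta in itertools.product(alphas, deltas):
        # Quantize, perturb at (alpha, delta), and decode back to pixels.
        recon = tokenizer.decode_perturbed(images, alpha=alpha, delta=delta)
        scores.append(fid_fn(images, recon))
    return sum(scores) / len(scores)
```

Because the perturbation mimics sampler mistakes, a tokenizer with a low pFID degrades gracefully when the generator samples unexpected tokens, which is what correlates pFID with gFID.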
  • RobustTok Architecture and Training: The proposed RobustTok leverages a Vision Transformer (ViT) architecture for both the visual encoder and decoder. The training process involves a plug-and-play perturbation strategy, incorporating a pretrained DINOv2 model for injecting semantics into the latent space. During training, latent perturbation is applied after semantic regularization to preserve clear semantics in the discrete tokens.
  • Experimental Results: The efficacy of the approach is demonstrated through experiments on the ImageNet 256x256 benchmark. The results indicate that RobustTok outperforms existing methods, achieving lower gFID scores and accelerated convergence. Ablation studies validate the effectiveness of perturbation parameters, confirming that robustness gains directly translate to improved generative performance. For instance, RobustTok achieves a gFID of 1.60 with classifier-free guidance (CFG) using a ~400M generator. Notable gFID gains of 0.12 and 0.10 are achieved by applying RobustTok on top of the RAR generator.
  • Robust Latent Space: t-SNE visualizations of the latent space reveal that RobustTok constructs a space with many reusable tokens, which act as key tokens that can be easily modeled. In contrast, the latent space without latent perturbation distributes usage more uniformly across tokens.
  • Ablation Studies: Ablation experiments are conducted to determine the optimal selection of perturbation hyperparameters. The results indicate that using a large perturbation parameter, e.g., $\beta = 0.5$, degrades the model's reconstruction capability and adversely affects generative performance. Training without an annealing strategy leads to mode collapse and loss of generation diversity, whereas annealing to zero results in an overly deterministic tokenizer. It was found that annealing to half strikes a balance between robustness and adaptability, preserving essential latent properties while improving the quality of generated outputs.
  • Loss Function: RobustTok is trained with a composite loss function that includes a reconstruction loss ($\mathcal{L}_{rec}$), a vector quantization loss ($\mathcal{L}_{VQ}$), an adversarial loss ($\mathcal{L}_{ad}$), a perceptual loss ($\mathcal{L}_P$), and a semantic loss ($\mathcal{L}_{sem}$). The overall loss function is expressed as:

    $$\mathcal{L} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{VQ}\mathcal{L}_{VQ} + \lambda_{ad}\mathcal{L}_{ad} + \lambda_P\mathcal{L}_P + \lambda_{sem}\mathcal{L}_{sem}$$

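The composite objective is a weighted sum of the five terms. A minimal sketch follows; the lambda defaults are illustrative placeholders, since the paper's actual coefficients are not given in this summary.

```python
def composite_loss(l_rec, l_vq, l_ad, l_p, l_sem,
                   lam_rec=1.0, lam_vq=1.0, lam_ad=0.1,
                   lam_p=1.0, lam_sem=1.0):
    """Weighted sum of the five tokenizer losses.

    The lambda defaults here are placeholder values for illustration,
    not the settings used by RobustTok.
    """
    return (lam_rec * l_rec + lam_vq * l_vq + lam_ad * l_ad
            + lam_p * l_p + lam_sem * l_sem)
```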
  • Limitations: While the paper primarily focuses on discrete latent spaces, the authors acknowledge that the discussed problem also exists in continuous tokenizers with diffusion models. They suggest that future work could explore latent perturbation in continuous tokenizers, noting the challenges in determining appropriate perturbation strategies without a constrained codebook.