Latent Denoising Tokenizer (l-DeTok)
- The paper introduces l-DeTok, a tokenizer that incorporates explicit denoising into its training to better align latent embeddings with generative model requirements.
- It employs an encoder-decoder architecture with interpolative Gaussian noise and random masking to generate robust latent representations from corrupted inputs.
- Experimental benchmarks on ImageNet show significant FID improvements for both autoregressive and diffusion models, highlighting its broad applicability.
A Latent Denoising Tokenizer (l-DeTok) is a tokenizer architecture designed to produce latent representations that are robust to corruption and noise. Its core training objective is to align the encoder’s latent embeddings with the denoising requirements of downstream generative models, particularly those that rely on reconstructing clean signals from corrupted inputs. This approach addresses a limitation of conventional tokenizers by explicitly simulating the denoising tasks encountered in generative model training, with the aim of improving generation quality and robustness. The reported experiments are in vision (ImageNet); other modalities are listed as future work (Yang et al., 21 Jul 2025).
1. Foundational Principles and Motivation
The guiding principle behind l-DeTok is that modern generative models, such as diffusion models or autoregressive (AR) transformers, are trained to reconstruct clean data from corrupted forms (e.g., masked, noised, or sampled with errors). Traditional tokenizers, typically trained with pixel-level or input-level reconstruction losses (e.g., as variational autoencoders), do not explicitly align their latent representations with the specific denoising procedures that generative models apply. l-DeTok introduces denoising directly into the tokenizer’s training, prompting the encoder to generate representations that remain informative and recoverable even when subjected to strong corruption such as interpolative Gaussian noise and random masking (Yang et al., 21 Jul 2025).
2. Objective Function and Latent Deconstruction
The l-DeTok objective is to reconstruct the original data (e.g., image) from heavily corrupted latent embeddings. The loss function employed is a weighted sum of the following terms:
- Pixel-wise mean-squared error (MSE) loss: Promotes faithful image reconstruction.
- KL regularization term: Structures the latent space similar to a variational autoencoder.
- Perceptual loss: Ensures that reconstructed images are perceptually close to the target using, e.g., ConvNeXt or VGG features.
- Adversarial loss: Encourages sharp, realistic outputs by incorporating a GAN.
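The total loss takes the form

$$\mathcal{L} = \mathcal{L}_{\text{MSE}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{percep}}\,\mathcal{L}_{\text{percep}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}},$$

where each $\lambda$ is a scalar weight on the corresponding term.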
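As a minimal sketch of how the four terms might be combined, the snippet below computes a pixel-wise MSE and the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, then forms the weighted sum. The function names (`mse`, `kl_diag_gaussian`, `total_loss`) and the default weights are illustrative assumptions, not values reported by the source. The perceptual and adversarial terms are passed in as precomputed scalars because they depend on external networks.

```python
import math

def mse(pred, target):
    """Mean-squared error between two equal-length sequences of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def total_loss(l_mse, l_kl, l_percep, l_gan,
               w_kl=1e-6, w_percep=1.0, w_gan=0.1):
    """Weighted sum of the four terms; the default weights are placeholders."""
    return l_mse + w_kl * l_kl + w_percep * l_percep + w_gan * l_gan

# Toy values, chosen only to exercise the functions.
l_mse = mse([0.5, 0.25], [0.5, 0.75])            # (0 + 0.25) / 2 = 0.125
l_kl = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])  # posterior equals prior, so 0.0
loss = total_loss(l_mse, l_kl, l_percep=0.2, l_gan=0.3)
```

In practice the weights are tuned so that no single term dominates; the KL weight in particular is often kept small so that reconstruction quality is not sacrificed.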
During training, input images are encoded into latent representations, which are then deconstructed by two primary mechanisms:
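- Interpolative Gaussian Noise: Each latent token $x$ is interpolated with Gaussian noise via

  $$x' = (1 - \tau)\,x + \tau\,\varepsilon(\gamma),$$

  where $\tau \sim \mathcal{U}(0, 1)$ and $\varepsilon(\gamma) \sim \gamma \cdot \mathcal{N}(0, I)$ is zero-mean Gaussian noise, scaled by hyperparameter $\gamma$ (Yang et al., 21 Jul 2025).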
- Random Masking: A randomly selected proportion of latent tokens is replaced by a mask token, at a variable masking rate $m$ sampled from a biased uniform distribution. This challenges the decoder to reconstruct the full image even when significant information is unavailable (Yang et al., 21 Jul 2025).
These strategies require the decoder to “denoise” latent codes, directly regularizing the encoder to produce robust and reconstructable representations.
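The two corruption mechanisms can be sketched without any deep-learning framework. The snippet below is a minimal illustration using only the Python standard library: `interpolate_noise` mixes each latent value with Gaussian noise scaled by `gamma`, and `sample_mask_ratio` draws a masking rate from a uniform distribution whose lower bound is slightly negative and then clips it at zero. That clipped form is an assumption about what a "biased" uniform distribution could look like; the function names, the `low=-0.1` bound, and the default values are hypothetical.

```python
import random

def interpolate_noise(tokens, tau, gamma=3.0, rng=random):
    """Return (1 - tau) * x + tau * gamma * eps for every value x, eps ~ N(0, 1)."""
    return [[(1.0 - tau) * x + tau * gamma * rng.gauss(0.0, 1.0) for x in tok]
            for tok in tokens]

def sample_mask_ratio(max_ratio=0.7, low=-0.1, rng=random):
    """Draw U(low, max_ratio) and clip at 0, so 'no masking' has nonzero probability."""
    return max(0.0, rng.uniform(low, max_ratio))

def random_mask(num_tokens, ratio, rng=random):
    """Pick which token positions to mask; returns a sorted list of indices."""
    num_masked = int(num_tokens * ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))

rng = random.Random(0)                      # seeded for repeatability
latents = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.5]]
tau = rng.random()                          # noise level in [0, 1)
noised = interpolate_noise(latents, tau, rng=rng)
ratio = sample_mask_ratio(rng=rng)
masked_positions = random_mask(len(latents), ratio, rng=rng)
```

With `tau = 0` the latents pass through unchanged, and as `tau` approaches 1 the output approaches pure scaled noise, so the decoder sees the whole range from clean to fully corrupted inputs.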
3. Methodology and Implementation
l-DeTok typically adopts an encoder–decoder architecture, often implemented with Vision Transformers (ViTs) as the backbone. During training:
- The encoder converts an input (e.g., an image patch sequence) into latent tokens.
- The tokens are subjected to interpolative noise and/or random masking.
- The decoder reconstructs the original input from these corrupted latents.
A simplified implementation is as follows:
```python
import torch

def denoise(x, encoder, decoder, max_mask_ratio=0.7, gamma=3.0):
    # Encode the image; the encoder applies random masking up to max_mask_ratio
    z, ids_restore = encoder(x, max_mask_ratio=max_mask_ratio)

    # Variational latent embeddings
    # (diagonal_gaussian_dist is a helper assumed to be defined elsewhere)
    posteriors = diagonal_gaussian_dist(z)
    z_sampled = posteriors.sample()

    # Uniform sampling of one noise level tau in [0, 1) per example
    batch_size = z_sampled.size(0)
    noise_level = torch.rand(batch_size, 1, 1, device=z_sampled.device)
    noise_level = noise_level.expand_as(z_sampled)

    # Generate Gaussian noise scaled by gamma and interpolate with the latents
    noise = gamma * torch.randn_like(z_sampled)
    z_noised = (1 - noise_level) * z_sampled + noise_level * noise

    # Reconstruction from corrupted latents
    recon = decoder(z_noised, ids_restore)
    return recon
```
At inference, the encoder emits clean latent tokens; noise and masking are disabled, ensuring full signal fidelity for downstream applications.
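One common way to realize this switch is a training flag that gates the corruption step. The sketch below is a hypothetical wrapper (the name `encode_tokens` and its arguments are illustrative, not the source's API) that applies a caller-supplied corruption function only during training and returns the clean latents otherwise.

```python
def encode_tokens(x, encode, corrupt=None, training=False):
    """Encode x; apply `corrupt` to the latents only when training is True."""
    z = encode(x)
    if training and corrupt is not None:
        z = corrupt(z)
    return z

# Stand-in encoder and corruption, used only to show the control flow.
def toy_encode(x):
    return [2.0 * v for v in x]

def toy_corrupt(z):
    return [0.0 for _ in z]   # stands in for "mask everything"

clean = encode_tokens([1.0, 2.0], toy_encode, toy_corrupt, training=False)
corrupted = encode_tokens([1.0, 2.0], toy_encode, toy_corrupt, training=True)
```

Keeping the corruption outside the encoder itself means the same encoder weights serve both regimes, and downstream models only ever see the clean path.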
4. Benchmark Results and Performance Analysis
Experiments on ImageNet 256×256 show that l-DeTok outperforms standard tokenizers across a variety of representative generative models, including:
- Non-autoregressive diffusion models (e.g., DiT, SiT, LightningDiT)
- Autoregressive models (e.g., MAR, RandomAR, RasterAR)
Notable findings include:
- Consistent reduction in Fréchet Inception Distance (FID) compared to standard tokenizers, for both AR and non-AR models.
- For example, with the MAR model and classifier-free guidance, FID improved from 3.31 to 2.65; for an extended model (MAR-B), FID improved from 2.31 to 1.55 (Yang et al., 21 Jul 2025).
- Robust generalization across model classes: while semantic-distilled tokenizers benefit some model types, l-DeTok yields robust improvements regardless of whether the downstream model is AR or non-AR.
These results underscore the effectiveness of aligning the tokenizer’s latent space with the denoising demands of the generative task.
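To put the quoted numbers on a common scale, the short calculation below converts each before/after FID pair into a relative reduction. Only the four FID values come from the text above; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after` (lower FID is better)."""
    return (before - after) / before

mar = relative_reduction(3.31, 2.65)     # roughly a 20% reduction
mar_b = relative_reduction(2.31, 1.55)   # roughly a 33% reduction

print(f"MAR:   {mar:.1%}")
print(f"MAR-B: {mar_b:.1%}")
```

Relative reductions are only a convenience for comparison; FID is not a linear scale, so equal percentage changes at different absolute levels need not reflect equal perceptual gains.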
5. Theoretical Context and Connections to Related Work
The rationale for l-DeTok is situated within a broader trend in generative modeling: nearly all state-of-the-art models for vision or text employ objectives that involve denoising corrupted input—either through masking, noise injection, or sequential prediction. The l-DeTok design thus provides an explicit encoder regularization aligned with this core paradigm.
The approach relates conceptually to the use of denoising autoencoders in text (where corruption is applied to sentences and the model reconstructs the original; see also denoising adversarial autoencoders in text generation (Shen et al., 2019)), as well as latent denoising via residual mapping in communication systems (Xu et al., 11 Feb 2025). However, l-DeTok uniquely formulates both the noise process (interpolative with per-token variable strengths) and the random masking to train ViT-based encoder–decoder models explicitly for generative denoising demands (Yang et al., 21 Jul 2025).
6. Practical Impact, Limitations, and Future Directions
l-DeTok marks a significant shift in tokenizer design, emphasizing not just reconstruction accuracy but explicit robustness to latent corruption—a property directly relevant to the operational regime of diffusion or AR generative models. The unified denoising training principle offers several practical advantages:
- Improved generation quality across a diverse array of downstream generative models.
- No reliance on semantic distillation, which simplifies tokenizer training and may make it easier to scale.
Open research questions include:
- Extension to discrete/tokenized latent spaces, such as those using vector quantization (Qiu et al., 11 Mar 2025).
- Mitigation of the training–inference gap (as the decoder is trained on heavily corrupted latents but used on clean latents at deployment).
- Application to broader and more complex datasets (e.g., video, multi-modal).
- Systematic comparison to plug-and-play and residual denoising frameworks (Xu et al., 11 Feb 2025), and to the family of robust tokenizers developed for sampling error synthesis (Qiu et al., 11 Mar 2025).
A summary table of loss components and roles in l-DeTok:
Loss Term | Encourages | Typical Implementation |
---|---|---|
MSE | Pixel accuracy | Pixel-wise mean-squared error |
KL | Latent structure | VAE-style regularization |
Perceptual | Visual similarity | VGG/ConvNeXt features |
Adversarial | Realism/sharpness | GAN discriminator loss |
l-DeTok’s training regime—reconstructing targets from heavily perturbed latent embeddings—positions it to serve as a foundation for robust generative modeling across modalities where denoising is central to the generative process.