Latent Denoising Tokenizer (l-DeTok)

Updated 22 July 2025

The paper introduces l-DeTok, a tokenizer that incorporates explicit denoising into its training to better align latent embeddings with generative model requirements.
It employs an encoder-decoder architecture with interpolative Gaussian noise and random masking to generate robust latent representations from corrupted inputs.
Experimental benchmarks on ImageNet show significant FID improvements for both autoregressive and diffusion models, highlighting its broad applicability.

A Latent Denoising Tokenizer (l-DeTok) is a tokenizer architecture designed to produce latent representations that are robust to corruption and noise. Its core training objective is to align the encoder’s latent embeddings with the denoising requirements of downstream generative models, particularly those that rely on reconstructing clean signals from corrupted inputs. This approach addresses fundamental limitations in conventional tokenizers by explicitly simulating the denoising tasks encountered in generative model training, thereby facilitating improved generation quality and robustness across a variety of domains, including vision, text, and multimodal tasks (Yang et al., 21 Jul 2025).

1. Foundational Principles and Motivation

The guiding principle behind l-DeTok is that modern generative models, such as diffusion models or autoregressive (AR) transformers, are trained to reconstruct clean data from corrupted forms (e.g., masked, noised, or sampled with errors). Traditional tokenizers, typically trained with pixel-level or input-level reconstruction losses (e.g., as variational autoencoders), do not explicitly align their latent representations with the specific denoising procedures that generative models apply. l-DeTok introduces denoising directly into the tokenizer’s training, prompting the encoder to generate representations that remain informative and recoverable even when subjected to strong corruption such as interpolative Gaussian noise and random masking (Yang et al., 21 Jul 2025).

2. Objective Function and Latent Deconstruction

The l-DeTok objective is to reconstruct the original data (e.g., image) from heavily corrupted latent embeddings. The loss function employed is a weighted sum of the following terms:

Pixel-wise mean-squared error (MSE) loss: Promotes faithful image reconstruction.
KL regularization term: Structures the latent space in the same way as in a variational autoencoder.
Perceptual loss: Keeps reconstructed images perceptually close to the target, measured on features from a pretrained network such as ConvNeXt or VGG.
Adversarial loss: Encourages sharp, realistic outputs through a GAN discriminator.

The total loss takes the form

$L_{total} = L_{MSE} + \lambda_{KL} L_{KL} + \lambda_{percep} L_{percep} + \lambda_{GAN} L_{GAN}$
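
As a minimal sketch of how the weighted sum above combines the four terms, the snippet below adds already-computed scalar losses. The `LossWeights` container, the `total_loss` function, and the default weight values are hypothetical placeholders chosen for illustration; the source does not state the weights used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Placeholder values for illustration only; not taken from the paper.
    kl: float = 1e-6
    perceptual: float = 1.0
    gan: float = 0.1


def total_loss(mse, kl, perceptual, gan, weights=LossWeights()):
    """L_total = L_MSE + w_KL * L_KL + w_percep * L_percep + w_GAN * L_GAN."""
    return (
        mse
        + weights.kl * kl
        + weights.perceptual * perceptual
        + weights.gan * gan
    )
```

The MSE term carries an implicit weight of 1, so the three coefficients set how strongly each regularizer trades off against pixel accuracy.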

During training, input images are encoded into latent representations, which are then deconstructed by two primary mechanisms:

Interpolative Gaussian Noise: Each latent token $x$ is interpolated with Gaussian noise $\varepsilon(\gamma)$ via

$x' = (1 - \tau) x + \tau \varepsilon(\gamma)$

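where $\tau \sim \text{Uniform}(0, 1)$ sets the noise level and $\varepsilon(\gamma)$ is zero-mean Gaussian noise whose standard deviation is set by the hyperparameter $\gamma$ (Yang et al., 21 Jul 2025). At $\tau = 0$ the latent is left unchanged; at $\tau = 1$ it is replaced entirely by noise.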

Random Masking: A randomly selected subset of latent tokens is replaced by a mask token, at a variable masking rate $m$ sampled from a biased uniform distribution. This forces the decoder to reconstruct the full image even when a large share of the information is unavailable (Yang et al., 21 Jul 2025).

These strategies require the decoder to “denoise” latent codes, directly regularizing the encoder to produce robust and reconstructable representations.
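
The two corruption mechanisms can be sketched in plain Python on a list of token vectors. This is an illustrative sketch, not the authors' implementation: the function names, the `MASK` placeholder (standing in for a learned mask token), and the plain uniform draw for the masking rate (a simplification of the biased uniform distribution mentioned above) are all assumptions.

```python
import random

MASK = None  # hypothetical stand-in for a learned mask-token embedding


def interpolate_noise(tokens, gamma, rng):
    """Apply x' = (1 - tau) * x + tau * eps with eps ~ N(0, gamma^2)."""
    tau = rng.random()  # one noise level in [0, 1) shared by the whole sample
    noised = [
        [(1.0 - tau) * value + tau * rng.gauss(0.0, gamma) for value in token]
        for token in tokens
    ]
    return noised, tau


def random_mask(tokens, max_mask_ratio, rng):
    """Replace a random subset of tokens with MASK."""
    # Simplification: plain uniform masking rate in [0, max_mask_ratio].
    ratio = rng.uniform(0.0, max_mask_ratio)
    num_masked = int(ratio * len(tokens))
    masked_ids = set(rng.sample(range(len(tokens)), num_masked))
    masked = [MASK if i in masked_ids else t for i, t in enumerate(tokens)]
    return masked, masked_ids
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible, which helps when checking that a decoder sees the intended noise level and masking rate.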

3. Methodology and Implementation

l-DeTok typically adopts an encoder–decoder architecture, often implemented with Vision Transformers (ViTs) as the backbone. During training:

The encoder converts an input (e.g., an image patch sequence) into latent tokens.
The tokens are subjected to interpolative noise and/or random masking.
The decoder reconstructs the original input from these corrupted latents.

A simplified implementation is as follows:

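import torch

def denoise(x, encoder, decoder, max_mask_ratio=0.7, gamma=3.0):
    # Encode image with optional masking; ids_restore lets the decoder
    # put tokens back in their original order
    z, ids_restore = encoder(x, max_mask_ratio=max_mask_ratio)
    # Variational latent embeddings
    # (diagonal_gaussian_dist is assumed to be defined elsewhere)
    posteriors = diagonal_gaussian_dist(z)
    z_sampled = posteriors.sample()
    # Uniform sampling of noise level τ in [0, 1), one value per batch element;
    # z_sampled is assumed to have shape (batch, tokens, channels)
    batch_size = z_sampled.shape[0]
    noise_level = torch.rand(batch_size, 1, 1, device=z_sampled.device).expand_as(z_sampled)
    # Generate Gaussian noise scaled by gamma
    noise = gamma * torch.randn_like(z_sampled)
    # Interpolate: x' = (1 - τ) x + τ ε(γ)
    z_noised = (1 - noise_level) * z_sampled + noise_level * noise
    # Reconstruction from corrupted latents
    recon = decoder(z_noised, ids_restore)
    return recon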

(Yang et al., 21 Jul 2025)

At inference, the encoder emits clean latent tokens; noise and masking are disabled, ensuring full signal fidelity for downstream applications.
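
The train/inference switch described above can be sketched as a single flag that bypasses corruption. The function `tokenizer_latents`, its `training` flag, and the flat list of scalar latents are hypothetical simplifications for illustration, not the paper's interface.

```python
import random


def tokenizer_latents(latents, training, gamma=3.0, rng=None):
    """Return noised latents in training mode and untouched latents otherwise."""
    if not training:
        # Inference: emit clean latents, no noise and no masking.
        return list(latents)
    rng = rng or random.Random()
    tau = rng.random()  # noise level in [0, 1)
    return [(1.0 - tau) * z + tau * rng.gauss(0.0, gamma) for z in latents]
```

Keeping the bypass in one place makes it harder to leave corruption enabled by accident when the tokenizer is used to produce latents for a downstream generator.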

4. Benchmark Results and Performance Analysis

Experiments on ImageNet 256×256 show that l-DeTok outperforms standard tokenizers across a range of representative generative models, including:

Non-autoregressive diffusion models (e.g., DiT, SiT, LightningDiT)
Autoregressive models (e.g., MAR, RandomAR, RasterAR)

Notable findings include:

Consistent reduction in Fréchet Inception Distance (FID) compared to standard tokenizers, for both AR and non-AR models.
For example, with the MAR model and classifier-free guidance, FID improved from 3.31 to 2.65; for an extended model (MAR-B), FID improved from 2.31 to 1.55 (Yang et al., 21 Jul 2025).
Robust generalization across model classes: while semantic-distilled tokenizers benefit some model types, l-DeTok yields robust improvements regardless of whether the downstream model is AR or non-AR.

These results underscore the effectiveness of aligning the tokenizer’s latent space with the denoising demands of the generative task.
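
To put the FID figures quoted above on a common scale, the short calculation below converts each before/after pair into a relative reduction (lower FID is better). Only the four FID values come from the text; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(before, after):
    """Fractional decrease of a lower-is-better metric such as FID."""
    return (before - after) / before


# FID pairs as quoted in the text: (label, baseline tokenizer, l-DeTok)
reported = [
    ("MAR with classifier-free guidance", 3.31, 2.65),
    ("MAR-B", 2.31, 1.55),
]

for label, before, after in reported:
    print(f"{label}: {relative_reduction(before, after):.1%} lower FID")
```

On these numbers the reductions come to roughly 20% and 33% respectively.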

5. Theoretical Context and Connections to Related Work

The rationale for l-DeTok sits within a broader trend in generative modeling: most state-of-the-art models for vision or text are trained with objectives that involve denoising corrupted input, whether through masking, noise injection, or sequential prediction. The l-DeTok design thus provides an explicit encoder regularization aligned with this core paradigm.

The approach relates conceptually to denoising autoencoders for text, where corruption is applied to sentences and the model reconstructs the original (see also denoising adversarial autoencoders in text generation (Shen et al., 2019)), and to latent denoising via residual mapping in communication systems (Xu et al., 11 Feb 2025). l-DeTok differs in that it combines an interpolative noise process with per-token variable strengths and random masking to train ViT-based encoder–decoders explicitly for generative denoising demands (Yang et al., 21 Jul 2025).

6. Practical Impact, Limitations, and Future Directions

l-DeTok shifts the emphasis of tokenizer design from reconstruction accuracy alone to explicit robustness against latent corruption, a property directly relevant to the operating regime of diffusion and AR generative models. The unified denoising training principle offers several practical advantages:

Improved generation quality across a diverse array of downstream generative models.
No reliance on semantic distillation, which simplifies tokenizer training and may make it easier to scale.

Open research questions include:

Extension to discrete/tokenized latent spaces, such as those using vector quantization (Qiu et al., 11 Mar 2025).
Mitigation of the training–inference gap (as the decoder is trained on heavily corrupted latents but used on clean latents at deployment).
Application to broader and more complex datasets (e.g., video, multi-modal).
Systematic comparison to plug-and-play and residual denoising frameworks (Xu et al., 11 Feb 2025), and to the family of robust tokenizers developed for sampling error synthesis (Qiu et al., 11 Mar 2025).

A summary table of loss components and roles in l-DeTok:

Loss Term	Encourages	Typical Implementation
MSE	Pixel accuracy	$\|x - x_{recon}\|^2$
KL	Latent structure	VAE-style regularization
Perceptual	Visual similarity	VGG/ConvNeXt features
Adversarial	Realism/sharpness	GAN discriminator loss

(Yang et al., 21 Jul 2025)

l-DeTok’s training regime—reconstructing targets from heavily perturbed latent embeddings—positions it to serve as a foundation for robust generative modeling across modalities where denoising is central to the generative process.

References

Yang et al. (2025). Latent Denoising Makes Good Visual Tokenizers.
Shen et al. (2019). Educating Text Autoencoders: Latent Representation Guidance via Denoising.
Xu et al. (2025). Learnable Residual-Based Latent Denoising in Semantic Communication.
Qiu et al. (2025). Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis.
