- The paper introduces a novel two-stage framework that leverages self-supervised pre-training and end-to-end fine-tuning to bypass the complexity of VAEs.
- The methodology employs contrastive and representation consistency losses, achieving state-of-the-art FID scores for both diffusion and consistency models on ImageNet.
- Empirical results demonstrate improved efficiency and scalability, including one-step high-resolution generation that outperforms prior pixel-space approaches.
End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training
Introduction and Motivation
Pixel-space generative models, particularly diffusion and consistency models, have historically lagged behind their latent-space counterparts in both sample quality and computational efficiency. The dominant paradigm leverages pre-trained VAEs to compress the data manifold, enabling more tractable and efficient generative modeling. However, this introduces significant complexity: VAE training is nontrivial, reconstructions are imperfect for out-of-distribution latents, and the overall pipeline requires maintaining and optimizing multiple large models. The paper "Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training" (arXiv:2510.12586) addresses these limitations by proposing a two-stage training framework that enables high-fidelity, efficient pixel-space generation without reliance on VAEs or external models.
Methodology
Two-Stage Training Framework
The proposed approach decomposes the generative modeling pipeline into two distinct stages:
- Self-Supervised Pre-Training of the Encoder: The encoder is trained to capture robust visual semantics from images at varying noise levels, using a combination of contrastive and representation consistency losses. The contrastive loss (InfoNCE) encourages semantic alignment between augmented views of clean images, while the representation consistency loss enforces alignment between temporally adjacent points along the same deterministic sampling trajectory (i.e., points connected by the probability flow ODE of the diffusion process).
- End-to-End Fine-Tuning with a Decoder: After pre-training, the encoder is combined with a randomly initialized decoder and fine-tuned end-to-end for either diffusion or consistency modeling. The decoder is conditioned on the encoder's representations and trained to generate pixels directly, using standard denoising or consistency objectives. A minimal code sketch of this stage follows the list.
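To make the second stage concrete, here is a minimal sketch of how the encoder and decoder could be wired together for end-to-end fine-tuning. The class name, module interfaces, and the way conditioning is passed are assumptions made for illustration, not the paper's actual architecture.

```python
# Minimal sketch (assumed interfaces, not the paper's implementation):
# the pre-trained encoder extracts features from the noisy input, and a
# freshly initialized decoder predicts pixels conditioned on them.
import torch
import torch.nn as nn

class PixelGenerator(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # stage 1: pre-trained with SSL objectives
        self.decoder = decoder   # stage 2: randomly initialized

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x_t, t)          # semantic features of the noisy image
        return self.decoder(x_t, t, h)    # pixel prediction conditioned on h

# Stage 2 training (schematic): both modules receive gradients from a
# standard denoising or consistency objective, e.g.
#   pred = model(x_t, t)
#   loss = generative_loss(pred, target)   # denoising / consistency loss
```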

Figure 1: Overview of the two-stage training framework: (Left) self-supervised pre-training with contrastive and representation consistency losses; (Right) end-to-end fine-tuning for generative modeling.
Representation Consistency Learning
The pre-training objective is:
$$
\mathbb{E}\Big[\,\mathrm{InfoNCE}\big(L_\theta(E_\theta(x_1, t_0)),\; L_{\theta^-}(E_{\theta^-}(x_2, t_0))\big) \;+\; \mathrm{InfoNCE}\big(E_\theta(x_{t_n}, t_n),\; E_{\mathrm{sg}(\theta)}(x_{t_{n-1}}, t_{n-1})\big)\Big]
$$
where $x_1, x_2$ are augmented views of the same clean image, $x_{t_n}, x_{t_{n-1}}$ are temporally adjacent points on the same ODE trajectory, $E_\theta$ is the encoder, $L_\theta$ is a projector, $\theta^-$ is an EMA of $\theta$, and $\mathrm{sg}(\cdot)$ denotes stop-gradient. The temperature parameter $\tau$ of the InfoNCE loss is annealed to stabilize early training and improve representation quality at high noise levels.
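A hedged code sketch of this objective is given below. The encoder/projector call signatures, the EMA bookkeeping, and the helper names are assumptions chosen for illustration and may differ from the paper's implementation.

```python
# Sketch of the pre-training objective (assumed interfaces):
# term 1 aligns two augmented clean views (online vs. EMA branch),
# term 2 aligns adjacent points on the same probability-flow-ODE trajectory.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, k: torch.Tensor, tau: float) -> torch.Tensor:
    """InfoNCE with in-batch negatives; matching rows of q and k are positives."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / tau                                  # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def pretrain_loss(encoder, projector, encoder_ema, projector_ema,
                  x1, x2, x_tn, x_tnm1, t0, tn, tnm1, tau):
    # Term 1: contrastive alignment of augmented views; the target branch
    # uses EMA copies of the encoder and projector (theta^-).
    z_online = projector(encoder(x1, t0))
    with torch.no_grad():
        z_target = projector_ema(encoder_ema(x2, t0))
    loss_views = info_nce(z_online, z_target, tau)

    # Term 2: representation consistency between adjacent trajectory points;
    # the earlier point acts as a stop-gradient target, i.e. E_{sg(theta)}.
    h_n = encoder(x_tn, tn)
    with torch.no_grad():
        h_prev = encoder(x_tnm1, tnm1)
    loss_traj = info_nce(h_n, h_prev, tau)

    # tau is annealed over training per the paper's temperature schedule.
    return loss_views + loss_traj
```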
Fine-Tuning and Auxiliary Loss
During fine-tuning, the projector is discarded. The encoder and decoder are trained jointly for the generative task. For consistency models, an auxiliary contrastive loss is introduced between the model output and the clean image, using a frozen copy of the pre-trained encoder. This provides additional supervision and accelerates convergence, especially in the early stages of training.
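A rough sketch of how such an auxiliary term could look is shown below. The function name, the temperature value, the timestep fed to the frozen encoder, and the loss weighting are assumptions for illustration rather than details confirmed by the paper.

```python
# Illustrative auxiliary loss for consistency-model fine-tuning: a frozen copy
# of the pre-trained encoder embeds both the model's output and the clean
# image, and an InfoNCE term pulls the paired embeddings together.
import torch
import torch.nn.functional as F

def auxiliary_contrastive_loss(frozen_encoder, x_pred, x_clean, t0,
                               tau: float = 0.1) -> torch.Tensor:
    with torch.no_grad():
        target = F.normalize(frozen_encoder(x_clean, t0), dim=-1)   # clean-image anchor
    pred = F.normalize(frozen_encoder(x_pred, t0), dim=-1)          # gradients flow to x_pred
    logits = pred @ target.t() / tau
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)

# Combined fine-tuning objective (schematic):
#   loss = consistency_loss + lambda_aux * auxiliary_contrastive_loss(
#              frozen_encoder, model_output, x_clean, t0)
```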
Empirical Results
Diffusion Models
The EPG (End-to-end Pixel-space Generative) model achieves an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 using only 75 function evaluations (NFE = 75), outperforming all prior pixel-space methods and matching or surpassing leading latent-space models at comparable training cost.
Figure 2: (Left) EPG achieves state-of-the-art generation quality with lower inference cost compared to both pixel-space and latent-space models. (Right) Images generated by the EPG diffusion model.
Consistency Models
The EPG consistency variant achieves an FID of 8.82 on ImageNet-256 in a single sampling step, significantly outperforming latent-space consistency models and prior pixel-space approaches. This is the first demonstration of successful high-resolution, one-step pixel-space generation without reliance on pre-trained VAEs or diffusion models.
Figure 3: Images generated by EPG-L via one-step sampling, demonstrating high-fidelity synthesis in a single forward pass.
Scaling and Efficiency
The framework exhibits strong scaling properties: increasing pre-training batch size and model parameters leads to monotonic improvements in downstream generative performance.
Figure 4: Downstream generative model performance scales with pre-training compute budgets, indicating efficient utilization of additional resources.
Qualitative Results
Uncurated samples generated by EPG-XL across various classes and guidance scales demonstrate high diversity and fidelity, with visual quality competitive with state-of-the-art latent diffusion models.
Figure 5: Uncurated samples from EPG-XL (class 1, guidance 4.5).
Figure 6: Uncurated samples from EPG-XL (class 22, guidance 4.5).
Figure 7: Uncurated samples from EPG-XL (class 89, guidance 2.5).
Ablation and Analysis
Ablation studies confirm that both the representation consistency loss and the auxiliary contrastive loss are critical for stable training and optimal generative performance. The temperature schedule for the InfoNCE loss is shown to be important for avoiding early training collapse. Comparisons with alternative pre-training strategies (e.g., MoCo v3, REPA, rRCM) demonstrate that the proposed method yields superior downstream FID scores in both diffusion and consistency modeling regimes.
Theoretical and Practical Implications
This work demonstrates that with appropriate self-supervised pre-training, pixel-space generative models can match or exceed the performance of latent-space models, while eliminating the need for complex VAE pipelines. The approach is theoretically grounded in the alignment of representations along diffusion trajectories and empirically validated at scale. The method is modular, requiring no external models, and is compatible with both diffusion and consistency modeling frameworks.
Practically, this enables more streamlined, efficient, and robust generative modeling pipelines, particularly for high-resolution image synthesis. The framework is also well-suited for extension to multi-modal and conditional generation tasks, given its strong scaling properties and independence from VAE bottlenecks.
Future Directions
Potential avenues for further research include:
- Incorporating external supervision (e.g., off-the-shelf SSL models) to further accelerate training.
- Extending the framework to multi-modal or text-conditional generation.
- Investigating the integration with flow matching and other generative paradigms.
- Exploring the limits of scaling in terms of model size, data, and compute.
Conclusion
The two-stage self-supervised pre-training and end-to-end fine-tuning framework presented in this work closes the longstanding performance and efficiency gap between pixel-space and latent-space generative models. By decoupling semantic representation learning from pixel generation and leveraging robust contrastive and consistency objectives, the method achieves state-of-the-art results in both diffusion and consistency modeling, with strong scaling and practical efficiency. This establishes a new baseline for pixel-space generative modeling and provides a foundation for future advances in high-fidelity, efficient image synthesis.