PixelFlow: Pixel-Space Generative Models with Flow
PixelFlow departs from the prevailing latent-space paradigm in image generation by operating directly in raw pixel space. This simplifies the generation pipeline: no pre-trained variational autoencoder (VAE) is required, so the entire model is trainable end to end. The central idea is an efficient cascade flow formulation that keeps the computational cost of working in pixel space manageable.
Numerically, PixelFlow reports a Fréchet Inception Distance (FID) of 1.98 on the 256x256 ImageNet class-conditional image generation benchmark. This result places it alongside strong existing generative models and establishes pixel-space flow modeling as a credible route to high-quality image synthesis.
Key Features and Methodology
PixelFlow uses the Flow Matching algorithm to generate images end to end in raw pixel space. Generation follows a multi-scale procedure: sampling starts from high-noise, low-resolution inputs and proceeds through several resolution stages, each one denoising further and increasing resolution until a fully denoised, high-resolution image is produced. Because there is no separate VAE, the usual split between autoencoder and diffusion components disappears and the whole model can be optimized jointly.
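The multi-scale sampling loop can be sketched as follows. This is a minimal, hypothetical illustration rather than PixelFlow's actual code: it assumes the model predicts a velocity field, that flow time is partitioned across resolution stages, and that the intermediate sample is upsampled and re-noised at each stage boundary. The stage schedule, the linear re-noising rule, and the Euler integrator are all assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascade_flow_sample(velocity_model, stages, steps_per_stage=32, channels=3, device="cpu"):
    """stages: list of (resolution, t_start, t_end), with flow time t running from 0 (noise) to 1 (image)."""
    res0 = stages[0][0]
    x = torch.randn(1, channels, res0, res0, device=device)  # pure noise at the lowest resolution
    for res, t_start, t_end in stages:
        if x.shape[-1] != res:
            # Enter the next stage: upsample the intermediate sample, then blend fresh noise
            # back in so it matches this stage's (noisier) starting time. The linear
            # re-noising rule below is an assumption made for illustration.
            x = F.interpolate(x, size=(res, res), mode="bilinear", align_corners=False)
            noise = torch.randn_like(x)
            x = t_start * x + (1.0 - t_start) * noise
        ts = torch.linspace(t_start, t_end, steps_per_stage + 1, device=device)
        for i in range(steps_per_stage):
            t, t_next = ts[i], ts[i + 1]
            v = velocity_model(x, t.expand(x.shape[0]), res)  # model predicts velocity dx/dt
            x = x + (t_next - t) * v                          # Euler step along the learned flow
    return x


if __name__ == "__main__":
    # Zero-velocity stand-in so the sketch runs end to end without trained weights.
    dummy_model = lambda x, t, res: torch.zeros_like(x)
    stages = [(32, 0.0, 0.25), (64, 0.25, 0.5), (128, 0.5, 0.75), (256, 0.75, 1.0)]
    sample = cascade_flow_sample(dummy_model, stages, steps_per_stage=8)
    print(sample.shape)  # torch.Size([1, 3, 256, 256])
```

The appeal of cascading is visible in the structure of the loop: most integration steps run at low resolution, and only the final stage pays the full-resolution token cost, which is what keeps generation in pixel space affordable.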
The architecture builds on a Transformer backbone, specifically the Diffusion Transformer (DiT), adapted to pixel-space generation. The key modifications are listed below, followed by a brief illustrative sketch:
- Patchify: raw pixel inputs are converted into token sequences, following the Vision Transformer (ViT) approach, so the same backbone can handle varying image resolutions.
- Rotary Positional Embedding (RoPE): position information that generalizes across the sequence lengths encountered at different resolution stages.
- Resolution Embedding: a learned embedding that tells the model which resolution stage it is operating at, easing the transition between stages during multi-scale generation.
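To make these components concrete, the sketch below shows a hypothetical pixel-space input pipeline in PyTorch: a ViT-style patchify layer, a learned per-stage resolution embedding, and a standard 1D RoPE helper. The module name (`PixelTokenizer`), the hyperparameters, and the use of 1D rather than 2D rotary embeddings are illustrative assumptions, not details taken from PixelFlow's released implementation.

```python
import math
import torch
import torch.nn as nn

def apply_rope(x):
    """Standard 1D rotary positional embedding over the sequence dimension.
    x: (batch, seq_len, dim), dim must be even."""
    _, n, d = x.shape
    half = d // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device) / half)
    angles = torch.arange(n, device=x.device)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class PixelTokenizer(nn.Module):
    """Patchify raw pixels into a token sequence and tag it with a resolution embedding."""
    def __init__(self, patch_size=2, in_channels=3, dim=256, num_resolutions=4):
        super().__init__()
        # ViT-style patchify: a strided convolution turns each patch into one token.
        self.patch = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        # One learned embedding per resolution stage, so the model knows which stage it is in.
        self.resolution_emb = nn.Embedding(num_resolutions, dim)

    def forward(self, pixels, resolution_id):
        tokens = self.patch(pixels)                  # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, dim)
        return tokens + self.resolution_emb(resolution_id)[:, None, :]


if __name__ == "__main__":
    tokenizer = PixelTokenizer()
    images = torch.randn(2, 3, 64, 64)     # a mid-resolution stage, for example
    stage_id = torch.tensor([1, 1])        # index of that resolution stage
    tokens = tokenizer(images, stage_id)   # (2, 1024, 256)
    # Inside a Transformer block, RoPE would be applied to the query/key projections;
    # it is shown here on plain tensors of the same shape for brevity.
    q, k = apply_rope(tokens), apply_rope(tokens)
    print(tokens.shape, q.shape, k.shape)
```

Because patchify and the resolution embedding are both resolution-agnostic, the same token pipeline serves every stage of the cascade without architectural changes.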
On the ImageNet 256x256 benchmark, PixelFlow performs competitively against both latent-space and pixel-space models. It delivers high-fidelity synthesis, preserving fine detail without relying on a pre-trained VAE. Against latent-space models such as LDM and DiT, PixelFlow achieves comparable FID scores, and it also holds its ground against state-of-the-art pixel-space models.
PixelFlow's design choices are validated through extensive ablations over architectural and hyperparameter configurations, including the kickoff sequence length and the patch size, to balance sample quality against computational cost.
Implications and Future Directions
PixelFlow has broader implications for the generative-modeling landscape, pointing toward more streamlined, efficient, and integrated image generation frameworks. By operating directly in pixel space, it opens new lines of inquiry into fully end-to-end optimized generative processes and challenges the current reliance on latent-variable decompositions.
Future research building on PixelFlow could explore more sophisticated flow-based architectures, including real-time, adaptive resolution sampling techniques. The flexibility of the Flow Matching mechanism also suggests applications to other modalities such as video, audio, and 3D generation.
In summary, PixelFlow is both a promising avenue for image synthesis and a useful template for re-engineering generative models around pixel space. By pursuing simplicity without sacrificing quality, it encourages further exploration beyond conventional latent-space frameworks.